Elena Capatina
elena.statahelp@gmail.com
Office hours:
Mondays 2-4pm, GE313
Wednesdays 2-4pm, GE213 (starting Feb 25th)
or
http://www.utoronto.ca/ic/software/detail/stata.html
Finding Data
http://datacentre.chass.utoronto.ca/
For some data, you need to connect from campus, and sometimes you need to go to the data centre in person.
Off campus, access is through UTORvpn (download it from the library website).
Data on CHASS
Canadian data:
CANSIM
Census Analyser
Data on CHASS
Financial markets data:
CRSP Database: NYSE/AMEX/Nasdaq daily and monthly security prices and other historical data for over 20,000 companies
Canadian Financial Markets Research Centre: Toronto Stock Exchange trading information about specific securities
Fundata Mutual Fund Database
Data on CHASS
Company financial data:
Financial Post Corporate Database
COMPUSTAT Database: Income Statement, Balance Sheet, Flow of Funds, and supplemental data items on more than 10,000 active and 9,400 inactive companies
STATA windows
The command window
The viewer/results window
The review of commands window
The variable window
The do file
A text file that can be edited using any text editor (the STATA do-file editor, Notepad, Word, etc.), but you need to save it as filename.do for STATA to read it
From the STATA do-file editor, click do for STATA to execute all commands
You can highlight lines and click do to execute only the highlighted command lines
Type of commands
1. Administrative commands that tell STATA where to save results, how to manage computer memory, and so forth
2. Commands that tell STATA to read and manage datasets
3. Commands that tell STATA to modify existing variables or to create new variables
4. Commands that tell STATA to carry out the statistical analysis
Example: stata1.do
clear
log using "C:\Users\Elena\Documents\Various\STATATA\stata1.log", replace
use "C:\Users\Elena\Documents\Various\STATATA\caschool.dta"
describe
generate income = avginc*1000
summarize income
log close
exit
infix command
Use if you prefer to copy and paste your dictionary in a do file
Useful Commands:
describe :
STATA lists all the variables with their labels and storage types, and reports the number of observations
More commands:
generate or gen
Creates a new variable
e.g. generate income = avginc*1000
e.g. generate log_inc = log(income)
e.g. gen inc_sq = (income)^2
More commands:
summarize
Tells STATA to compute summary statistics (mean, standard deviation, and so forth) for all variables, or only for the variables you list
Useful to identify outliers and get an idea of your data
e.g. summarize
e.g. summ income inc_sq
Example: stata2.do
# delimit ;
* Administrative Commands;
set more off;
clear;
log using "C:\Users\Elena\Documents\Various\STATA-TA\stata1.log", replace;
* Read in the Dataset;
use "C:\Users\Elena\Documents\Various\STATA-TA\caschool.dta";
describe;
* Transform data and Create New Variables;
**** Construct average district income in $'s;
generate income = avginc*1000;
* Carry Out Statistical Analysis;
***** Summary Statistics for Income;
summarize income;
* End the Program ;
log close;
exit;
Useful commands
# delimit ;
Tells STATA that each command ends with a semicolon, which is useful for long commands that span several lines.
Do not forget the ; at the end of every command, including comment lines that start with *.
Useful Commands
set more off
Tells STATA not to pause when the results window fills up. Otherwise, if the output is long, STATA displays --more-- at the bottom and waits for a key press before continuing, so a do file does not run through on its own.
Increasing memory
set memory 600m
You may also need to increase the number of variables allowed by Stata; this cannot be done with IC Stata
Other commands
tabulate
e.g. tabulate county
Shows the frequency and percent of each value of county in the dataset
The if command
e.g. generate teachers_new = teachers if teachers<=10
replace teachers_new = 0 if teachers>10
e.g. summarize teachers if county=="Nevada"
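The if qualifier can be tried out on any dataset. The following is a minimal sketch on the auto dataset that ships with Stata (an assumption; the slides use caschool.dta), and the variable name mpg_low is made up for illustration. Two details are worth keeping in mind: string values must be wrapped in double quotes, and Stata treats a missing value as larger than any number, so a condition such as x>10 is also true when x is missing.

```stata
* Sketch only: uses the auto dataset shipped with Stata, not caschool.dta
sysuse auto, clear

* Keep mpg where it is at most 20; all other observations start as missing
generate mpg_low = mpg if mpg <= 20

* Fill the remaining observations with 0, excluding missing mpg explicitly
replace mpg_low = 0 if mpg > 20 & !missing(mpg)

* String comparisons need double quotes around the value
summarize price if make == "Toyota Corolla"

* rep78 has missing values; without !missing() they would count as > 3
count if rep78 > 3 & !missing(rep78)
```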
Operators
<   less than
>   greater than
<=  less than or equal to
>=  greater than or equal to
==  equal to
~=  not equal to
The by command
e.g. by county, sort: summarize income
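The by prefix repeats a command once per group. As a rough sketch on the auto dataset that ships with Stata (an assumption; the grouping variable foreign and the new variable n_in_group are not from the slides), the same prefix can also be combined with generate:

```stata
* Sketch only: auto dataset shipped with Stata
sysuse auto, clear

* Form used on the slide: sort inside the by prefix
by foreign, sort: summarize price

* bysort is shorthand for "by ..., sort:"
bysort foreign: summarize price

* by also works with generate; _N is the number of rows in each by-group
bysort foreign: generate n_in_group = _N
```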
Combining datasets
merge command
use "My Statistics\_respondent.dta", clear;
sort ID;
merge ID using "My Statistics\_annualfile.dta";
sort ID year;
merge (ID year) using "temp1.dta";
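The merge lines above use the syntax of older Stata releases. From Stata 11 onward, the documented form names the match type (1:1, m:1 or 1:m) and does not require the data to be sorted first. A minimal sketch of that newer form, with two small made-up datasets (all variable names and values here are hypothetical):

```stata
* Sketch only: hypothetical data, Stata 11+ merge syntax
clear
input id age
1 30
2 41
3 25
end
tempfile respondent
save `respondent'

clear
input id year income
1 2001 100
1 2002 110
2 2001 200
3 2002 150
end

* Many yearly rows per person match one respondent row: m:1
merge m:1 id using `respondent'

* _merge records where each row came from (3 = found in both files)
tabulate _merge
drop _merge
```

Dropping _merge after checking it keeps a later merge from failing because the variable already exists.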
Statistical relationships
1. Correlations: correlate
e.g. correlate income teachers
e.g. correlate income teachers computers
2. Regressions: reg
e.g. reg income teachers
e.g. reg income teachers computers
Graphs
Scatter Plots
e.g. twoway (scatter income computer)
Loops
forvalues
Generate 100 uniform random variables named x1, x2, ..., x100:
forvalues i = 1(1)100 {
    generate x`i' = uniform()
}
Divide a dataset into two datasets, each with a different education level:
forvalues e = 1/2 {;
    use "My Statistics\Maleyearly.dta", clear;
    keep if education==`e';
    save "My Statistics\Males_`e'.dta", replace;
};
Collapse command
Creates a new dataset with the specified variables summarizing current data
e.g. collapse (mean) no_kids, by(education age status);
You can export your data in another format from File, then Export, then choose the type of file you want.
Wide data:
id   grp   lrn95   lrn96   lrn97
1    1     7       12      16
2    1     13      14      15
3    2     15      14      20
4    2     21      27      24
5    3     9       9       12
6    3     16      17      18
Reshape
Example of long data:
id   year   grp   lrn
1    95     1     7
1    96     1     12
1    97     1     16
2    95     1     13
2    96     1     14
2    97     1     15
3    95     2     15
3    96     2     14
3    97     2     20
4    95     2     21
4    96     2     27
4    97     2     24
5    95     3     9
5    96     3     9
5    97     3     12
6    95     3     16
6    96     3     17
6    97     3     18
Reshape
From wide to long:
e.g. reshape long lrn, i(id) j(year)
Source: http://www.ats.ucla.edu/stat/stata/notes/reshape.htm
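To see the reshape in action, the wide table from the slides can be typed in with input and converted in both directions. This is only a sketch: the data values are the ones shown above, and the round trip back to wide is added here for illustration.

```stata
* Sketch only: types in the wide table from the slides, then reshapes it
clear
input id grp lrn95 lrn96 lrn97
1 1  7 12 16
2 1 13 14 15
3 2 15 14 20
4 2 21 27 24
5 3  9  9 12
6 3 16 17 18
end

* Wide to long: one row per id-year; the suffix 95/96/97 becomes year
reshape long lrn, i(id) j(year)
list in 1/3

* Long back to wide restores the original layout
reshape wide lrn, i(id) j(year)
```

In reshape long, i() names the variable that identifies a unit and j() names the new variable that receives the suffix stripped from lrn95, lrn96 and lrn97.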
tabstat command
e.g. by ay: tabstat N if INC==2 & education1==1, s(n mean max min p50 p25 p75);
egen command
Extended generate command.
More powerful than generate
Examples:
egen age_mean = mean(age), by(year)
egen totalsum = total(x)
egen stdage = std(age)
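One way to see what egen adds over generate is to compare sum() under generate, which gives a running sum, with total() under egen, which puts the overall sum on every row. The following sketch uses a small made-up dataset (the values and the names running_sum, age_total and age_std are hypothetical):

```stata
* Sketch only: hypothetical data to contrast generate and egen
clear
input year age
2000 20
2000 30
2001 40
2001 60
end

* generate with sum() gives a running (cumulative) sum down the rows
generate running_sum = sum(age)

* egen mean() with by(): group mean repeated on every row of that year
egen age_mean = mean(age), by(year)

* egen total(): overall sum repeated on every row
egen age_total = total(age)

* egen std(): standardized age, (age - mean) / standard deviation
egen age_std = std(age)

list
```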
Lagged variables
[_n-1] tells STATA this is the previous observation
[_n-2] is 2 observations before
Examples (assuming the data is sorted):
gen GDP_lagged = GDP[_n-1]
gen GDP_2 = GDP[_n-2]
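With panel data, a plain [_n-1] would reach across units, so the first year of one country would pick up the last year of the previous one. A minimal sketch of lagging within each unit, using a made-up two-country panel (country codes, years and GDP values are hypothetical):

```stata
* Sketch only: hypothetical two-country panel
clear
input str2 country year GDP
"CA" 2001 10
"CA" 2002 12
"CA" 2003 15
"US" 2001 100
"US" 2002 104
"US" 2003 110
end

* Sort by year within country, then lag inside each country only.
* The first year of each country has no previous row, so it is missing.
bysort country (year): generate GDP_lagged = GDP[_n-1]

* Year-over-year change, also missing in each country's first year
generate GDP_change = GDP - GDP_lagged

list, sepby(country)
```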
Shortcut: local
Example using local :
local t = 80;
while `t'<=98 {;
    use "Tagsets\status_`t'.dta", clear;
    do "rename_status.do";
    save "weeklyfile_`t'.dta";
    local t = `t'+2;
};
Collapse command
Initial data:
famid   kidname   birth   age   wt
1       Beth      1       9     60
1       Bob       2       6     40
1       Barb      3       3     20
2       Andy      1       8     80
2       Al        2       6     50
2       Ann       3       2     20
3       Pete      1       6     60
3       Pam       2       4     40
Goal: create one record per family (famid) with the average age (called avgage) and average weight (called avgwt) within each family, and the number of kids (numkids) per family
Source: http://www.ats.ucla.edu/stat/Stata/modules/collapse.htm
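The slide states the goal but the command itself did not survive. One way to write it, as a sketch that types in the kids data shown above (counting the nonmissing values of birth to get numkids is an assumption about how the count is obtained):

```stata
* Sketch only: types in the kids data from the slides
clear
input famid str4 kidname birth age wt
1 "Beth" 1 9 60
1 "Bob"  2 6 40
1 "Barb" 3 3 20
2 "Andy" 1 8 80
2 "Al"   2 6 50
2 "Ann"  3 2 20
3 "Pete" 1 6 60
3 "Pam"  2 4 40
end

* One row per family: mean age, mean weight, and a count of children
collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)

list
```

Note that collapse replaces the data in memory, so save the original file first if it is still needed.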
Tests of significance
e.g. ttest sysbp = 122.3, level(95)
computes the sample mean and standard deviation of the variable sysbp, computes a t-test that the population mean is equal to 122.3, and computes a 95% confidence interval for the population mean
Source: mhtml:http://www.biostat-edu.com/files/Stata_Program_Notes_Chapter_8posted.mht
STATA output
. ttest sysbp = 122.3;

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   sysbp |     199    125.8241    1.288642    18.17853    123.2829    128.3653
------------------------------------------------------------------------------
    mean = mean(sysbp)                                            t =   2.7348
Ho: mean = 122.3                                 degrees of freedom =      198

   Ha: mean < 122.3           Ha: mean != 122.3           Ha: mean > 122.3
 Pr(T < t) = 0.9966      Pr(|T| > |t|) = 0.0068       Pr(T > t) = 0.0034

ttest testscr_lo=testscr_hi, unequal unpaired;

Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
testsc~o |     238    657.3513    1.254794    19.35801    654.8793    659.8232
testsc~i |     182    649.9788    1.323379    17.85336    647.3676    652.5901
---------+--------------------------------------------------------------------
combined |     420    654.1565    .9297082    19.05335    652.3291     655.984
---------+--------------------------------------------------------------------
    diff |             7.37241    1.823689                3.787296    10.95752
------------------------------------------------------------------------------
    diff = mean(testscr_lo) - mean(testscr_hi)                    t =   4.0426
Ho: diff = 0                     Satterthwaite's degrees of freedom =  403.607

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 0.0000
Simple regression
regress science math female socst read
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  4,   195) =   46.69
       Model |  9543.72074     4  2385.93019           Prob > F      =  0.0000
    Residual |  9963.77926   195  51.0963039           R-squared     =  0.4892
-------------+------------------------------           Adj R-squared =  0.4788
       Total |     19507.5   199  98.0276382           Root MSE      =  7.1482

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   .3893102   .0741243     5.25   0.000      .243122    .5354983
      female |  -2.009765   1.022717    -1.97   0.051    -4.026772    .0072428
       socst |   .0498443    .062232     0.80   0.424    -.0728899    .1725784
        read |   .3352998   .0727788     4.61   0.000     .1917651    .4788345
       _cons |   12.32529   3.193557     3.86   0.000     6.026943    18.62364
------------------------------------------------------------------------------
Source: http://www.ats.ucla.edu/stat/stata/output/reg_output.htm
t and P>|t|: these columns provide the t-value and two-tailed p-value used in testing the null hypothesis that the coefficient (parameter) is 0.
[95% Conf. Interval]: this shows a 95% confidence interval for the coefficient. The coefficient is not statistically significant at the 5% level if the confidence interval includes 0.
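The t, P>|t| and confidence interval columns can be recomputed by hand from the coefficient and its standard error. As a sketch, the lines below redo the math row of the regression table shown in these slides (195 residual degrees of freedom); the scalar names are made up for illustration.

```stata
* Sketch only: reproduce the math row of the regression table by hand
scalar b_math  = .3893102
scalar se_math = .0741243
scalar df_r    = 195

* t statistic: coefficient divided by its standard error
scalar t_math = b_math / se_math

* Two-tailed p-value from the t distribution
scalar p_math = 2 * ttail(df_r, abs(t_math))

* 95% confidence interval: coefficient +/- critical value * standard error
scalar t_crit = invttail(df_r, 0.025)
scalar ci_lo  = b_math - t_crit * se_math
scalar ci_hi  = b_math + t_crit * se_math

display "t = " %5.2f t_math "   p = " %5.3f p_math
display "95% CI: " %8.6f ci_lo " to " %8.6f ci_hi
```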
Predicted values
After the regression, type
predict yhat
This creates a new variable yhat with the predicted values for the dependent variable
Heteroskedasticity tests, run after the regression:
e.g. estat imtest (White test)
e.g. estat hettest (Breusch-Pagan test)
http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm
http://www.nd.edu/~rwilliam/stats2/l25.pdf
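The post-estimation commands predict and estat only work right after a regression has been run. A minimal sketch of the sequence on the auto dataset that ships with Stata (an assumption; the regression and the variable names yhat and ehat are for illustration only):

```stata
* Sketch only: auto dataset shipped with Stata
sysuse auto, clear
regress price mpg weight

* Fitted values (the default of predict) and residuals
predict yhat
predict ehat, residuals

* Breusch-Pagan / Cook-Weisberg test; H0: constant error variance
estat hettest

* White's test, requested through the white option of estat imtest
estat imtest, white
```

A small p-value in either test is evidence against constant error variance, which is the usual reason to switch to robust standard errors.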
------------------------------------------------------------------------------
             |               Robust
    teachers |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    meal_pct |   .8239426    .336556     2.45   0.015     .1623848      1.4855
    expn_stu |   -.026066   .0098221    -2.65   0.008     -.045373   -.0067591
       _cons |   230.7061   59.44205     3.88   0.000     113.8627    347.5496
------------------------------------------------------------------------------
cluster command
For example, you might think that in a panel of countries, errors are correlated across time but independent across countries. Then, you should cluster standard errors on countries.
e.g. regress y k, cluster(country)
xt is the prefix for the commands in this class. xtreg should be used for regressions with panel data.
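Robust and clustered standard errors are both requested as options on regress. The sketch below uses the auto dataset that ships with Stata, with rep78 standing in as the cluster variable purely for illustration (an assumption; it is not a meaningful cluster). It uses vce(robust) and vce(cluster ...), the currently documented spelling of the older robust and cluster() options.

```stata
* Sketch only: auto dataset shipped with Stata; rep78 is a stand-in cluster
sysuse auto, clear

* Heteroskedasticity-robust standard errors
regress price mpg weight, vce(robust)

* Cluster-robust standard errors: errors may be correlated within a
* cluster but are assumed independent across clusters
regress price mpg weight, vce(cluster rep78)

* Number of clusters used in the last regression
display e(N_clust)
```

The coefficients are the same in both regressions; only the standard errors, t-values and confidence intervals change.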
Fixed effects:
$y_{it} = a + x_{it} b + v_i + e_{it}$
e.g. xtreg lnc lny, fe
Equivalent to including a dummy variable for each case (e.g. each firm).
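The equivalence between the fixed-effects estimator and a regression with one dummy per case can be checked directly. A sketch on a made-up panel of 3 firms over 3 years (all values are hypothetical, and the i.id factor-variable notation assumes Stata 11 or later):

```stata
* Sketch only: hypothetical 3-firm, 3-year panel
clear
input id year x y
1 1 1 2
1 2 2 5
1 3 4 7
2 1 2 8
2 2 3 9
2 3 5 13
3 1 1 1
3 2 3 4
3 3 4 7
end

* Declare the panel structure before using xt commands
xtset id year

* Fixed-effects (within) estimator
xtreg y x, fe
scalar b_fe = _b[x]

* Same slope from OLS with one dummy per firm
regress y x i.id
scalar b_dummies = _b[x]

display "xtreg, fe: " b_fe "   dummies: " b_dummies
```

The slope on x agrees across the two commands; xtreg, fe simply avoids estimating and printing a coefficient for every firm.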
Useful link
http://www.iies.su.se/~masa/stata.htm
This contains links to other STATA websites by topic
http://www.princeton.edu/~erp/stata/main.html
http://www.ats.ucla.edu/stat/stata/webbooks/reg/default.htm