Tutorial

STATA Tutorial- MFE
Elena Capatina
elena.statahelp@gmail.com
Office hours: Mondays 2-4pm, GE313; Wednesdays 2-4pm, GE213 (starting Feb 25th)
Stata 10: How to get it

Buy from Stata website, pick up at Robarts:
http://www.stata.com/order/new/edu/gradplans/cgpcampusorder.html
or
http://www.utoronto.ca/ic/software/detail/stata.html
Finding Data
http://datacentre.chass.utoronto.ca/
For some data you need to access the site from campus, and sometimes you need to go in person to the datacentre.
Off campus, access it using UTORvpn (download from the library website).
Data library: 5th floor, Robarts
Data on CHASS
Canadian data:
CANSIM
Census Analyser
International Trade and Finance data:

Trade Analyser: imports (M) and exports (X) by partner and commodity
IMF trade data
Data on CHASS
Financial markets data:
CRSP Database: NYSE/AMEX/Nasdaq daily and monthly security prices and other historical data for over 20,000 companies
Canadian Financial Markets Research Centre: Toronto Stock Exchange trading information about specific securities
Fundata Mutual Fund Database
Data on CHASS
Companies financial data:
Financial Post Corporate Database
COMPUSTAT Database: Income Statement, Balance Sheet, Flow of Funds, and supplemental data items on more than 10,000 active and 9,400 inactive companies
National income statistics:

OECD National Accounts Database
World Bank databases
Penn World Tables
STATA windows
The command window
The viewer/results window
The review of commands window
The variable window
Working with STATA

1. From the command window
2. Using a do file
The do file
A text file that can be edited using any text editor (the STATA do-file editor, Notepad, Word, etc.), but you need to save it as filename.do for STATA to read it.
From the STATA do-file editor, click do for STATA to execute all commands.
You can also highlight some lines and click do to execute only the highlighted command lines.
Data editor/data browser

Shows you your data.
Check this frequently, especially after commands you are unsure about.
Type of commands
1. Administrative commands that tell STATA where to save results, how to manage computer memory, and so forth
2. Commands that tell STATA to read and manage datasets
3. Commands that tell STATA to modify existing variables or to create new variables
4. Commands that tell STATA to carry out the statistical analysis
Example: stata1.do
clear
log using "C:\Users\Elena\Documents\Various\STATATA\stata1.log", replace
use "C:\Users\Elena\Documents\Various\STATATA\caschool.dta"
describe
generate income = avginc*1000
summarize income
log close
exit
The log using command

The log file is an output file.
It creates and saves a log with all the actions performed by STATA and all the results.
How to view it later?
In Stata, go to File, then Log, then View, and search for your filename, keeping in mind it has the extension .log
Loading your data

If your data is in STATA format, i.e. filename.dta, then enter:
    use filename.dta
If your data is a comma-delimited file:
    insheet using filename.txt
For other formats, you can use StatTransfer to convert to STATA format.
Using a dictionary file

A dictionary file tells STATA how to read raw data files with extension .raw (.dat too).
infile command
i.e. infile using dictionaryname.dct
infix command
Use if you prefer to copy and paste your dictionary in a do file
Useful Commands:
describe :
STATA will list all the variables, their labels, types, and tell you the # of observations
Two types of variables:

1. Numerical
2. String (usually appear in red in the data browser)
You can convert a string variable to numerical using the destring command:
i.e. destring var1, replace
or destring var1, force replace
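The difference between the two destring calls is easiest to see on a small made-up example. The sketch below is only an illustration: the variable name price_str and its values are invented, and the data are typed in with input so the lines can be run on their own.

```stata
* Hypothetical data: a string variable that should have been numeric
clear
input str6 price_str
"100"
"250"
"n/a"
end

* Without force, destring leaves price_str untouched because "n/a"
* is not a number. With force, non-numeric entries become missing (.)
destring price_str, replace force

describe price_str
list
```

Because force silently turns anything non-numeric into a missing value, it is worth looking at the data browser before and after using it.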
More commands:
generate or gen
Creates a new variable
i.e. generate income = avginc*1000
i.e. generate log_inc = log(income)
i.e. gen inc_sq = (income)^2
More commands:
summarize
Tells STATA to compute summary statistics (mean, standard deviations, and so forth) for all variables.
Useful to identify outliers and get an idea of your data.
i.e. summarize
i.e. summ income inc_sq
Ending the do file

log close closes the file stata1.log that contains the output. The command exit tells STATA that the program has ended.
Example: stata2.do
#delimit ;
* Administrative Commands;
set more off;
clear;
log using "C:\Users\Elena\Documents\Various\STATA-TA\stata1.log", replace;
* Read in the Dataset;
use "C:\Users\Elena\Documents\Various\STATA-TA\caschool.dta";
describe;
* Transform data and Create New Variables;
**** Construct average district income in $'s;
generate income = avginc*1000;
* Carry Out Statistical Analysis;
***** Summary Statistics for Income;
summarize income;
* End the Program ;
log close;
exit;
Comments in your do file:

Asterisk: STATA ignores the text that comes after * at the start of a line (it does not execute it). These lines can be used to describe what the commands are doing, or to write any other comments.
i.e. * Administrative Commands
Useful commands
#delimit ;
Tells STATA that each STATA command ends with a semicolon. Useful for long commands that run over several lines. Do not forget the ; at the end of every command, including the comment lines that start with *.
Useful Commands
set more off
Tells STATA not to pause when the results window fills up. Otherwise, if your output is long, STATA displays --more-- at the bottom and waits for a key press before running the remaining commands.
Increasing memory
set memory 600m
You may also need to increase the number of variables allowed by Stata (set maxvar); this cannot be done with Stata/IC.
My typical admin commands

clear
#delimit ;
set more off;
set mem 1200000;
cap log close;
cd "C:\NLSY_1\mainfiles\";
log using "Logs\Calibration_statistics.log", replace;
Other commands
tabulate
i.e. tabulate county
Shows the frequency and percent of each value of county in the dataset
The if command
i.e. generate teachers_new= teachers if teachers<=10
     replace teachers_new=0 if teachers>10
i.e. summarize teachers if county=="Nevada"
(string values such as "Nevada" must be in double quotes)
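A small self-contained sketch of the if qualifier, using invented county and teachers values. One thing to keep in mind: STATA treats a missing value as larger than any number, so teachers>10 is also true when teachers is missing. The extra !missing() condition below is my addition to guard against that; it is not part of the original example.

```stata
* Hypothetical data: four districts, one with a missing teacher count
clear
input str10 county teachers
"Nevada"  4
"Nevada" 12
"Kern"    8
"Kern"    .
end

* Copy teachers when it is at most 10 (missing stays missing)
generate teachers_new = teachers if teachers <= 10

* Set to 0 above 10, but leave missing values alone
replace teachers_new = 0 if teachers > 10 & !missing(teachers)

* Conditions on string variables need double quotes
summarize teachers if county == "Nevada"

* ~= means "not equal to"; a missing value counts as not equal to 0
count if teachers_new ~= 0 & county == "Kern"
```

Checking the result in the data browser after each step is a quick way to confirm the condition did what was intended.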
Operators
<    less than
>    greater than
<=   less than or equal to
>=   greater than or equal to
==   equal to
~=   not equal to
Sorting the data

sort
i.e. sort income
i.e. sort county income
The by command
i.e. by county, sort: summarize income
Deleting variables and observations

drop
i.e. drop avginc
- this drops the variable avginc
i.e. drop if teachers<=5
- this deletes only the observations for which teachers is less than or equal to 5.
Deleting variables and observations

keep
i.e. keep if teachers>=7
Combining datasets
merge command
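use "My Statistics\_respondent.dta", clear;
sort ID;
* Both files must be sorted by the merge variable(s);
merge ID using "My Statistics\_annualfile.dta";
* Drop _merge before merging again, or STATA stops with an error;
drop _merge;
sort ID year;
merge ID year using "temp1.dta";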
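The merge syntax above is the Stata 10 form. From Stata 11 on, merge asks you to state the type of match (1:1, m:1, 1:m), and sorting beforehand is no longer needed. The sketch below uses two tiny invented datasets (the variables income and female are hypothetical) and should be run as one do file, because the temporary file name only exists while the do file runs.

```stata
* Hypothetical annual file: several rows per person
clear
input ID year income
1 2000 10
1 2001 12
2 2000 20
2 2001 22
end
tempfile annual
save `annual'

* Hypothetical respondent file: one row per person
clear
input ID female
1 0
2 1
end

* One respondent row matches many annual rows, hence 1:m
merge 1:m ID using `annual'

* _merge == 3 marks observations found in both files
tabulate _merge
```

Tabulating _merge after every merge is a cheap check that the files matched the way you expected.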
Statistical relationships
1. Correlations: correlate
i.e. correlate income teachers
i.e. correlate income teachers computers
2. Regressions: reg
i.e. reg income teachers
i.e. reg income teachers computers
Graphs
Scatter Plots
i.e. twoway (scatter income computer)
Loops
forvalues
Generate 100 uniform random variables named x1, x2, ..., x100:
    forvalues i = 1(1)100 {
        generate x`i' = uniform()
    }
Divide a dataset into two datasets, each with a different education level (this example uses #delimit ;):
    forvalues e=1/2{;
        use "My Statistics\Maleyearly.dta", clear;
        keep if education==`e';
        save "My Statistics\Males_`e'.dta", replace;
    };
Collapse command
Creates a new dataset with the specified variables summarizing current data
i.e. collapse (mean) no_kids, by (education age status);
Saving your data

Saving in Stata format:
i.e. save filename.dta
You can export your data in another format from File, then Export, then choose the type of file you want.
More on data cleaning

reshape
From long to wide or from wide to long Example:
Wide data:
id   grp  lrn95  lrn96  lrn97
1    1    7      12     16
2    1    13     14     15
3    2    15     14     20
4    2    21     27     24
5    3    9      9      12
6    3    16     17     18
Reshape
Example of long data:
id   year  grp  lrn
1    95    1    7
1    96    1    12
1    97    1    16
2    95    1    13
2    96    1    14
2    97    1    15
3    95    2    15
3    96    2    14
3    97    2    20
4    95    2    21
4    96    2    27
4    97    2    24
5    95    3    9
5    96    3    9
5    97    3    12
6    95    3    16
6    96    3    17
6    97    3    18
Reshape
From wide to long:
i.e. reshape long lrn, i(id) j(year)
From long to wide:

i.e. reshape wide lrn, i(id) j(year)
Source: http://www.ats.ucla.edu/stat/stata/notes/reshape.htm
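To try both directions yourself, the wide data from the example can be typed in with input. This is just a sketch for practice; list, sepby(id) is my addition to make the long layout easier to read.

```stata
* The wide data from the example above
clear
input id grp lrn95 lrn96 lrn97
1 1  7 12 16
2 1 13 14 15
3 2 15 14 20
4 2 21 27 24
5 3  9  9 12
6 3 16 17 18
end

* Wide to long: lrn95, lrn96, lrn97 become one variable lrn,
* and the suffix (95, 96, 97) is stored in the new variable year
reshape long lrn, i(id) j(year)
list, sepby(id)

* Long back to wide
reshape wide lrn, i(id) j(year)
list
```

Here i() names the variable that identifies a unit and j() names the variable that holds the suffix, in both directions.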
tabstat command
i.e. by ay: tabstat N if INC==2 & education1==1, s(n mean max min p50 p25 p75);
egen command
Extended generate command.
More powerful than generate
Examples:
egen age_mean = mean(age), by(year)
egen totalsum = total(x)
egen stdage = std(age)
Lagged variables
[_n-1] tells STATA this is the previous observation
[_n-2] is 2 observations before
Examples: (assuming data is sorted)
gen GDP_lagged= GDP[_n-1]
gen GDP_2= GDP[_n-2]
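If the data contain several countries, a plain GDP[_n-1] would take the first observation's lag from the previous country. A possible way around this, sketched here with invented numbers and a hypothetical country variable, is to sort and lag within each country using bysort.

```stata
* Hypothetical panel: two countries, three years each
clear
input str3 country year GDP
"CAN" 2000 100
"CAN" 2001 104
"CAN" 2002 107
"USA" 2000 900
"USA" 2001 920
"USA" 2002 950
end

* Sort by country and year, then take the lag within each country.
* The first year of each country has no previous observation,
* so GDP_lagged is missing there.
bysort country (year): generate GDP_lagged = GDP[_n-1]

list, sepby(country)
```

Inside a by group, _n counts observations from the start of that group, which is why the lag does not cross countries.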
Other uses for [_n-1]

Filling in missing data
i.e. by ID: replace education=1 if education[_n-1]==1 & education[_n+1]==1 & ID[_n-1]==ID[_n+1];
Shortcut: local
Example using local :
local t = 80;
while `t'<=98{;
    use "Tagsets\status_`t'.dta", clear;
    do "rename_status.do";
    save "weeklyfile_`t'.dta";
    local t = `t'+2;
};
Collapse command
Initial data:
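famid  kidname  birth  age  wt
1      Beth     1      9    60
1      Bob      2      6    40
1      Barb     3      3    20
2      Andy     1      8    80
2      Al       2      6    50
2      Ann      3      2    20
3      Pete     1      6    60
3      Pam      2      4    40
(The collapsed results for family 3 imply a third child with age 2 and wt 20 whose row is not shown here.)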
New data after collapse

collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
famid  avgage    avgwt  numkids
1      6         40     3
2      5.333333  50     3
3      4         40     3
Creates one record per family (famid) with the average of age (called avgage) and average weight (called avgwt) within each family, and the number of kids (numkids) per family
Source: http://www.ats.ucla.edu/stat/Stata/modules/collapse.htm
preserve and restore

preserve tells STATA to keep a copy of the data currently in memory, so if your next commands modify it, restore brings back your original data.
Example:
use data1.dta
preserve
collapse (mean) age, by (family)
save data2.dta
restore
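The same pattern can be tried on auto.dta, the example dataset that ships with Stata, so no file of your own is needed. This is only a sketch; the choice of price and foreign is arbitrary.

```stata
* Example dataset shipped with Stata (74 cars)
sysuse auto, clear

* Keep a copy of the full data
preserve

* collapse replaces the data in memory with one row per group
collapse (mean) price, by(foreign)
list

* Bring back the original 74 observations
restore
describe, short
```

Without preserve and restore, the collapsed data would be all that is left in memory and the original file would have to be loaded again.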
Tests of significance
i.e. ttest sysbp = 122.3 , level(95)
computes the sample mean and standard deviation of the variable sysbp, computes a t-test that the population mean is equal to 122.3, and computes a 95% confidence interval for the population mean
Source: http://www.biostat-edu.com/files/Stata_Program_Notes_Chapter_8posted.mht
STATA output
. ttest sysbp = 122.3;

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   sysbp |     199    125.8241    1.288642    18.17853    123.2829    128.3653
------------------------------------------------------------------------------
    mean = mean(sysbp)                                            t =   2.7348
Ho: mean = 122.3                                 degrees of freedom =      198

   Ha: mean < 122.3           Ha: mean != 122.3           Ha: mean > 122.3
 Pr(T < t) = 0.9966       Pr(|T| > |t|) = 0.0068       Pr(T > t) = 0.0034
Testing if means are equal

ttest testscr_lo=testscr_hi, unequal unpaired
Tests the hypothesis that testscr_lo and testscr_hi come from populations with the same mean.
Computes the t-statistic for the null hypothesis that the mean of testscr_lo is the same as the mean of testscr_hi.
unequal tells STATA that the variances in the two populations may not be the same.
unpaired tells STATA that the observations are for different districts.

ttest testscr_lo=testscr_hi, unequal unpaired;

Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
testsc~o |     238    657.3513    1.254794    19.35801    654.8793    659.8232
testsc~i |     182    649.9788    1.323379    17.85336    647.3676    652.5901
---------+--------------------------------------------------------------------
combined |     420    654.1565    .9297082    19.05335    652.3291     655.984
---------+--------------------------------------------------------------------
    diff |             7.37241    1.823689                3.787296    10.95752
------------------------------------------------------------------------------
    diff = mean(testscr_lo) - mean(testscr_hi)                    t =   4.0426
Ho: diff = 0                     Satterthwaite's degrees of freedom =  403.607

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 0.0000
Simple regression
regress science math female socst read
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  4,   195) =   46.69
       Model |  9543.72074     4  2385.93019           Prob > F      =  0.0000
    Residual |  9963.77926   195  51.0963039           R-squared     =  0.4892
-------------+------------------------------           Adj R-squared =  0.4788
       Total |     19507.5   199  98.0276382           Root MSE      =  7.1482

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   .3893102   .0741243     5.25   0.000      .243122    .5354983
      female |  -2.009765   1.022717    -1.97   0.051    -4.026772    .0072428
       socst |   .0498443    .062232     0.80   0.424    -.0728899    .1725784
        read |   .3352998   .0727788     4.61   0.000     .1917651    .4788345
       _cons |   12.32529   3.193557     3.86   0.000     6.026943    18.62364
------------------------------------------------------------------------------
Source: http://www.ats.ucla.edu/stat/stata/output/reg_output.htm
Reading the output table

SS Total: the total variability around the mean, $\sum (Y - \bar{Y})^2$.
SS Residual: the sum of squared errors, $\sum (Y - \hat{Y})^2$, where $\hat{Y}$ is the predicted value.
SS Model: SS Total - SS Residual. Note that SS Model / SS Total is equal to .4892, the value of R-squared (the proportion of the variance explained by the independent variables).

df: the degrees of freedom associated with the sources of variance.
MS: the Mean Squares, the Sums of Squares divided by their respective df.

Coefficients:
sciencePredicted = 12.32529 + .3893102*math - 2.009765*female + .0498443*socst + .3352998*read
t and P>|t|: these columns provide the t-value and 2-tailed p-value used in testing the null hypothesis that the coefficient (parameter) is 0.
[95% Conf. Interval]: this shows a 95% confidence interval for the coefficient (the coefficient will not be statistically significant at the 5% level if the confidence interval includes 0).
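The numbers in the output table can be reproduced by hand with display, which works as a calculator. The sketch below only uses values copied from the table above, and each result can be compared with the corresponding entry in the output.

```stata
* R-squared = SS Model / SS Total (table shows 0.4892)
display 9543.72074 / 19507.5

* MS = SS / df for the Model row (table shows 2385.93019)
display 9543.72074 / 4

* F = MS Model / MS Residual (table shows 46.69)
display (9543.72074 / 4) / (9963.77926 / 195)

* Adj R-squared = 1 - MS Residual / MS Total (table shows 0.4788)
display 1 - (9963.77926 / 195) / (19507.5 / 199)

* t = Coef. / Std. Err. for math (table shows 5.25)
display .3893102 / .0741243
```

Redoing one or two of these by hand is a good way to check that you are reading the right column.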
Predicted values
After the regression, type predict yhat
Creates a new variable yhat with the predicted values for the dependent variable
Saving the residuals

predict r, residuals
Checking homoskedasticity of residuals
i.e. rvfplot, yline(0)
Plots the residuals against the predicted values
i.e. estat imtest (White test) i.e. estat hettest (Breusch-Pagan test)
http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm
http://www.nd.edu/~rwilliam/stats2/l25.pdf
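The post-regression commands can be tried in sequence on auto.dta, the example dataset shipped with Stata. This is a sketch only: the regression of price on mpg and weight is an arbitrary choice, and the white option on estat imtest is my addition (it asks for White's original version of the test to be reported as well).

```stata
* Example dataset shipped with Stata
sysuse auto, clear

* Any regression will do for practice
regress price mpg weight

* Predicted values and residuals as new variables
predict yhat
predict r, residuals

* Tests for heteroskedasticity
estat imtest, white
estat hettest
```

predict and the estat commands always refer to the most recent regression, so they must be run before estimating another model.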
Linear regression with heteroskedastic errors

Need robust standard errors (Huber-White): - use the robust option with regress
i.e. reg teachers meal_pct expn_stu, robust
Linear regression                                      Number of obs =     420
                                                       F(  2,   417) =    9.58
                                                       Prob > F      =  0.0001
                                                       R-squared     =  0.0232
                                                       Root MSE      =  186.17

------------------------------------------------------------------------------
             |               Robust
    teachers |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    meal_pct |   .8239426    .336556     2.45   0.015     .1623848      1.4855
    expn_stu |   -.026066   .0098221    -2.65   0.008     -.045373   -.0067591
       _cons |   230.7061   59.44205     3.88   0.000     113.8627    347.5496
------------------------------------------------------------------------------
Note: rreg robust regression (outliers)

i.e. rreg inc school exp
This command is for robust regression: it concerns point estimates more than standard errors, and it implements a data-dependent method for downweighting outliers.
Not to be used for heteroskedastic errors, because it is not the same as the robust option.
cluster command
For example, you might think that in a panel of countries, errors are correlated across time but independent across countries. Then, you should cluster standard errors on countries.
i.e. regress y k, cluster(country)
Linear regression with panel data

Declaring the data to be a panel: Example, where data consists of many firms, each observed over 5 years
iis Firm ;
tis Year ;
xt is the prefix for the commands in this class.
xtreg should be used for regressions with panel data.
Fixed effects:
$y_{it} = a + x_{it} b + v_i + e_{it}$
i.e. xtreg lnc lny, fe
Equivalent to including a dummy variable for each case (i.e. firm).
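A complete fixed-effects run on a tiny invented panel might look like the sketch below. All numbers are made up. It uses xtset, which declares the panel and time variables in one step; as far as I know xtset is the documented command from Stata 10 on, with iis and tis kept for older do files.

```stata
* Hypothetical panel: 3 firms observed over 3 years
clear
input Firm Year lnc lny
1 2001 2.1 3.0
1 2002 2.3 3.2
1 2003 2.6 3.5
2 2001 1.5 2.4
2 2002 1.6 2.7
2 2003 1.9 2.9
3 2001 3.0 4.1
3 2002 3.1 4.2
3 2003 3.5 4.6
end

* Declare Firm as the panel variable and Year as the time variable
xtset Firm Year

* Fixed-effects regression
xtreg lnc lny, fe
```

The header of the xtreg output reports the number of groups (firms) and observations per group, which is a quick check that the panel was declared correctly.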
Random effects (RE)

If you think some omitted variables may be constant over time but vary between cases, and others may be fixed between cases but vary over time, then you can include both types by using random effects.
Stata's RE estimator is a weighted average of fixed and between effects.
i.e. xtreg lnc lny, re
Choosing Between Fixed and Random Effects

Run a Hausman test: estimate the FE model, save the coefficients, estimate the RE model, and then do the comparison. Example:
xtreg dependentvar var1 var2 var3 ... , fe
estimates store fixed
xtreg dependentvar var1 var2 var3 ... , re
estimates store random
hausman fixed random
If the p-value is significant, use FE.
Source: http://dss.princeton.edu/online_help/analysis/panel.htm
Time series data

tsset declare data to be time-series data Examples:
tsset time, yearly
(For an annual time series, time takes on values such as 1990, 1991, ...)
tsset company year, yearly
(For yearly panel data, variable company being the panel ID variable and year being a four-digit calendar year)
Serial correlation in residuals

Testing for first order serial correlation (Durbin-Watson statistic)
reg col25 col2 col3 col7 if country=="Mexico"
estat dwatson
Testing for higher order serial correlation (Breusch-Godfrey statistic)
estat bgodfrey
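Both tests need the data to be declared as time series first. The sketch below puts the steps together on simulated data: the variables x and y, the seed, and the choice of lags(1 2) are all my own assumptions for illustration, and rnormal() needs a reasonably recent version of Stata.

```stata
* Simulated annual series, 1960 to 2009
clear
set obs 50
set seed 12345
generate time = 1959 + _n
generate x = rnormal()
generate y = 1 + 0.5*x + rnormal()

* Declare the data to be a yearly time series
tsset time, yearly

regress y x

* First order serial correlation (Durbin-Watson)
estat dwatson

* Higher order serial correlation (Breusch-Godfrey), here lags 1 and 2
estat bgodfrey, lags(1 2)
```

If tsset has not been run, both estat commands stop with an error asking for a time variable.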
Useful link
http://www.iies.su.se/~masa/stata.htm
This contains links to other STATA websites by topic
http://www.princeton.edu/~erp/stata/main.html http://www.ats.ucla.edu/stat/stata/webbooks/reg/default.htm
GETTING MORE INFORMATION ABOUT STATA

The Help menu in STATA: STATA has detailed help files available for all STATA commands.
STATA commands are described in detail in the STATA User's Guide and Reference Manual.
www.stata.com
Finally, you can find several good STATA tutorials on the Web. An easy way to find a list is to do a Google search for "Stata tutorial".
(This tutorial was prepared using information from STATA Tutorial to accompany Stock/Watson Introduction to Econometrics Pearson 2003. )
