
STATA Tutorial - MFE

Elena Capatina
elena.statahelp@gmail.com
Office hours: Mondays 2-4pm, GE313; Wed 2-4pm, GE213 (starting Feb 25th)

Stata 10: How to get it


Buy from Stata website, pick up at Robarts:
http://www.stata.com/order/new/edu/gradplans/cgpcampusorder.html

or
http://www.utoronto.ca/ic/software/detail/stata.html

Finding Data
http://datacentre.chass.utoronto.ca/
Some data can only be accessed from campus, and some require going to the data centre in person
Off campus, connect through UTORvpn (download from the library website)

Data library: 5th floor, Robarts

Data on CHASS
Canadian data:
CANSIM
Census Analyser

International Trade and Finance data:


Trade Analyser: imports (M) and exports (X) by partner and commodity
IMF trade data

Data on CHASS
Financial markets data:
CRSP Database - access NYSE/AMEX/Nasdaq daily and monthly security prices and other historical data related to over 20,000 companies
Canadian Financial Markets Research Centre - Toronto Stock Exchange trading info about specific securities
Fundata Mutual Fund Database

Data on CHASS
Companies financial data:
Financial Post Corporate Database
COMPUSTAT Database - Income Statement, Balance Sheet, Flow of Funds, and supplemental data items on more than 10,000 active and 9,400 inactive companies

National income statistics:


OECD National Accounts Database
World Bank databases
Penn World Tables

STATA windows
The command window
The viewer/results window
The review of commands window
The variable window

Working with STATA


1. From the command window
2. Using a do file

The do file
A text file that can be edited using any text editor (the STATA do-file editor, Notepad, Word, etc.), but you need to save it as filename.do for STATA to read it
From the STATA do-file editor, click "do" for STATA to execute all commands
You can highlight lines and click "do" to execute only the highlighted command lines

Data editor/data browser


Shows you your data
Check this frequently, especially after commands you are unsure about

Type of commands
1. Administrative commands that tell STATA where to save results, how to manage computer memory, and so forth
2. Commands that tell STATA to read and manage datasets
3. Commands that tell STATA to modify existing variables or to create new variables
4. Commands that tell STATA to carry out the statistical analysis

Example: stata1.do
clear
log using "C:\Users\Elena\Documents\Various\STATA-TA\stata1.log", replace
use "C:\Users\Elena\Documents\Various\STATA-TA\caschool.dta"
describe
generate income = avginc*1000
summarize income
log close
exit

The log using command


The log file is an output file
Creates and saves a log with all the actions performed by STATA and all the results
How to view it later?
In Stata, go to File, then Log, then View, and search for your filename, keeping in mind it has extension .log

Loading your data


If your data is in STATA format, i.e. filename.dta, then enter:
use filename.dta
If your data is a comma-delimited file:
insheet using filename.txt
For other formats, you can use Stat/Transfer to convert to STATA format

Using a dictionary file


A dictionary file describes how to read raw data files with extension .raw (or .dat)
infile command
i.e. infile using dictionaryname.dct

infix command
Use if you prefer to copy and paste your dictionary in a do file

Useful Commands:
describe :
STATA will list all the variables, their labels, types, and tell you the # of observations

Two types of variables:


1. Numerical
2. String (usually appear in red in the data browser)
You can convert a string variable to numerical using the destring command:
i.e. destring var1, replace
or destring var1, force replace
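A minimal sketch of what the force option does, assuming a toy variable price_str (a hypothetical name, not from the course data). Without force, destring leaves a variable untouched if it contains any non-numeric entry; with force, such entries become missing.

```stata
* Hypothetical toy data: a numeric field that was read in as text
clear
input str6 price_str
"1200"
"950"
"n/a"
end

* With force, the non-numeric entry "n/a" is converted to missing (.)
destring price_str, generate(price_num) force
list
count if missing(price_num)
```

Using generate() instead of replace keeps the original string, which makes it easy to inspect the entries that could not be converted.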

More commands:
generate or gen
Creates a new variable
i.e. generate income = avginc*1000
i.e. generate log_inc = log(income)
i.e. gen inc_sq = (income)^2

More commands:
summarize
Tells STATA to compute summary statistics (mean, standard deviations, and so forth) for all variables
Useful to identify outliers and get an idea of your data
i.e. summarize
i.e. summ income inc_sq

Ending the do file


log close closes the file stata1.log that contains the output.
The command exit tells STATA that the program has ended.

Example: stata2.do
#delimit ;
* Administrative Commands;
set more off;
clear;
log using "C:\Users\Elena\Documents\Various\STATA-TA\stata1.log", replace;
* Read in the Dataset;
use "C:\Users\Elena\Documents\Various\STATA-TA\caschool.dta";
describe;
* Transform data and Create New Variables;
**** Construct average district income in $'s;
generate income = avginc*1000;
* Carry Out Statistical Analysis;
***** Summary Statistics for Income;
summarize income;
* End the Program;
log close;
exit;

Comments in your do file:


Asterisk: STATA ignores (does not execute) a line that starts with *
These lines can be used to describe what the commands are doing or to write other comments.
i.e. * Administrative Commands

Useful commands
#delimit ;
Tells STATA that each STATA command ends with a semicolon.
Useful for long commands that run over several lines
Do not forget the ; at the end of every command, including the comment lines that start with *.

Useful Commands
set more off
Ensures STATA executes all commands without pausing. Otherwise, if your output is long, the results window fills up and STATA displays --more-- at the bottom and waits for a key press before it continues

Increasing memory
set memory 600m
You may also need to increase the number of variables allowed by Stata (this cannot be done with IC Stata)

My typical admin commands


clear
#delimit ;
set more off;
set mem 1200000;
cap log close;
cd "C:\NLSY_1\mainfiles\";
log using "Logs\Calibration_statistics.log", replace;

Other commands
tabulate
i.e. tabulate county
Shows the frequency and percent of each value of county in the dataset

The if qualifier
i.e. generate teachers_new = teachers if teachers<=10
     replace teachers_new = 0 if teachers>10
i.e. summarize teachers if county=="Nevada"

Operators
<   less than
>   greater than
<=  less than or equal to
>=  greater than or equal to
==  equal to
~=  not equal to
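The if qualifier and the operators can be tried on the auto dataset that ships with STATA. This is only an illustrative sketch: mpg_capped is a hypothetical variable name, and the cutoffs are arbitrary. One thing to keep in mind is that STATA treats missing values as larger than any number, so a condition like x > 10 is also true for missing x.

```stata
* Example dataset shipped with STATA (not the course data)
sysuse auto, clear

* Cap mpg at 30 in a new variable (mpg_capped is a hypothetical name)
generate mpg_capped = mpg if mpg <= 30
replace mpg_capped = 30 if mpg > 30

* Summary statistics for a subset of observations
summarize price if foreign == 1

* Conditions can be combined with & (and) and | (or)
count if mpg >= 20 & mpg ~= 25
```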

Sorting the data


sort
i.e. sort income
i.e. sort county income

The by command
i.e. by county, sort: summarize income
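As a small illustration of the by prefix, the sketch below uses the auto dataset shipped with STATA rather than the course data. The variable n_in_group is a hypothetical name. Within a by group, _N is the number of observations in that group.

```stata
* Example dataset shipped with STATA (not the course data)
sysuse auto, clear

* Summary statistics of price separately for each value of foreign
by foreign, sort: summarize price

* bysort is shorthand for "by ..., sort:"
bysort foreign: generate n_in_group = _N
tabulate foreign
```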

Deleting variables and observations


drop
i.e. drop avginc
- this drops the variable avginc
i.e. drop if teachers<=5
- this deletes only the observations for which teachers is less than or equal to 5.

Deleting variables and observations


keep
i.e. keep if teachers>=7
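A quick way to see what drop and keep did is to count observations before and after. The sketch below uses the auto dataset shipped with STATA, and the cutoff of 15 is arbitrary.

```stata
* Example dataset shipped with STATA (not the course data)
sysuse auto, clear
count

* Delete observations: cars with mpg of 15 or less
drop if mpg <= 15
count

* Keep only three variables and drop all the others
keep make price mpg
describe
```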

Combining datasets
merge command
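use "My Statistics\_respondent.dta", clear;
sort ID;
merge ID using "My Statistics\_annualfile.dta";
drop _merge;
sort ID year;
merge ID year using "temp1.dta";
(Both datasets must be sorted by the merge variables. The _merge variable created by the first merge has to be dropped before merging again.)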
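A self-contained sketch of a match merge on two made-up datasets (all names and values here are hypothetical). It uses the older "merge varlist using" syntax shown in this tutorial; newer versions of STATA still accept it but print a note suggesting the merge 1:1 / m:1 syntax.

```stata
* Using dataset: one row per person (made-up data)
clear
input id str5 name
1 "Ann"
2 "Bob"
end
sort id
tempfile people
save `people'

* Master dataset: one row per person-year (made-up data)
clear
input id year income
1 2001 30
1 2002 32
2 2001 41
end
sort id

* Match on id; _merge records where each observation came from
merge id using `people'
tabulate _merge
list
```

In the _merge variable, 1 means the observation was only in the master data, 2 only in the using data, and 3 in both, so tabulating it is a quick check that the merge behaved as expected.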

Statistical relationships
1. Correlations: correlate
i.e. correlate income teachers
i.e. correlate income teachers computers

2. Regressions: reg
i.e. reg income teachers
i.e. reg income teachers computers

Graphs
Scatter Plots
i.e. twoway (scatter income computer)

Loops
forvalues
Generate 100 uniform random variables named x1, x2, ..., x100:
forvalues i = 1(1)100 {
    generate x`i' = uniform()
}
Divide a dataset into two datasets, one for each education level (with #delimit ; in effect):
forvalues e=1/2{;
    use "My Statistics\Maleyearly.dta", clear;
    keep if education==`e';
    save "My Statistics\Males_`e'.dta", replace;
};

Collapse command
Creates a new dataset with the specified variables summarizing current data
i.e. collapse (mean) no_kids, by (education age status);

Saving your data


Saving in Stata format:
i.e. save filename.dta

You can export your data in another format from File , then Export , then choose the type of file you want.

More on data cleaning


reshape
From long to wide or from wide to long
Example:

Wide data:
id   grp   lrn95   lrn96   lrn97
 1     1       7      12      16
 2     1      13      14      15
 3     2      15      14      20
 4     2      21      27      24
 5     3       9       9      12
 6     3      16      17      18

Reshape
Example of long data:
id   year   grp   lrn
 1     95     1     7
 1     96     1    12
 1     97     1    16
 2     95     1    13
 2     96     1    14
 2     97     1    15
 3     95     2    15
 3     96     2    14
 3     97     2    20
 4     95     2    21
 4     96     2    27
 4     97     2    24
 5     95     3     9
 5     96     3     9
 5     97     3    12
 6     95     3    16
 6     96     3    17
 6     97     3    18

Reshape
From wide to long:
i.e. reshape long lrn, i(id) j(year)

From long to wide:


i.e. reshape wide lrn, i(id) j(year)

Source: http://www.ats.ucla.edu/stat/stata/notes/reshape.htm
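The reshape example can be reproduced by typing in the wide data from the slide and converting it in both directions. This is a sketch of the same commands, with list added to inspect the result.

```stata
* Enter the wide data from the example by hand
clear
input id grp lrn95 lrn96 lrn97
1 1  7 12 16
2 1 13 14 15
3 2 15 14 20
4 2 21 27 24
5 3  9  9 12
6 3 16 17 18
end

* Wide to long: lrn95, lrn96, lrn97 become one variable lrn plus year
reshape long lrn, i(id) j(year)
list, sepby(id)

* Long back to wide
reshape wide lrn, i(id) j(year)
list
```

Here i() names the variable that identifies a unit, and j() names the variable that holds the suffix (95, 96, 97) of the wide variable names.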

tabstat command
i.e. by ay: tabstat N if INC==2 & education1==1, s(n mean max min p50 p25 p75);

egen command
Extended generate command.
More powerful than generate

Examples:
egen age_mean = mean(age), by(year)
egen totalsum = total(x)
egen stdage = std(age)
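A short sketch of the same egen functions on the auto dataset shipped with STATA. The new variable names (mean_price, price_dev, total_price, std_mpg) are hypothetical. The key difference from generate is that egen functions such as mean() and total() work across observations, not within a single row.

```stata
* Example dataset shipped with STATA (not the course data)
sysuse auto, clear

* Group mean, repeated on every observation of the group
egen mean_price = mean(price), by(foreign)
generate price_dev = price - mean_price

* Overall total, repeated on every observation
egen total_price = total(price)

* Standardized variable: mean 0, standard deviation 1
egen std_mpg = std(mpg)
summarize price_dev std_mpg
```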

Lagged variables
[_n-1] tells STATA this is the previous observation
[_n-2] is 2 observations before
Examples (assuming data is sorted):
gen GDP_lagged = GDP[_n-1]
gen GDP_2 = GDP[_n-2]
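A minimal sketch with made-up GDP numbers shows why sorting matters and what happens at the start of the data: the first observation has no previous row, so its lag is missing.

```stata
* Made-up annual data, for illustration only
clear
input year GDP
2001 100
2002 104
2003 110
end

* [_n-1] refers to the previous row, so sort by time first
sort year
generate GDP_lagged = GDP[_n-1]
generate growth = (GDP - GDP_lagged) / GDP_lagged
list
```

With panel data, the same expression is normally combined with the by prefix so that a lag never reaches back into the previous unit.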

Other uses for [_n-1]


Filling in missing data
i.e. by ID: replace education=1 if education[_n-1]==1 & education[_n+1]==1 & ID[_n-1]==ID[_n+1];

Shortcut: local
Example using local :
local t = 80;
while `t'<=98{;
    use "Tagsets\status_`t'.dta", clear;
    do "rename_status.do";
    save "weeklyfile_`t'.dta";
    local t = `t'+2;
};

Collapse command
Initial data:
famid   kidname   birth   age   wt
    1   Beth          1     9   60
    1   Bob           2     6   40
    1   Barb          3     3   20
    2   Andy          1     8   80
    2   Al            2     6   50
    2   Ann           3     2   20
    3   Pete          1     6   60
    3   Pam           2     4   40
    3   Pat           3     2   20

New data after collapse


collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)

famid     avgage   avgwt   numkids
    1          6      40         3
    2   5.333333      50         3
    3          4      40         3

Creates one record per family (famid) with the average of age (called avgage) and average weight (called avgwt) within each family, and the number of kids (numkids) per family
Source: http://www.ats.ucla.edu/stat/Stata/modules/collapse.htm
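The collapse example can be checked by entering the kids data by hand. This sketch assumes the family 3 data include a third child (Pat, birth order 3, age 2, weight 20), which is what the collapsed averages and counts imply.

```stata
* Kids data entered by hand; the last row is an assumption (see above)
clear
input famid str4 kidname birth age wt
1 "Beth" 1 9 60
1 "Bob"  2 6 40
1 "Barb" 3 3 20
2 "Andy" 1 8 80
2 "Al"   2 6 50
2 "Ann"  3 2 20
3 "Pete" 1 6 60
3 "Pam"  2 4 40
3 "Pat"  3 2 20
end

* One record per family: mean age, mean weight, number of kids
collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
list
```

Note that collapse replaces the data in memory, so the child-level records are gone afterwards unless they were saved or preserved first.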

preserve and restore


preserve tells STATA to keep a copy of your data, so if your next commands modify it, you can come back to your original data with restore
Example:
use data1.dta
preserve
collapse (mean) age, by(family)
save data2.dta
restore
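The effect of preserve and restore is easy to see by checking the number of observations before and after. A sketch on the auto dataset shipped with STATA:

```stata
* Example dataset shipped with STATA (not the course data)
sysuse auto, clear
count

* Keep a copy, then collapse to one row per value of foreign
preserve
collapse (mean) price, by(foreign)
list

* Bring back the original 74 observations
restore
count
```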

Tests of significance
i.e. ttest sysbp = 122.3 , level(95)
computes the sample mean and standard deviation of the variable sysbp, computes a t-test that the population mean is equal to 122.3, and computes a 95% confidence interval for the population mean

Source: mhtml:http://www.biostat-edu.com/files/Stata_Program_Notes_Chapter_8posted.mht

STATA output
. ttest sysbp = 122.3;

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   sysbp |     199    125.8241    1.288642    18.17853    123.2829    128.3653
------------------------------------------------------------------------------
    mean = mean(sysbp)                                            t =   2.7348
Ho: mean = 122.3                                 degrees of freedom =      198

    Ha: mean < 122.3           Ha: mean != 122.3           Ha: mean > 122.3
 Pr(T < t) = 0.9966       Pr(|T| > |t|) = 0.0068        Pr(T > t) = 0.0034

Testing if means are equal


ttest testscr_lo=testscr_hi, unequal unpaired
Tests the hypothesis that testscr_lo and testscr_hi come from populations with the same mean.
Computes the t-statistic for the null hypothesis that the mean of testscr_lo is the same as the mean of testscr_hi
unequal tells STATA that the variances in the two populations may not be the same.
unpaired tells STATA that the observations are for different districts

ttest testscr_lo=testscr_hi, unequal unpaired;

Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
testsc~o |     238    657.3513    1.254794    19.35801    654.8793    659.8232
testsc~i |     182    649.9788    1.323379    17.85336    647.3676    652.5901
---------+--------------------------------------------------------------------
combined |     420    654.1565    .9297082    19.05335    652.3291     655.984
---------+--------------------------------------------------------------------
    diff |             7.37241    1.823689                3.787296    10.95752
------------------------------------------------------------------------------
    diff = mean(testscr_lo) - mean(testscr_hi)                    t =   4.0426
Ho: diff = 0                     Satterthwaite's degrees of freedom =  403.607

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 0.0000
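When the two groups are stored in one variable and identified by a second variable, the same kind of test can be run with the by() option instead of two separate variables. A sketch on the auto dataset shipped with STATA (the hypothesized value of 20 is arbitrary):

```stata
* Example dataset shipped with STATA (not the course data)
sysuse auto, clear

* One-sample test: is mean mpg equal to 20?
ttest mpg == 20

* Two-sample test with unequal variances: domestic vs. foreign cars
ttest mpg, by(foreign) unequal

* ttest leaves its results in r(); for example the t statistic
display "t = " r(t) ", two-sided p-value = " r(p)
```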

Simple regression
regress science math female socst read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  4,   195) =   46.69
       Model |  9543.72074     4  2385.93019           Prob > F      =  0.0000
    Residual |  9963.77926   195  51.0963039           R-squared     =  0.4892
-------------+------------------------------           Adj R-squared =  0.4788
       Total |     19507.5   199  98.0276382           Root MSE      =  7.1482

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   .3893102   .0741243     5.25   0.000      .243122    .5354983
      female |  -2.009765   1.022717    -1.97   0.051    -4.026772    .0072428
       socst |   .0498443    .062232     0.80   0.424    -.0728899    .1725784
        read |   .3352998   .0727788     4.61   0.000     .1917651    .4788345
       _cons |   12.32529   3.193557     3.86   0.000     6.026943    18.62364
------------------------------------------------------------------------------
Source: http://www.ats.ucla.edu/stat/stata/output/reg_output.htm

Reading the output table


SSTotal -- The total variability around the mean: $\sum (Y - \bar{Y})^2$.
SSResidual -- The sum of squared errors: $\sum (Y - \hat{Y})^2$, where $\hat{Y}$ is the predicted value.
SSModel -- SSTotal - SSResidual.
Note that SSModel / SSTotal is equal to .4892, the value of R-squared (= proportion of the variance explained by the independent variables)

Reading the output table


df - These are the degrees of freedom associated with the sources of variance.
MS - These are the Mean Squares, the Sum of Squares divided by their respective DF.
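The relationships between SS, df, MS and R-squared can be verified from the results that regress stores in e(). This sketch uses the auto dataset shipped with STATA, so the numbers differ from the table above, but the identities are the same.

```stata
* Example dataset shipped with STATA (not the course data)
sysuse auto, clear
regress price mpg weight

* R-squared = SSModel / SSTotal, where SSTotal = SSModel + SSResidual
display "SSModel/SSTotal = " e(mss) / (e(mss) + e(rss))
display "R-squared       = " e(r2)

* MS = SS / df; Root MSE is the square root of MS Residual
display "MS Residual     = " e(rss) / e(df_r)
display "Root MSE        = " sqrt(e(rss) / e(df_r)) "  vs  " e(rmse)
```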

Reading the output table


Coefficients:
sciencePredicted = 12.32529 + .3893102*math - 2.009765*female + .0498443*socst + .3352998*read

t and P>|t| - These columns provide the t-value and 2-tailed p-value used in testing the null hypothesis that the coefficient (parameter) is 0.
[95% Conf. Interval] - This shows a 95% confidence interval for the coefficient (the coefficient is not statistically significant at the 5% level if the confidence interval includes 0).

Predicted values
After the regression, type predict yhat
Creates a new variable yhat with the predicted values for the dependent variable

Saving the residuals


predict r, residuals
Checking homoskedasticity of residuals
i.e. rvfplot, yline(0)
Plots the residuals against the predicted values

i.e. estat imtest (White test)
i.e. estat hettest (Breusch-Pagan test)
http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm
http://www.nd.edu/~rwilliam/stats2/l25.pdf

Linear regression with heteroskedastic errors


Need robust standard errors (Huber-White):
- use the robust option with regress
i.e. reg teachers meal_pct expn_stu, robust

Linear regression                                      Number of obs =     420
                                                       F(  2,   417) =    9.58
                                                       Prob > F      =  0.0001
                                                       R-squared     =  0.0232
                                                       Root MSE      =  186.17

------------------------------------------------------------------------------
             |               Robust
    teachers |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    meal_pct |   .8239426    .336556     2.45   0.015     .1623848      1.4855
    expn_stu |   -.026066   .0098221    -2.65   0.008     -.045373   -.0067591
       _cons |   230.7061   59.44205     3.88   0.000     113.8627    347.5496
------------------------------------------------------------------------------

Note: rreg robust regression (outliers)


i.e. rreg inc school exp
This command is for robust regression: it concerns point estimates more than standard errors, and it implements a data-dependent method for downweighting outliers.
Not to be used for heteroskedastic errors, because it is not the same as the robust option

cluster option
For example, you might think that in a panel of countries, errors are correlated over time within a country but independent across countries. Then, you should cluster standard errors on countries.
i.e. regress y k, cluster(country)
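A sketch of clustered standard errors on the nlsw88 dataset shipped with STATA. This is not panel data, so clustering on industry is only meant to show the syntax and the stored results, under the assumption that errors may be correlated among workers in the same industry.

```stata
* Example dataset shipped with STATA (not the course data)
sysuse nlsw88, clear

* Standard errors clustered on industry
regress wage tenure, cluster(industry)

* Number of clusters used for the standard errors
display "clusters: " e(N_clust)
```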

Linear regression with panel data


Declaring the data to be a panel:
Example, where data consists of many firms, each observed over 5 years
iis Firm ;
tis Year ;

xt is the prefix for the commands in this class
xtreg should be used for regressions with panel data

Fixed effects:
$y_{it} = a + x_{it} b + v_i + e_{it}$
i.e. xtreg lnc lny, fe
Equivalent to including a dummy variable for each case (i.e. firm).

Random effects (RE)


If you think some omitted variables may be constant over time but vary between cases, and others may be fixed between cases but vary over time, then you can include both types by using random effects.
Stata's RE estimator is a weighted average of fixed and between effects
i.e. xtreg lnc lny, re

Choosing Between Fixed and Random Effects


Running a Hausman test: estimate the FE model, save the coefficients, estimate the RE model, and then do the comparison.
Example:
xtreg dependentvar var1 var2 var3 ... , fe
estimates store fixed
xtreg dependentvar var1 var2 var3 ... , re
estimates store random
hausman fixed random

If significant p-value, use FE
Source: http://dss.princeton.edu/online_help/analysis/panel.htm
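The Hausman procedure can be tried on a simulated panel where the answer is known. In this sketch all names and numbers are made up: 100 firms observed for 5 years, with a firm effect v that is correlated with the regressor x by construction, so random effects should be rejected in favour of fixed effects. It declares the panel with xtset, which is an assumption about the STATA version (it does the same job as iis and tis).

```stata
* Simulated panel, for illustration only
clear
set seed 2009
set obs 100
generate firm = _n
generate v = invnormal(uniform())        // firm effect
expand 5
bysort firm: generate year = _n

* x is correlated with the firm effect by construction
generate x = 0.5*v + invnormal(uniform())
generate y = 1 + 2*x + v + invnormal(uniform())

xtset firm year
xtreg y x, fe
estimates store fixed
xtreg y x, re
estimates store random
hausman fixed random
```

The order of the arguments matters: the estimator that is consistent under both hypotheses (fixed effects) goes first.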

Time series data


tsset declares data to be time-series data
Examples:
tsset time, yearly
(For an annual time series, time takes on values such as 1990, 1991, ...)
tsset company year, yearly
(For yearly panel data, variable company being the panel ID variable and year being a four-digit calendar year)

Serial correlation in residuals


Testing for first order serial correlation (Durbin-Watson statistic)
reg col25 col2 col3 col7 if country=="Mexico"
estat dwatson
Testing for higher order serial correlation (Breusch-Godfrey statistic)
estat bgodfrey
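Both tests need the data to be declared as a time series with tsset first. A sketch on simulated data (all names and numbers are made up, and the errors are independent by construction, so neither test should find much evidence of serial correlation):

```stata
* Simulated time series, for illustration only
clear
set seed 12345
set obs 100
generate t = _n
tsset t

generate x = invnormal(uniform())
generate y = 1 + 0.5*x + invnormal(uniform())

regress y x

* Breusch-Godfrey test for serial correlation up to order 2
estat bgodfrey, lags(2)

* Durbin-Watson statistic (values near 2 suggest no first order correlation)
estat dwatson
```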

Useful links
http://www.iies.su.se/~masa/stata.htm
This contains links to other STATA websites by topic
http://www.princeton.edu/~erp/stata/main.html
http://www.ats.ucla.edu/stat/stata/webbooks/reg/default.htm

GETTING MORE INFORMATION ABOUT STATA


The Help menu in STATA: STATA has detailed help files available for all STATA commands.
STATA commands are described in detail in the STATA User's Guide and Reference Manual.
www.stata.com
Finally, you can find several good STATA tutorials on the Web. An easy way to find a list is to do a Google search for "Stata tutorial".
(This tutorial was prepared using information from "STATA Tutorial" to accompany Stock/Watson, Introduction to Econometrics, Pearson 2003.)
