
A First Tutorial in Stata

Stan Hurn
Queensland University of Technology
National Centre for Econometric Research
www.ncer.edu.au
Stan Hurn (NCER) Stata Tutorial 1 / 66
Table of contents
1 Preliminaries
2 Loading Data
3 Basic Descriptive Statistics
4 Basic Plotting
5 Simple Data Manipulation
6 Simple Linear Regression
7 Using do files
8 Some Regression Examples
    Electricity Data
    California Schools Data
    Food Expenditure and Income
9 Instrumental Variables Estimation
    Wage Data
    Artificial Data
Preliminaries
Stata
Stata is a fast, powerful statistical package with
smart data-management facilities,
a wide array of up-to-date statistical techniques,
and an excellent system for producing publication-quality graphs
The bad news is that Stata is not as easy to use as some other statistical
packages, although Version 12 has a reasonable menu-driven interface. On the
whole, the advantages probably outweigh the steepness of the initial learning curve.
Stata Resources
One of the major advantages to using Stata is that there are a large number of
helpful resources to be found. For example:
a good web-based tutorial can be found at
http://data.princeton.edu/stata/default.html
a useful introductory book is
An Introduction to Modern Econometrics Using Stata by Christopher F. Baum
published by Stata Press in 2006
Stan Hurn (NCER) Stata Tutorial 4 / 66
Preliminaries
The Stata 12 Front End for Mac
Stan Hurn (NCER) Stata Tutorial 5 / 66
Preliminaries
The Stata 12 Front End for Windows
Stan Hurn (NCER) Stata Tutorial 6 / 66
Preliminaries
Stata 12 Front End
Stata has an menu bar on the top and 5 internal windows.
The main window is the one in the middle (1 on the previous slide). It gives
you the all output of you operations in Stata.
The Command window (2) executes commands. You can type commands
directly in this window as an alternative to using the menu system.
The Review window (3) lists all the operations performed since opening
Stata. If you click on one of your past commands, it is displayed in the
Command window and you can re-run it by pressing the Enter key.
The Variables window (4) lists the variables in the current dataset (and their
descriptions). When you double-click on a variable, its name appears in the
Command window.
The Properties window (5) gives information about your dataset and your
variables.
Changing the Working Directory
To avoid having to specify the path each time you wish to load a data file or
run a Stata program (saved in a do file), it is useful to change the
working directory so that Stata looks in the directory that you are currently
working in.
Click File > Change Working Directory
Browse for the correct directory and select it.
The result is printed in the Results window and the corresponding Stata
command is echoed in the Review window, enabling you to reconstruct a do
file of your session.
Loading Data
Loading an Existing Stata File
Simply click File > Open and browse for an existing Stata data file.
Stata data files have the extension .dta.
Open the file food.dta. You will note that two variables, food_exp and
income, appear in the Variables window of the Stata main page.
In the Properties window you will see the filename food.dta together with
some information about the file. This file has 2 variables, each with 40
observations; the size of the file in memory is also given.
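The menu steps above have a direct command-line equivalent. The lines below are a minimal sketch, assuming food.dta sits in the current working directory; c(k) is Stata's built-in count of variables in memory.

```stata
* Load the Stata data file (assumes food.dta is in the working directory)
use food.dta, clear
* List variable names, storage types and labels
describe
* Report the number of observations and variables
display "Observations: " _N "   Variables: " c(k)
```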
Loading an Excel File
Load the Excel file US Macroeconomic Data.xls
Click File > Import > Excel Spreadsheet
Browse for the correct file in the working directory and open it.
Remember to select the option asking whether you want to use the first row
as variable names.
Changing variable names in Stata is not obvious when using the
menu, but it is easy enough from the Command window.
rename oldname newname
will do the trick. Try it.
NOTE Case matters: if you use an uppercase letter where a lowercase letter
belongs, or vice versa, an error message will be displayed.
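For reference, the import can also be done from the Command window. This is a sketch only, assuming the spreadsheet is in the working directory and that its first worksheet holds the data; oldname and newname are placeholders, so the rename line is left commented out.

```stata
* Import the first worksheet, treating the first row as variable names
* (assumes the file is in the working directory)
import excel "US Macroeconomic Data.xls", firstrow clear
* Check which variable names were created
describe
* Rename a variable; oldname and newname are placeholders
* rename oldname newname
```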
Loading a CSV File
Load the CSV file taylor.csv, which contains data on the output gap, the
inflation gap and the Federal Funds rate for the period 1961:Q1 to 1999:Q4.
Click File > Import > Text data created by a spreadsheet
Browse for the file and load it. You should have data on the variables r, infl
and ygap.
To specify this as time series data we need a series of dates. The date vector
(called year) is created using the following command
generate year = tq(1961q1) + _n-1
To make sure Stata understands that this is a time series data set we need to
tell it to use year as the date vector. The command is
tsset year, quarterly
The Stata menu command to do this is shown on the next slide.
[Figure: Assigning a Date Vector (menu dialogue)]
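The CSV steps can be collected into a few commands. This is a sketch, assuming taylor.csv is in the working directory; insheet is the Stata 12 command behind the "Text data created by a spreadsheet" menu item, and the format line is an addition that only changes how the dates are displayed.

```stata
* Read the comma-separated file (assumes taylor.csv is in the working directory)
insheet using taylor.csv, clear
* Quarterly date index starting in 1961:Q1
generate year = tq(1961q1) + _n - 1
* Display the index as 1961q1, 1961q2, ... rather than as integers
format year %tq
* Declare the data to be a quarterly time series
tsset year, quarterly
```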
Basic Descriptive Statistics
Summary Statistics
Reload the file food.dta.
Now click Statistics and then choose
Summaries, tables, and tests > Summary and descriptive statistics.
Sometimes it is useful to have a look at the histogram of the data. Click
Graphics > Histogram and experiment with some of the options.
Another useful visual tool is the box plot. Click Graphics > Box plot
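The same summaries and plots are available as commands. A minimal sketch, assuming food.dta is in the working directory and contains food_exp and income as described earlier:

```stata
use food.dta, clear
* Means, standard deviations, minima and maxima
summarize food_exp income
* Histogram with frequencies rather than densities on the vertical axis
histogram food_exp, frequency
* Box plot of the same variable
graph box food_exp
```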
Basic Plotting
Simple Scatter
Click File > Open and browse for food.dta. This is a Stata data file.
Click Graphics > Twoway and create a simple scatter plot of weekly food
expenditure versus weekly income.
Time Series Plots
Let's work through a simple example to construct a plot of the Australian business
cycle.
Click File > Import > Excel Spreadsheet and use the first row as variable
names. This will give you a variable gdp.
Create a quarterly date vector from 1959:Q2 to 1996:Q1 and declare the data
to be a time series using dates as the time vector.
The commands are
generate dates = tq(1959q2) + _n-1
tsset dates, quarterly
Plot the data.
[Figure: Australian GDP]
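One way to produce the plot from the Command window is tsline. This sketch assumes gdp is already in memory and that the generate and tsset commands above have been run; the title text is only illustrative.

```stata
* Assumes gdp is in memory and "tsset dates, quarterly" has been run
* Show the time axis as 1959q2, 1959q3, ...
format dates %tq
* Line plot of gdp against the time variable
tsline gdp, title("Australian GDP")
```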
Simple Data Manipulation
Data Transformations
Stata's basic commands for data transformation are generate and replace.
generate creates a new variable.
replace modifies an existing variable.
Both commands are accessed via the Data menu item on the main Stata
toolbar.
[Figure: generate and replace]
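A small self-contained illustration of the difference between the two commands may help. It builds a five-row toy dataset (not one of the tutorial files), so note that clear discards whatever data is currently in memory.

```stata
clear
set obs 5
generate x = _n            // new variable: 1, 2, 3, 4, 5
generate y = x^2           // new variable built from x
replace y = 0 if x > 3     // overwrite y where the condition holds
list x y
```

Running generate y a second time would fail because y already exists, which is why changes to an existing variable go through replace.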
Growth rate of Australian GDP
Create a growth rate of gdp using the L. operator (lag operator)
generate g = log(gdp)-log(L1.gdp)
Australian Business Cycle
While the plot of the growth rate of gdp is more informative than a plot of the
level of the series, yet more information can be obtained by smoothing g with a
centred seven-quarter moving average.
generate bcycle = (L3.g+L2.g+L1.g+g+F1.g+F2.g+F3.g)/7
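The growth rate and its smoothed version can be plotted together. This is a sketch assuming gdp is in memory, the data have been tsset, and g and bcycle do not exist yet; the variable bcycle_ma and the legend labels are illustrative additions, and tssmooth ma is shown only as an alternative route to a similar moving average.

```stata
* Assumes gdp is in memory and the data have been tsset
generate g = log(gdp) - log(L1.gdp)
generate bcycle = (L3.g + L2.g + L1.g + g + F1.g + F2.g + F3.g)/7
* g is missing in the first quarter; bcycle is missing in the first four
* and last three quarters because of the lags and leads
tsline g bcycle, legend(order(1 "Growth rate" 2 "Smoothed growth rate"))
* Alternative: built-in moving-average smoother with 3 lags, the current
* value and 3 leads (its treatment of the sample end points may differ)
tssmooth ma bcycle_ma = g, window(3 1 3)
```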
Load the food data set
1. Make sure you are in the right working directory (File > Change Working
Directory).
2. Load the dataset in food.dta and look at the data characteristics.
3. You can experiment using Statistics > Summaries, tables, and tests >
Summary and descriptive statistics, but it is simpler to issue the following
commands from the Command window.
describe
list
browse
summarize
summarize food_exp, detail
Simple scatter plots
1. Use Graphics > Twoway to create a simple scatter plot of weekly food
expenditure versus weekly income.
2. Issue the command
twoway (scatter food_exp income)
3. Issue the command
twoway (scatter food_exp income), title(Food Expenditure Data)
4. Issue the command
twoway (scatter food_exp income) (lfit food_exp income), title(Fitted Regression Line)
The line of best fit is obtained by linear regression of food expenditure on income.
We will now explore this in more detail.
Simple Linear Regression
A First Regression
1. Load the data set caschool.dta.
2. Run a regression of the test scores, testscr, against the student-teacher ratio,
str. You do this by selecting Statistics > Linear models and related >
Linear regression.
3. A dialogue box will pop up which will require you to fill in the dependent and
independent variables.
[Figure: Regression dialogue box]
Regression Results
Stata reports the regression results as follows:
[Figure: regression output]
The regression predicts that if class size falls by one student, test scores will
increase by 2.28 points.
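The dialogue box simply builds a regress command. A minimal sketch, assuming caschool.dta is in the working directory; after estimation the coefficients can be read back from _b[].

```stata
use caschool.dta, clear
* Regress test scores on the student-teacher ratio
regress testscr str
* The estimated slope and intercept are stored in _b[]
display "Slope on str: " _b[str]
display "Intercept:    " _b[_cons]
```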
Predicted Values and Residuals
A common task after running a regression is storing the fitted values, $\hat{y}$, or the
residuals, $\hat{u}$. Here you must become familiar with the very useful Statistics >
Postestimation menu. One option to select is Predictions, residuals, etc., which
opens a dialogue box.
[Figure: Predictions, residuals, etc. dialogue box]
1. Note that the names you choose for the predicted values and/or residuals
cannot already be taken. Use something obvious like yfit or yhat for the
fitted values and res or uhat for the residuals.
2. You can also use the Postestimation option to obtain confidence intervals
for the prediction using the option Standard errors of the prediction. Save
this as yhatci. The commands
. gen yhatu = yhat + 1.96*yhatci
. gen yhatl = yhat - 1.96*yhatci
will now generate a 95% confidence interval for the prediction.
3. To be more precise you could use the t-distribution rather than hard-coding
1.96. The commands are
. gen ttail = invttail(e(df_r),0.025)
. replace yhatu = yhat + ttail*yhatci
. replace yhatl = yhat - ttail*yhatci
Note that e(df_r) is where Stata stores the residual degrees of freedom and
invttail computes the relevant critical value from the t-distribution. Because
invttail takes an upper-tail probability, the critical value for a 95% interval
is invttail(df, 0.025); replace is used because yhatu and yhatl already exist.
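The postestimation steps above can be run as one short script. This is a sketch under the assumption that the regression of interest is food_exp on income from food.dta; the variable names follow the suggestions in the text, and the scalar name tcrit is an illustrative choice.

```stata
use food.dta, clear
regress food_exp income
predict yhat, xb            // fitted values
predict uhat, residuals     // residuals
predict yhatci, stdp        // standard error of the prediction
* t critical value for a 95% interval (upper-tail probability 0.025)
scalar tcrit = invttail(e(df_r), 0.025)
generate yhatl = yhat - tcrit*yhatci
generate yhatu = yhat + tcrit*yhatci
* Data, fitted line and interval bounds in one graph
twoway (scatter food_exp income) (line yhat yhatl yhatu income, sort)
```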
[Figure: Predictions with 95% Confidence Interval]
Out-of-sample Prediction
Obtaining out-of-sample predictions through the menus is a bit clunky, and using the
command line is probably the way to go. Suppose there are 40 observations in the data
sample and you want to obtain an out-of-sample prediction for a value of the explanatory
variable income = 20. The code is
// add observation to data file
edit
set obs 41
replace income=20 in 41
// obtain prediction
predict yhat0
list income yhat0 in 41
You should explore other visualisation options
Using do files
A nice thing about Stata is that there is a simple way to save all your work steps
so that you or others can easily reproduce your analysis.
The way to do so is to use a so-called do file.
Remember that all Stata does is execute commands, which you either
selected from the menu or typed directly into the Command window.
A command is just one line of text (or code). If you want to save a
command for later use, just copy it (click on it in the Review window
and copy the line of text that comes up in the Command window) and paste
it into the do file.
The next slides describe how you can open and use a do file.
Where to open a new do file
You can open a new do file by clicking on the New Do-file Editor button below
the menu (or by pressing Ctrl+9).
[Figure: New Do-file Editor button]
Using a do file
A do file is just a list of commands. Each command has to start on a new line.
Normally you will start your do file by telling Stata which data to load in the
first line. In the following lines you can then include analysis commands. Blank
lines are no problem. If you want to write comments or text that is not Stata
code, start the line with // or a * symbol; these symbols tell
Stata that the line is not to be executed.
Executing commands with a do file
If you want to re-run a command from the do file, just highlight the line and press
the Execute (do) button (or press Ctrl+D). If you don't select any specific line,
Stata will run all the commands in the currently open do file from first
to last. The results of the command(s) are displayed in the main window as if you
were using the menu.
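To make the description concrete, here is what a small do file might look like. The file name food_analysis.do and the commented-out path are hypothetical; the commands themselves are the ones used earlier with food.dta.

```stata
* food_analysis.do: a small example do file (the file name is hypothetical)
// Both * and // mark comment lines that Stata does not execute

* cd "path/to/your/folder"      // placeholder: set your own working directory
use food.dta, clear

summarize food_exp income
regress food_exp income
twoway (scatter food_exp income) (lfit food_exp income)
```

Besides the Execute button, a saved do file can be run from the Command window by typing do followed by its file name.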
Some Regression Examples Electricity Data
Demand for Residential Electricity
The Excel file elecex.xls has quarterly data on the following variables from
1972:Q2 to 1993:Q4.
RESKWH = electricity sales to residential customers (million kilowatt-hours)
NOCUST = number of customers (thousands)
PRICE = electricity tariff (cents/kWh)
CPI = consumer price index
INCOME = nominal personal income (millions of dollars)
CDD = cooling degree days
HDD = heating degree days
POP = population (thousands)
Import the data into Stata using the Import wizard. Take care to select the
option asking whether to treat the first row as variable names! Once done,
you can save the data as elecex.dta for your own convenience.
Time Series Data
Most multiple regression exercises involve data manipulation. This is where
writing do files is a powerful way of ensuring that you can recover your previous
work and that others can reproduce it.
1. This is time series data, so we need to create a date vector and set dates as the
date vector.
generate dates = tq(1972q2) + _n-1
tsset dates, quarterly
Data Manipulations
1. Generate the dependent variable, the log of electricity sales per customer:
gen LKWH=log(RESKWH/NOCUST)
2. We want to explain this demand in terms of real per capita income, so create
the variable
gen LY=log((100*INCOME)/(CPI*POP))
3. Another important determinant is price; we want to use the real average
cost of electricity
gen LPRICE=log(100*PRICE/CPI)
Getting a Feel for the Data
You should always try to understand your data before beginning to model it. A
useful starting point is the Graphics > Scatterplot matrix option. As the name
suggests, this creates a matrix of scatterplots of the variables against each other.
Hopefully this reveals some pattern in the relationships between the dependent
and explanatory variables and no discernible pattern between the explanatory
variables themselves.
[Figure: Matrix Plots]
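The data preparation for the electricity example can be gathered into a do file. This is a sketch assuming elecex.xls is in the working directory with variable names in its first row; the choice of variables passed to graph matrix is an assumption about which ones the scatterplot matrix should cover.

```stata
* Assumes elecex.xls is in the working directory with variable names in row 1
import excel "elecex.xls", firstrow clear
generate dates = tq(1972q2) + _n - 1
format dates %tq
tsset dates, quarterly
generate LKWH   = log(RESKWH/NOCUST)
generate LY     = log((100*INCOME)/(CPI*POP))
generate LPRICE = log(100*PRICE/CPI)
* Scatterplot matrix; half shows only the lower triangle
graph matrix LKWH LY LPRICE CDD HDD, half
```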
Regression Results
The results from running the linear regression of the base model of demand on
price, income and the weather variables are as follows:
[Figure: regression output for the base model]
ACF and PACF
This is time series data, so one of the problems may be autocorrelation in the
residuals. The autocorrelation function (ACF) and partial autocorrelation function
(PACF) of the residuals look as follows:
[Figure: ACF and PACF of the residuals]
AR(1) Estimation Options
The following dialogue box under Time series > Prais-Winsten regression allows
you to correct for autocorrelation in the residuals.
[Figure: Prais-Winsten regression dialogue box]
AR(1) output
The results from running the linear regression of the AR(1) model of demand on
price, income and the weather variables are as follows:
[Figure: regression output for the AR(1) model]
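The estimation steps of the electricity example might look as follows on the command line. This is a sketch, not the exact specification behind the slides: it assumes the base model regresses LKWH on LY, LPRICE, CDD and HDD, that those variables exist and that the data have been tsset. The residual name ehat is an illustrative choice.

```stata
* Assumes LKWH, LY, LPRICE, CDD and HDD exist and the data have been tsset
regress LKWH LY LPRICE CDD HDD
* Residuals from the base model
predict ehat, residuals
* Autocorrelation and partial autocorrelation functions of the residuals
ac ehat
pac ehat
* AR(1) correction using the Prais-Winsten estimator
prais LKWH LY LPRICE CDD HDD
display "Estimated rho: " e(rho)
```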
Some Regression Examples California Schools Data
California Test Score Data
1. Load the file caschool.dta.
2. Run the regression relating test scores to the student-teacher ratio
$\text{testscr} = \beta_0 + \beta_1\,\text{str} + u$
3. The concern is that this equation suffers from omitted variable bias, which we can
correct using multiple regression. Try relating test scores to the student-teacher
ratio and the percentage of English learners
$\text{testscr} = \beta_0 + \beta_1\,\text{str} + \beta_2\,\text{el\_pct} + u$
Note that the size of the effect of str is halved!
4. Now try adding expenditure per student to the regression
$\text{testscr} = \beta_0 + \beta_1\,\text{str} + \beta_2\,\text{el\_pct} + \beta_3\,\text{expn\_stu} + u$
Presenting Results
This exercise has shown that the coefficient on str in the simple two-variable
model is biased. But the question remains as to how to present this in a
reasonable way so that we can see the pattern immediately. The answer is to store
the results of the regressions and then to use Stata's Postestimation menu item
to help organise the presentation of the results.
Unfortunately this is going to involve estimating the regressions again and then
using
Statistics > Postestimation > Manage estimation results > Store in memory
After each estimation you will need to name your model. Let's be original and call
them Model1, Model2 and Model3. As you do this, watch how Stata echoes your
command and think how easy it would be to use a do file instead.
Table of Estimation Results 1
Show the results: estimates table Model1 Model2 Model3, se
Here both the coefficients and the standard errors of the various models are
summarised in an accessible way (the se option adds the standard errors), and the
reduction in the significance of str is clear.
Table of Estimation Results 2
Further detail on the results: estimates table Model1 Model2 Model3,
star(.05 .01 .001)
This is a particularly useful way of summarising the results as the significant
coefficients are marked. Note how str is insignificant in Model 3. Essentially, the
t-tests on the individual coefficients are interpreted for you!
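As the slides hint, the store-and-tabulate workflow is much shorter in a do file. A minimal sketch, assuming caschool.dta is in the working directory and contains testscr, str, el_pct and expn_stu:

```stata
use caschool.dta, clear
regress testscr str
estimates store Model1
regress testscr str el_pct
estimates store Model2
regress testscr str el_pct expn_stu
estimates store Model3
* Coefficients with standard errors
estimates table Model1 Model2 Model3, se
* Coefficients with significance stars at the 5%, 1% and 0.1% levels
estimates table Model1 Model2 Model3, star(.05 .01 .001)
```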
Joint Signicance Test
Now let's test the hypothesis that the coefficients on both str and expn_stu are zero.
The tests are to be found at:
Statistics > Postestimation > Tests > Test linear hypotheses
Obviously you are going to have to give Stata some information on which
coefficients you wish to test. Once you have selected Test linear hypotheses,
click on Create and the following dialogue box will appear.
[Figure: Test linear hypotheses dialogue box]
Testing Joint Hypotheses
The result shows that the p-value of the F-test of the joint hypothesis that
$\beta_1 = \beta_3 = 0$ is 0.0004, so we would reject the null hypothesis. At least
one of str and expn_stu is a significant factor in the regression.
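The dialogue box corresponds to the test command. A sketch, assuming caschool.dta is in the working directory; after test, the F statistic and its p-value are available in r(F) and r(p).

```stata
use caschool.dta, clear
regress testscr str el_pct expn_stu
* Joint F-test of H0: coefficient on str = 0 and coefficient on expn_stu = 0
test str expn_stu
display "F = " r(F) "   p-value = " r(p)
```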
Some Regression Examples Food Expenditure and Income
Food Data Set
Study the relationship between food expenditure and income.
Run reg food_exp income and plot the residuals.
Functional Form
It may be that a linear relationship between food expenditure and income is
not a good choice.
Let us try to fit a linear-log model.
$\text{food\_exp} = \beta_0 + \beta_1 \ln(\text{income}) + u$
Unfortunately Stata does not accept ln(income) directly in the variable list of
a regression, so you have to generate a new variable, say
gen lincome = log(income)
[Figure: Fitted Values]
Elasticities
Now you can calculate the percentage change in food expenditure given a 1
percent change in income using the marginal effects options on the
Postestimation menu.
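One possible command-line route to the elasticity uses margins. This is a sketch, assuming food.dta is in the working directory. Because the regressor lincome is already the log of income, eydx(lincome) gives d ln(food_exp) / d ln(income); evaluating it at the sample means via atmeans is a choice made here, not something the slides specify.

```stata
use food.dta, clear
generate lincome = log(income)
regress food_exp lincome
* Residuals from the linear-log model, plotted against income
predict ehat, residuals
twoway scatter ehat income, yline(0)
* Elasticity of food expenditure with respect to income at the sample means
margins, eydx(lincome) atmeans
```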
Instrumental Variables Estimation Wage Data
Wage Data
This example looks at wage data. The data file is mroz.dta and the focus is on
modelling the wage of married women only. The variables that are important are
as follows:
educ = years of schooling
wage = estimated wage from earnings and hours
motheduc = mother's years of schooling
fatheduc = father's years of schooling
exper = actual labor market experience
lfp = 1 if in labor force, 1975
Estimating a Wage Equation
Suppose we wish to estimate the equation that relates wages to education and
experience:
$\ln(\text{wage}) = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{exper} + \beta_3\,\text{exper}^2 + u_t.$
The problem is that educ may be correlated with $u_t$ because it is an imperfect
proxy for ability, and that using OLS may therefore result in biased coefficient
estimates.
[Figure: OLS Results]
The IV Estimator
We can now try to estimate the regression by IV using motheduc as an instrument
for educ. A mother's education does not itself belong in the daughter's wage
equation, but it is reasonable to propose that more educated mothers are more
likely to have educated daughters.
Click Statistics > Endogenous covariates > Single-equation
instrumental-variables estimator
This sequence will open a dialogue box which will prompt for more
information such as
1. dependent variable, independent variables, endogenous variables and
instrumental variables;
2. other options for the constant, standard error correction, etc.
[Figure: IV estimation dialogue box]
[Figure: IV Results]
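For comparison with the menus, the OLS and IV regressions can be run with regress and ivregress. This is a sketch under several assumptions: mroz.dta is in the working directory, the sample is restricted to women in the labour force via lfp, and the new variable names lnwage and exper2 are illustrative (choose other names if your copy of the data already contains them).

```stata
use mroz.dta, clear
* Keep women in the labour force, for whom a wage is observed (assumption)
keep if lfp == 1
generate lnwage = ln(wage)
generate exper2 = exper^2
* OLS benchmark
regress lnwage educ exper exper2
estimates store ols
* Two-stage least squares with motheduc as the single instrument for educ
ivregress 2sls lnwage exper exper2 (educ = motheduc)
estimates store iv
* First-stage diagnostics for instrument strength
estat firststage
* Side-by-side coefficients and standard errors
estimates table ols iv, se
```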
Some Observations
1. Although not shown here, motheduc is highly significant in the first-stage
regression of the IV estimation, indicating that it is a strong instrument for educ.
2. The estimated return to education is about 10% lower than the OLS
estimate. This is consistent with our earlier theoretical discussion that the
OLS estimator tends to over-estimate the effect of a variable if that variable
is positively correlated with the omitted factors present in the error term.
3. The standard error on the coefficient on educ is over 2.5 times larger than
the standard error on the OLS estimate. This reflects the fact that even with
a good instrument the IV estimator is not efficient. Of course this situation
can be remedied slightly by adding more valid instruments for educ.
Instrumental Variables Estimation Artificial Data
The Data
The data file ivreg2.dta contains 500 artificially generated observations on x, y,
$z_1$ and $z_2$. The variable y is generated as
$y_t = \beta_0 + \beta_1 x_t + e_t, \quad \beta_0 = 3, \quad \beta_1 = 1$
with
$x \sim N(0, 2), \quad e \sim N(0, 1), \quad \mathrm{cov}(x, e) = 0.9.$
Note that
$\rho_{z_1,x} = 0.5, \quad \rho_{z_2,x} = 0.3.$
Summary of Estimation Results
[Figure: table of estimation results]
The table was generated by using the Postestimation menu option to store results
and create a table.
Hausman Test
To implement the Hausman test, assuming that you have stored the output from
the IV and OLS regressions, click
Postestimation > Tests > Hausman specification test
[Figure: Hausman test output]
This indicates a strong rejection of the null hypothesis of exogeneity, indicating
that cov(x, e) ≠ 0, which we know to be true by construction.
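The whole artificial-data exercise can be imitated without ivreg2.dta by simulating data with similar properties. This is a sketch, not the author's data-generating code: it assumes the reported 0.5 and 0.3 are correlations, that z1 and z2 have unit variance and are uncorrelated with each other and with e, and it uses an arbitrary seed. The correlation between x and e is 0.9/sqrt(2), roughly 0.6364.

```stata
clear
* Correlation matrix for (x, e, z1, z2); z1 and z2 are uncorrelated with e
matrix C = (1, 0.6364, 0.5, 0.3 \ 0.6364, 1, 0, 0 \ 0.5, 0, 1, 0 \ 0.3, 0, 0, 1)
* Standard deviations: var(x) = 2, var(e) = 1, var(z1) = var(z2) = 1 (assumed)
matrix sd = (1.4142, 1, 1, 1)
drawnorm x e z1 z2, n(500) corr(C) sds(sd) seed(12345)
generate y = 3 + x + e

* OLS is inconsistent here because x is correlated with e
regress y x
estimates store ols
* 2SLS using z1 and z2 as instruments for x
ivregress 2sls y (x = z1 z2)
estimates store iv
estimates table ols iv, se
* Hausman test: the consistent estimator (iv) is listed first
hausman iv ols
```

With this design the OLS slope should sit near 1 + 0.9/2 = 1.45 while the IV slope should be close to the true value of 1. The sigmamore option of hausman is often suggested when comparing IV with OLS and may be worth trying.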