
GETTING STARTED WITH STATA

Sébastien Fontenay
ECON - IRES

THE SOFTWARE

Software developed in 1985 by StataCorp


Functionalities
Data management
Statistical analysis
Graphics

Using Stata at UCL


Computer labs

Socrate 30, 31-32, 33, 34, 54 and 68


Dupriez 143
Leclercq 74, 76, 77 and 78

Student licence to install on your personal computer
valid for the duration of your studies, at a price of 20 euros

www.uclouvain.be/438229.html

FINDING SUPPORT (1)

Best documentation
help command
search keyword

Stata website : www.stata.com/support


Frequently Asked Questions
Video tutorials
Statalist

Books
Cahuzac, E., Bontemps, C. (2008). Stata par la pratique: Statistiques, graphiques et éléments de programmation.
Cameron, A.C., Trivedi, P.K. (2009). Microeconometrics using Stata.
Becketti, S. (2013). Time series using Stata.

UCLA : www.ats.ucla.edu/stat/stata

FINDING SUPPORT (2)

For all your questions related to data management or analysis using Stata
Website: http://www.uclouvain.be/411370
Email: sebastien.fontenay@uclouvain.be
By appointment only:
Bâtiment Dupriez (office d010), 3 place Montesquieu

COURSE TOPICS

Quick tour of Stata
  Working environment
  Writing commands

Data management
  Inputting data
  Transforming data

Data analysis
  Descriptive statistics
  Linear regression
  Exporting results

SECTION 1

QUICK TOUR OF STATA

Working environment
Writing commands

WORKING ENVIRONMENT

The working environment is composed of 5 windows
  Results (of commands)
  Variables (list and labels)
  Review (of commands)
  Properties (of variables and dataset)
  Command window

WORKING ENVIRONMENT

Three specific windows can be opened by clicking on the following icons


Data editor/browser
Display data in memory

Viewer
Display log and help files

Do-file Editor
Text editor to save/execute commands

There are 3 main types of files used in Stata


.dta data
.do commands (do-file)
.smcl | .log output (log file)

WORKING ENVIRONMENT

All software functionalities are available from the dropdown menus (Data, Graphics, Statistics)
  Useful when you are unsure of which commands to run or unfamiliar with the available options

Every command issued in this manner is echoed to the Review and Results windows
  e.g. sysuse auto.dta

WORKING ENVIRONMENT

In order to use Stata effectively, you should always follow this three-step process:

Open a do-file

Choose your working directory
  cd "C:\Users\Me"
  mkdir stata_training
  cd stata_training
  - You can see the current working directory at the bottom left of the main window

Start a log file (saving commands and their output)
  log using filename [, text append replace]
  - log close
  - log off | on
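The three steps above can be sketched as the opening lines of a do-file. This is only an illustration: the log name session1 is a placeholder, and the cd line is commented out because the folder must be adapted to your own computer.

```stata
* Hypothetical do-file skeleton (file and folder names are placeholders)
capture log close
* cd "C:\Users\Me\stata_training"
log using session1, text replace
* Work on an example dataset shipped with Stata
sysuse auto, clear
summarize price
* Stop recording commands and output
log close
```

Starting with capture log close avoids an error when a log from an earlier run is still open.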


WRITING COMMANDS

Stata commands use a common syntax:
  [prefix :] command [varlist] [= exp] [if] [in] [, options]
  The square brackets denote qualifiers that are optional
  Italicized words are to be substituted by the user
  - varlist denotes a list of variables
  - exp is a mathematical expression

Stata is case sensitive! (i.e. UPPERCASE != lowercase)

WRITING COMMANDS

Operators may be used to manipulate numerical or string variables

  Arithmetic             Logical      Relational
  +   addition           &   and      >    greater than
  -   subtraction        |   or       <    less than
  *   multiplication     !   not      >=   greater than or equal
  /   division           ~   not      <=   less than or equal
  ^   raised to power                 ==   equal
                                      ~=   not equal
                                      !=   not equal

Note that a double equal sign (==) is used for equality testing

WRITING COMMANDS

Logical and relational operators are particularly useful with if qualifiers to define the sample for analysis
  The if qualifier at the end of a command means the command is to use only the data specified
  command if exp

  list make if foreign==1
  list if make=="Volvo 260"
  list make price if price>=5000 & price<=7000
  list make price if price<5000 | price>7000

Note that character strings are enclosed in double quotes
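As a small illustration of the if qualifier, the sketch below uses the auto example dataset installed with Stata (the same one opened by sysuse auto.dta). Using count instead of list is simply a choice made here to keep the output short.

```stata
sysuse auto, clear
* Number of foreign cars
count if foreign==1
* Cars priced between 5000 and 7000 (both conditions must hold)
count if price>=5000 & price<=7000
* String comparison: the value is enclosed in double quotes
list make price if make=="Volvo 260"
```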

WRITING COMMANDS

You can refer to a range of observation numbers using the following shorthand (f = first, l = last, negative numbers count from the end)

  1/30    1 to 30
  1/l     1 until the last observation
  f/-5    first to 5th observation before the end
  -5/l    last five observations

These ranges are used with the in qualifier to specify the observations to be used
  command in range
  list in f/10
  list in -10/l
  list make price in 74
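The range shorthand can be tried on the auto example dataset, which has 74 observations. This is a minimal sketch; the choice of variables is arbitrary.

```stata
sysuse auto, clear
* First five observations
list make price in 1/5
* Last five observations
list make price in -5/l
* From the first observation to the 5th before the end
count in f/-5
```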

WRITING COMMANDS

The by prefix repeats execution of a command on subsets of the data
  Subsets are groups of observations that take the same value in a given variable (often a categorical variable)
  by varname: command
  - by foreign: list make

If the dataset is not sorted, you should use the bysort prefix instead
  bysort varname: command
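A short sketch of the bysort prefix on the auto example dataset. The variable name n_in_group is made up for this illustration.

```stata
sysuse auto, clear
* Summary statistics of price, separately for domestic and foreign cars
bysort foreign: summarize price
* Within a by-group, _N is the number of observations in that group
bysort foreign: generate n_in_group = _N
```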

SECTION 2

DATA MANAGEMENT

Inputting data
Transforming data

INPUTTING DATA

To open a dataset in Stata format (.dta): use
  use filename [, clear]
  sysuse - open example datasets installed with Stata

To save a dataset in Stata format: save
  save filename [, replace]

Stata can also import/export Excel files (.xls or .xlsx)
  import excel filename [, firstrow]
  export excel filename [, firstrow(variables)]

By default, Stata opens/saves a dataset from/in the current working directory but you can specify
  another directory: use | save "C:\Users\Me\Stata_training\dataset.dta"
  a web address: use http://sites.uclouvain.be/datasupport/data/wage.dta
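The commands above can be chained into a round trip, sketched here with the auto example dataset. The file name auto_copy is a placeholder, and the files are written to the current working directory.

```stata
sysuse auto, clear
* Save a copy in Stata format (replace overwrites an existing file)
save auto_copy, replace
* Export to Excel with variable names in the first row
export excel using auto_copy.xlsx, firstrow(variables) replace
* Read the Excel file back, treating the first row as variable names
import excel using auto_copy.xlsx, firstrow clear
* Reopen the Stata-format copy
use auto_copy, clear
```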

INPUTTING DATA

Summary of the dataset
  describe: information on dataset in memory
  codebook: detailed description of variables

Further explore data in memory
  count: number of observations
  list: display data in the results window

Manipulate variables/observations
  keep wage educ exper
  drop in 1/10
  sort wage


TRANSFORMING DATA

To create a new variable: generate
  generate newvar = exp [if] [in]
  exp may be a number, a character string or a mathematical function

  generate constant = 1
  - Create a constant equal to 1

  generate constant_text = "text"
  - Create a constant that contains the character string "text"

  generate logwage = ln(wage)
  - Create a variable equal to the natural logarithm of wage

  generate expersq = exper^2
  - Create a variable equal to the square of exper
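The same kinds of expressions can be tried on the auto example dataset. The wage data are not assumed to be available here, so price and weight stand in for wage and exper, and the new variable names are illustrative.

```stata
sysuse auto, clear
* A numeric constant
generate constant = 1
* A string constant
generate constant_text = "text"
* A mathematical function of an existing variable
generate logprice = ln(price)
* The square of an existing variable
generate weightsq = weight^2
describe constant constant_text logprice weightsq
```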

TRANSFORMING DATA

To create specific variables using time series operators
  generate lag_gdp = L.gdp
  - Create a variable corresponding to the first lag of gdp

  generate lead_gdp = F.gdp
  - Create a variable corresponding to the first lead of gdp

  generate diff_gdp = D.gdp
  - Create a variable corresponding to the first difference of gdp

Before using these operators, you must tell Stata that you are working with time series data using the command: tsset
  tsset time [, yearly monthly quarterly daily]

Using system variables
  generate gdp_growth = ((gdp[_n] - gdp[_n-1]) / gdp[_n-1])*100
  - Create a variable equal to the growth rate of gdp
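To see the time series operators at work without a real dataset, the sketch below builds an artificial yearly series. The variable names year and gdp and the 2% growth path are assumptions made purely for illustration.

```stata
clear
set obs 10
* Artificial yearly series growing by 2% per year
generate year = 2000 + _n
generate gdp = 100 * 1.02^_n
* Declare the time variable before using L., F. and D.
tsset year, yearly
generate lag_gdp = L.gdp
generate diff_gdp = D.gdp
* Growth rate written with the lag operator instead of [_n-1]
generate gdp_growth = (gdp - L.gdp) / L.gdp * 100
list year gdp lag_gdp diff_gdp gdp_growth
```

The first observation of lag_gdp, diff_gdp and gdp_growth is missing because there is no previous period.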

TRANSFORMING DATA

To modify an existing variable: replace
  replace wage=20 if wage>=20

To rename an existing variable: rename
  rename wage hourly_wage

You can also add a brief description to the variable using labels
  label variable educ "total years of education"
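A minimal sketch of these three commands on the auto example dataset. The cap of 10000 and the name price_capped are illustrative choices, not part of the course material.

```stata
sysuse auto, clear
* Top-code price at 10000, leaving any missing values untouched
replace price = 10000 if price >= 10000 & !missing(price)
rename price price_capped
label variable price_capped "Price, top-coded at 10,000"
summarize price_capped
```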

TRANSFORMING DATA

When transforming data, one must be careful with missing values
  Missing values in Stata are coded with a . (period)
  Stata treats missing values as large numbers, higher than any other value of a given variable

In certain cases you should use the if qualifier to exclude missing values
  generate rich = (wage>15) if wage<.
  or
  generate rich = (wage>15) if wage!=.
  or
  generate rich = (wage>15) if !missing(wage)
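The effect of missing values on comparisons can be seen with rep78 in the auto example dataset, which contains some missing values. The variable name good_record is made up for this sketch.

```stata
sysuse auto, clear
* Missing values count as larger than any number, so they satisfy rep78 > 4
count if rep78 > 4
* Excluding missing values gives the number of observed values above 4
count if rep78 > 4 & !missing(rep78)
* The indicator is left missing where rep78 itself is missing
generate good_record = (rep78 > 4) if !missing(rep78)
tabulate good_record, missing
```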

SECTION 3

DATA ANALYSIS

Descriptive statistics
Linear regression

Exporting results

DESCRIPTIVE STATISTICS

Categorical variables
  One-way table of frequencies
    tabulate female
    - The option [, missing] displays the total frequency of missing observations
  Two-way table of frequencies
    tabulate female married

Continuous variables
  summarize gives the number of observations, the mean, the standard deviation, the minimum and maximum values
    summarize wage educ
    - The option [, detail] displays the main quantiles, the highest and lowest five values, the variance, as well as the skewness and kurtosis measures
  Pearson's correlation coefficient
    correlate varlist [, covariance]
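These descriptive commands can be tried on the auto example dataset. The variables are chosen only because they ship with Stata: foreign and rep78 play the role of the categorical variables, price and mpg that of the continuous ones.

```stata
sysuse auto, clear
* One-way table, including missing values
tabulate rep78, missing
* Two-way table
tabulate foreign rep78
* Basic and detailed summary statistics
summarize price mpg
summarize price, detail
* Correlation matrix
correlate price mpg weight
```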

DESCRIPTIVE STATISTICS

Exploring data with graphs
  Distribution of a continuous variable: histogram
    histogram wage
    - The option [, normal] draws a normal density line on the plot
  Scatter plot between two variables: scatter
    scatter wage educ
  Evolution of time series: tsline
    tsline gdp
    - Available only after tsset


LINEAR REGRESSION

We seek to estimate the relationship between one dependent variable and a set of independent variables using the Ordinary Least Squares (OLS) estimator

Classical linear model assumptions (Wooldridge, 2008):
  Model is linear in parameters
  Data are a random sample of the population
  No perfect collinearity between independent variables
  Zero conditional mean of error term
  Homoskedasticity
  Normality of the residuals

LINEAR REGRESSION

The model we want to estimate:
  $\log(wage) = \beta_0 + \beta_1\,education + \beta_2\,experience + \beta_3\,tenure + u$
  where:
  - wage is average hourly earnings in dollars
  - education is the number of years of education
  - experience is the number of years of labour market experience
  - tenure is the number of years with the current employer

In Stata:
  regress logwage educ exper tenure

LINEAR REGRESSION

Stata output

LINEAR REGRESSION

Analysis of variance
  Sum of Squares (SS)
  - Explained variance (model)
  - Residual variance
  - Total variance
  Degrees of freedom (df)
  Mean Squares (MS)
  - SS divided by df

LINEAR REGRESSION

Overall model fit
  Number of observations
  F-statistic
  p-value associated with the F-statistic
  - testing the null hypothesis that all of the model coefficients are 0
  R-squared
  - proportion of variance in the dependent variable explained by the independent variables
  - SS(model) divided by SS(total)
  Adjusted R-squared
  Standard deviation of the error term (reported by Stata as Root MSE)
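The link between the analysis of variance table and R-squared can be checked from the results that regress stores. The sketch below uses the auto example dataset rather than the wage data, so the model itself is only illustrative.

```stata
sysuse auto, clear
regress price mpg weight
* SS(model) divided by SS(total), where SS(total) = SS(model) + SS(residual)
display e(mss) / (e(mss) + e(rss))
* R-squared as stored by regress
display e(r2)
* Standard deviation of the error term (Root MSE)
display e(rmse)
```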

LINEAR REGRESSION

Parameter estimates
  Dependent variable (1)
  Independent variables and intercept (2)
  Coefficients (3)
  Standard errors (4)
  t-statistics (5)
  p-values associated with the t-statistics (6)
  - testing the null hypothesis that a given coefficient is 0
  95% confidence intervals (7)

LINEAR REGRESSION

Predicting fitted values and residuals
  predict wage_fitted
  e.g. 1.304921 = 0.2843595 + 11*0.092029 + 2*0.0041211 + 0*0.0220672

  predict wage_resid, r
  e.g. -0.1735185 = 1.131402 - 1.304921

       logwage    educ   exper   tenure   wage_fitted   wage_resid
   1   1.131402    11      2       0      1.304921      -0.1735185
   2   1.175573    12     22       2      1.523506      -0.3479329
   3   1.098612    11      2       0      1.304921      -0.2063083
   4   1.791759     8     44      28      1.819802      -0.0280429
   5   1.667707    12      7       2      1.461690       0.2060172
   6   2.169054    16      9       8      1.970451       0.1986027
   7   2.420368    18     15       7      2.157168       0.2631997
   8   1.609438    12      5       3      1.475515       0.1339233
   9   1.280934    12     26       4      1.584125      -0.3031912
  10   2.900322    17     22      21      2.402928       0.4973939
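The arithmetic behind predict can be reproduced by hand from the stored coefficients. This sketch uses the auto example dataset (not the wage data), and the variable names price_fitted, price_resid and by_hand are illustrative.

```stata
sysuse auto, clear
regress price mpg weight
* Fitted values (the default) and residuals
predict price_fitted
predict price_resid, residuals
* Fitted value rebuilt from the intercept and slope coefficients
generate by_hand = _b[_cons] + _b[mpg]*mpg + _b[weight]*weight
list price price_fitted by_hand price_resid in 1/5
```

The residual is the observed value minus the fitted value, as in the worked example above.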

LINEAR REGRESSION

Incorporating categorical information into regression models
  Dummy variables (coded as 0/1) can be included as such in the regression
    regress wage educ exper tenure female

  Categorical variables with more than two categories must be included using the i. prefix
    regress wage educ exper tenure i.region
    Stata will automatically create dummy variables for each category and incorporate them in the regression, except for the reference category
    - You can use the prefix ib#. instead (e.g. ib2.region) to change the reference category
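As an illustration of factor variables, rep78 in the auto example dataset is a categorical variable with five levels. Choosing level 3 as the base is an arbitrary decision made for this sketch.

```stata
sysuse auto, clear
* Default: the lowest level of rep78 is the reference category
regress price mpg weight i.rep78
* Same model with level 3 as the reference category
regress price mpg weight ib3.rep78
```

Only the dummy coefficients and the intercept change between the two regressions; the overall fit is identical.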

LINEAR REGRESSION

Post-estimation tests
  Multicollinearity (Wooldridge, 2008 - chapter 3, p99)
    estat vif
    - Rule of thumb: a variance inflation factor above 10 signals a multicollinearity problem

  Normality of the residuals
    sktest varname
    - testing the null hypothesis that the variable follows a normal distribution
    swilk|sfrancia varname
    - Shapiro-Wilk and Shapiro-Francia tests

  Homoskedasticity (Wooldridge, 2008 - chapter 8)
    estat hettest
    - Breusch-Pagan test, testing the null hypothesis of homoskedasticity
    estat imtest, white
    - White test, testing the null hypothesis of homoskedasticity
    The [, robust] option after regress gives heteroskedasticity-robust standard errors

  F-test: testing that a group of variables has no effect on the dependent variable, i.e. a joint hypotheses test (Wooldridge, 2008 - chapter 4, p143)
    test var1 var2
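The tests above can be run in sequence after a regression. The sketch below uses the auto example dataset, so the model and the residual variable name resid are illustrative only.

```stata
sysuse auto, clear
regress price mpg weight length
* Multicollinearity
estat vif
* Homoskedasticity
estat hettest
estat imtest, white
* Normality tests are applied to the saved residuals
predict resid, residuals
sktest resid
swilk resid
* Joint test that mpg and length have no effect
test mpg length
* Same model with heteroskedasticity-robust standard errors
regress price mpg weight length, robust
```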


EXPORTING RESULTS

outreg2 makes it easy to export the results of one or several regressions
  to Microsoft Office applications: Word, Excel
  to LaTeX

  outreg2 [estlist] using filename [, word excel tex]
  [estlist] refers to the list of estimation results previously saved using the command: estimates store estname

EXPORTING RESULTS

  regress logwage educ
  estimates store est1

  regress logwage educ exper tenure
  estimates store est2

  regress logwage educ exper tenure female
  estimates store est3

  outreg2 [est1 est2 est3] using output, word
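outreg2 is a user-written command, so it normally has to be installed once before the example above will run. The sketch below assumes internet access to the SSC archive and uses the auto example dataset; the names m1, m2 and auto_models are placeholders.

```stata
* One-time installation from the SSC archive (requires internet access)
ssc install outreg2, replace
sysuse auto, clear
regress price mpg
estimates store m1
regress price mpg weight
estimates store m2
* Export both sets of results side by side to a Word-readable file
outreg2 [m1 m2] using auto_models, word replace
```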
