IRS Stata

GETTING
STARTED WITH
STATA
Sébastien Fontenay
ECON - IRES
THE
SOFTWARE
Software developed in 1985 by StataCorp

Functionalities
Data management
Statistical analysis
Graphics
Using Stata at UCL

Computer labs
Socrate 30, 31-32, 33, 34, 54 and 68

Dupriez 143
Leclercq 74, 76, 77 and 78
Student licence to install on your personal computer
valid during all your studies at the price of 20 euros
www.uclouvain.be/438229.html
FINDING
SUPPORT (1)
Best documentation
help command
search keyword
Stata website : www.stata.com/support

Frequently Asked Questions
Video tutorials
Statalist
Books
Cahuzac, E., Bontemps, C. (2008). Stata par la pratique: Statistiques,
graphiques et éléments de programmation.
Cameron, A.C., Trivedi, P.K. (2009). Microeconometrics using Stata.
Becketti, S. (2013). Time series using Stata.
UCLA : www.ats.ucla.edu/stat/stata
FINDING
SUPPORT (2)
For all your questions related to data management or analysis using Stata
Website: http://www.uclouvain.be/411370
Email: sebastien.fontenay@uclouvain.be
By appointment only:
Bâtiment Dupriez (office d010), 3 place Montesquieu
COURSE
TOPICS
Quick tour of Stata
  Working environment
  Writing commands
Data management
  Inputting data
  Transforming data
Data analysis
  Descriptive statistics
  Linear regression
  Exporting results
SECTION 1
QUICK TOUR OF STATA
Working environment
Writing commands
WORKING
ENVIRONMENT
The working environment is composed of 5 windows
Results: output of commands
Variables: variable list and labels
Review: history of commands
Properties: properties of variables and dataset
Command: window for typing commands
WORKING
ENVIRONMENT
Three specific windows can be opened by clicking on the following icons

Data editor/browser
Display data in memory
Viewer
Display log and help files
Do-file Editor
Text editor to save/execute commands
There are 3 main types of files used in Stata

.dta data
.do commands (do-file)
.smcl | .log output (log file)
WORKING
ENVIRONMENT
All software functionalities are available from the dropdown menus (Data, Graphics, Statistics)
Useful when you are unsure of the commands to run or unfamiliar with the available options
Every command issued in this manner is echoed to the Review and Results windows
e.g. sysuse auto.dta
WORKING
ENVIRONMENT
In order to use Stata effectively, you should always follow this three-step process:
1. Open a do-file
2. Choose your working directory
   cd "C:\Users\Me"
   mkdir stata_training
   cd stata_training
   You can see the current working directory at the bottom left of the main window
3. Start a log file (saving commands and their output)
   log using filename [, text append replace]
   log close
   log off | on
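The three steps can be combined at the top of a do-file. The following is a minimal sketch, not a prescribed template: the path and the log name session1 are hypothetical, and the directory is assumed to exist already.

```stata
* Step 1: save these lines in a do-file (session1.do is a hypothetical name)

* Step 2: choose the working directory (adjust the path to your own machine)
cd "C:\Users\Me\stata_training"

* Step 3: start a plain-text log; close any log left open by a previous run
capture log close
log using session1, text replace

* Commands run from here on are recorded in session1.log
sysuse auto, clear
describe

* Stop recording
log close
```

The replace option overwrites an earlier log with the same name; use append instead to keep adding to it.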
SECTION 1
QUICK TOUR OF STATA
Working environment
Writing commands
WRITING
COMMANDS
Stata commands use a common syntax:

[prefix :] command [varlist] [= exp] [if] [in] [, options]
The square brackets denote qualifiers that are optional
Italicized words are to be substituted by the user
varlist denotes a list of variables

exp is a mathematical expression
Stata is case sensitive! (i.e. UPPERCASE != lowercase)
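As an illustration of how the syntax elements fit together, here is a short sketch using the auto example dataset shipped with Stata. The variable price_k is a hypothetical name introduced only for this example.

```stata
sysuse auto, clear

* prefix  : bysort foreign
* command : summarize
* varlist : price mpg
* if      : price < 10000
* options : detail
bysort foreign: summarize price mpg if price < 10000, detail

* [= exp] appears in commands that assign values, such as generate;
* the in qualifier restricts the command to observations 1 to 10
generate price_k = price / 1000 in 1/10
```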
WRITING
COMMANDS
Operators may be used to manipulate numerical or string variables
Arithmetic            Logical       Relational
+   addition          &   and       >    greater than
-   subtraction       |   or        <    less than
*   multiplication    !   not       >=   greater than or equal
/   division          ~   not       <=   less than or equal
^   raised to power                 ==   equal
                                    ~=   not equal
                                    !=   not equal
Pay attention that a double equal sign (==) is used for equality
testing
WRITING
COMMANDS
Logical and relational operators are particularly useful with if

qualifiers to define the sample for analysis
The if qualifier at the end of a command means the command is to
use only the data specified
command if exp
list make if foreign==1
list if make=="Volvo 260"
list make price if price>=5000 & price<=7000
list make price if price<5000 | price>7000
Note that character strings are enclosed in double quotes
WRITING
COMMANDS
You can refer to a list of numbers using the following shorthand

1/30    1 to 30
1/l     1 to the last number
f/-5    first to the 5th number before the end
-5/l    last five numbers
Numlists are particularly useful with the in qualifiers to specify a

range of observations to be used
command in range
list in f/10
list in -10/l
list make price in 74
WRITING
COMMANDS
The by prefix repeats execution of a command on subsets of the data

subsets are groups of observations that take the same value in a given
variable (often a categorical variable)
by varname: command
by foreign: list make
If the dataset is not sorted, you should use the bysort prefix instead
bysort varname: command
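A brief sketch of the difference between the two prefixes, using the auto example dataset:

```stata
sysuse auto, clear

* by requires the data to be sorted on the grouping variable
sort foreign
by foreign: summarize price

* bysort sorts the data and runs the command in a single step
bysort foreign: summarize mpg
```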
SECTION 2
DATA MANAGEMENT
Inputting data
Transforming data
INPUTTING
DATA
To open a dataset in Stata format (.dta): use

use filename [, clear]
sysuse - open example datasets installed with Stata
To save a dataset in Stata format: save

save filename [, replace]
Stata can also import/export Excel files (.xls or .xlsx)

import excel filename [, firstrow]
export excel filename [, firstrow(variables)]
By default, Stata opens/saves a dataset from/in the current working

directory but you can specify
another directory: use | save "C:\Users\Me\Stata_training\dataset.dta"
a web address: use http://sites.uclouvain.be/datasupport/data/wage.dta
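The commands above can be chained into a round trip between Stata and Excel formats. This is a sketch under the assumption that the working directory is writable; auto_copy is a hypothetical file name.

```stata
sysuse auto, clear

* Save a copy in Stata format in the current working directory
save auto_copy, replace

* Export to Excel, writing variable names in the first row
export excel using auto_copy.xlsx, firstrow(variables) replace

* Read the Excel file back, treating the first row as variable names
import excel using auto_copy.xlsx, firstrow clear
describe
```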
INPUTTING
DATA
Summary of the dataset

describe: information on dataset in memory
codebook: detailed description of variables
Further explore data in memory

count: number of observations
list: display data in the results window
Manipulate variables/observations
keep wage educ exper
drop in 1/10
sort wage
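The same commands can be tried on the auto example dataset, which contains 74 observations. A minimal sketch:

```stata
sysuse auto, clear

* Summarize and explore the dataset
describe
codebook price foreign
count
list make price in 1/5

* Keep four variables, drop the first ten observations, sort by price
keep make price mpg foreign
drop in 1/10
sort price

* 74 observations minus the 10 dropped
assert _N == 64
```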
SECTION 2
DATA MANAGEMENT
Inputting data
Transforming data
TRANSFORMING
DATA
To create a new variable: generate

generate newvar = exp [if] [in]
exp may be a number, a character string or a mathematical function
generate constant = 1
Create a constant equal to 1
generate constant_text = "text"
Create a constant that contains the character string "text"
generate logwage = ln(wage)
Create a variable equal to the natural logarithm of wage
generate expersq = exper^2
Create a variable equal to the square of exper
TRANSFORMING
DATA
To create specific variables using time series operators

generate lag_gdp = L.gdp
Create a variable corresponding to the first lag of gdp
generate lead_gdp = F.gdp

Create a variable corresponding to the first lead of gdp
generate diff_gdp = D.gdp

Create a variable corresponding to the first difference of gdp
Before using these operators, you must tell Stata that you are working with time series
data using the command: tsset
tsset time [, yearly monthly quarterly daily]
Using system variables

generate gdp_growth = ((gdp[_n] - gdp[_n-1]) / gdp[_n-1])*100
Create a variable equal to the growth rate of gdp
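To see that the time series operators and the _n subscripts give the same growth rate, here is a small sketch on simulated data. The series is invented purely for illustration, and growth_ts and growth_n are hypothetical variable names.

```stata
clear
set obs 5

* Illustrative yearly series growing by about 2% per year
generate year = 2000 + _n
generate gdp = 100 * 1.02^_n

* Declare the time variable before using L., F. or D.
tsset year, yearly

generate lag_gdp  = L.gdp
generate diff_gdp = D.gdp

* Growth rate written with operators and with explicit subscripts
generate growth_ts = (D.gdp / L.gdp) * 100
generate growth_n  = ((gdp[_n] - gdp[_n-1]) / gdp[_n-1]) * 100

* Both are missing for the first year and agree afterwards
assert reldif(growth_ts, growth_n) < 1e-6 if _n > 1
list year gdp lag_gdp diff_gdp growth_ts growth_n
```

The two versions coincide here because the data are sorted by year with no gaps. With gaps in the time variable, L.gdp refers to the previous period while gdp[_n-1] refers to the previous row.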
TRANSFORMING
DATA
To modify an existing variable: replace

replace wage=20 if wage>=20
To rename an existing variable: rename

rename wage hourly_wage
You can also add a brief description to the variable using labels
label variable educ "total years of education"
TRANSFORMING
DATA
When transforming data, one must be careful with missing values

Missing values in Stata are coded with a . (period)
Stata treats missing values as large numbers, higher than any other
values of a given variable
In certain cases you should use the if qualifier to exclude missing values
generate rich = (wage>15) if wage<.
|or|
generate rich = (wage>15) if wage!=.
|or|
generate rich = (wage>15) if !missing(wage)
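The effect of missing values on comparisons can be checked with the auto example dataset, where the variable rep78 has some missing values. A sketch, with good_rep as a hypothetical variable name:

```stata
sysuse auto, clear

* Number of observations with a missing repair record
count if missing(rep78)

* Without a guard, missing values satisfy rep78 > 4
* because . is treated as larger than any number
count if rep78 > 4

* With the guard, only observed values are counted
count if rep78 > 4 & !missing(rep78)

* The if qualifier leaves good_rep missing where rep78 is missing
generate good_rep = (rep78 > 4) if !missing(rep78)
tabulate good_rep, missing
```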
SECTION 3
DATA ANALYSIS
Descriptive statistics
Linear regression
Exporting results
DESCRIPTIVE
STATISTICS
Categorical variables
One-way table of frequencies
tabulate female
The option [, missing] displays the total frequency of missing observations
Two-way table of frequencies

tabulate female married
Continuous variables
summarize gives the number of observations, the mean, the standard
deviation, the minimum and maximum values
summarize wage educ
The option [, detail] displays the main quantiles, the five highest and lowest values, the
variance, as well as the skewness and kurtosis measures
Pearson's correlation coefficient

correlate varlist [, covariance]
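A short sketch of these commands on the auto example dataset:

```stata
sysuse auto, clear

* One-way and two-way frequency tables
tabulate foreign
tabulate rep78, missing
tabulate foreign rep78

* Summary statistics, with and without the detail option
summarize price mpg
summarize price, detail

* Correlation matrix, then covariance matrix
correlate price mpg weight
correlate price mpg weight, covariance
```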
DESCRIPTIVE
STATISTICS
Exploring data with graphs

Distribution of a continuous variable: histogram
histogram wage
The option [, normal] draws a normal density line on the plot
Scatter plot between two variables: scatter

scatter wage educ
Evolution of time series: tsline

tsline gdp
Available only after tsset
SECTION 3
DATA ANALYSIS
Descriptive statistics
Linear regression
Exporting results
LINEAR
REGRESSION
We seek to estimate the relationship between one dependent variable

and a set of independent variables
using the Ordinary Least Squares (OLS) estimator
Classical linear model assumptions (Wooldridge, 2008):
Model is linear in parameters

Data are a random sample of the population
No perfect collinearity between independent variables
Zero conditional mean of error term
Homoskedasticity
Normality of the residuals
LINEAR
REGRESSION
The model we want to estimate:

$\log(wage) = \beta_0 + \beta_1\,education + \beta_2\,experience + \beta_3\,tenure + u$
where:
wage is average hourly earnings in dollars
education is the number of years of education
experience is the number of years of labour market experience
tenure is the number of years with the current employer
In Stata:
regress logwage educ exper tenure
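The same workflow can be tried without the wage data. The sketch below uses the auto example dataset as a stand-in, so the variables differ from the model above, and logprice is a hypothetical variable name.

```stata
sysuse auto, clear

* Log-transform the dependent variable, as with logwage
generate logprice = ln(price)

* OLS regression of logprice on three regressors
regress logprice mpg weight length

* Estimation results remain available after the command
display "R-squared:          " e(r2)
display "Coefficient on mpg: " _b[mpg]
display "Std. error on mpg:  " _se[mpg]
```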
LINEAR
REGRESSION
Stata output
LINEAR
REGRESSION
Analysis of variance
Sum of Squares (SS)
Explained variance (model)
Residual variance
Total variance
Degrees of freedom (df)
Mean Squares (MS)
SS divided by df
LINEAR
REGRESSION
Overall model fit

Number of observations
F-statistic
p-value associated with the F-statistic
testing the null hypothesis that all of the model
coefficients are 0
R-squared
proportion of variance in the dependent variable
explained by the independent variables
SS(model) divided by SS(total)
Adjusted R-squared
Root MSE
estimated standard deviation of the error term
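These fit statistics can be recomputed from the results that regress stores in e(). A sketch on the auto example dataset, used here as a stand-in for the wage data:

```stata
sysuse auto, clear
regress price mpg weight

* e(mss) is the model sum of squares, e(rss) the residual sum of squares
display "SS(model)/SS(total): " e(mss) / (e(mss) + e(rss))
display "R-squared:           " e(r2)
assert reldif(e(r2), e(mss) / (e(mss) + e(rss))) < 1e-10

* Root MSE is the square root of the residual mean square
assert reldif(e(rmse), sqrt(e(rss) / e(df_r))) < 1e-10

* F-statistic and its p-value under the null that all slopes are 0
display "F:       " e(F)
display "p-value: " Ftail(e(df_m), e(df_r), e(F))
```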
LINEAR
REGRESSION
Parameter estimates
Dependent variable (1)
Independent variables and intercept (2)
Coefficients (3)
Standard errors (4)
t-statistics (5)
p-values associated with the t-statistics (6)
testing the null hypothesis that a given coefficient is 0
95% confidence intervals (7)
LINEAR
REGRESSION
Predicting fitted values and residuals

predict wage_fitted
e.g. 1,304921 = 0,2843595 + 11*0,092029 + 2*0,0041211 + 0*0,0220672
predict wage_resid, residuals
e.g. -0,1735185 = 1,131402 - 1,304921

      logwage   educ  exper  tenure  wage_fitted  wage_resid
 1   1,131402    11      2      0     1,304921   -0,1735185
 2   1,175573    12     22      2     1,523506   -0,3479329
 3   1,098612    11      2      0     1,304921   -0,2063083
 4   1,791759     8     44     28     1,819802   -0,0280429
 5   1,667707    12      7      2     1,461690    0,2060172
 6   2,169054    16      9      8     1,970451    0,1986027
 7   2,420368    18     15      7     2,157168    0,2631997
 8   1,609438    12      5      3     1,475515    0,1339233
 9   1,280934    12     26      4     1,584125   -0,3031912
10   2,900322    17     22     21     2,402928    0,4973939
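The arithmetic behind predict can be reproduced by hand from the stored coefficients. The sketch below uses the auto example dataset as a stand-in for the wage data; price_fitted, price_resid and fitted_manual are hypothetical variable names.

```stata
sysuse auto, clear
regress price mpg weight

* Fitted values and residuals, stored in double precision
predict double price_fitted, xb
predict double price_resid, residuals

* Fitted value = intercept + sum of coefficient times regressor
generate double fitted_manual = _b[_cons] + _b[mpg]*mpg + _b[weight]*weight
assert reldif(price_fitted, fitted_manual) < 1e-8

* Residual = observed value minus fitted value
assert abs(price_resid - (price - price_fitted)) < 1e-6

list price mpg weight price_fitted price_resid in 1/5
```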
LINEAR
REGRESSION
Incorporating categorical information into regression models

Dummy variables (coded as 0/1) can be included as such in the
regression
regress wage educ exper tenure female
Categorical variables with more than two categories must be included

using the i. prefix
regress wage educ exper tenure i.region
Stata will automatically create dummy variables for each category and
incorporate them in the regression except the reference category
You can use the prefix ib#. instead to change the reference category, e.g. ib2.region makes category 2 the reference
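A sketch of factor-variable notation on the auto example dataset, where rep78 takes the values 1 to 5 and foreign is a 0/1 dummy:

```stata
sysuse auto, clear

* A 0/1 dummy can enter the regression directly
regress price mpg foreign

* i. creates one dummy per category of rep78, omitting the lowest as reference
regress price mpg i.rep78

* ib3. makes category 3 the reference instead
regress price mpg ib3.rep78
```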
LINEAR
REGRESSION
Post-estimation tests
Multicollinearity (Wooldridge, 2008 - chapter 3, p99)
estat vif
Rule of thumb: a variance inflation factor above 10 signals a multicollinearity problem
Normality of the residuals
sktest varname
Skewness and kurtosis test, testing the null hypothesis that the variable is normally distributed
swilk | sfrancia varname
Shapiro-Wilk and Shapiro-Francia tests
Homoskedasticity (Wooldridge, 2008 - chapter 8)
estat hettest
Breusch-Pagan test, testing the null hypothesis of homoskedasticity
estat imtest, white
White test, testing the null hypothesis of homoskedasticity
The [, robust] option after regress gives heteroskedasticity-robust standard errors
F-test of joint hypotheses: testing that a group of variables has no effect on the dependent
variable (Wooldridge, 2008 - chapter 4, p143)
test var1 var2
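The tests above can be run in sequence after a single regression. A sketch on the auto example dataset, used as a stand-in for the wage data, with resid as a hypothetical variable name:

```stata
sysuse auto, clear
regress price mpg weight length

* Multicollinearity: variance inflation factors
estat vif

* Homoskedasticity: Breusch-Pagan and White tests
estat hettest
estat imtest, white

* Normality: the tests are applied to the saved residuals
predict double resid, residuals
sktest resid
swilk resid

* Joint F-test that mpg and length have no effect
test mpg length

* Re-estimate with heteroskedasticity-robust standard errors
regress price mpg weight length, robust
```

The robust regression is placed last because estat hettest is meant to follow a regression with conventional standard errors.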
SECTION 3
DATA ANALYSIS
Descriptive statistics
Linear regression
Exporting results
EXPORTING
RESULTS
outreg2 makes it easy to export the results of one or several regressions
to Microsoft Office applications: Word, Excel
to LaTeX
outreg2 [estlist] using filename [, word excel tex]
[estlist] refers to the list of estimation results previously saved using the
command: estimates store estname
EXPORTING
RESULTS
regress logwage educ
estimates store est1
regress logwage educ exper tenure
estimates store est2
regress logwage educ exper tenure female
estimates store est3
outreg2 [est1 est2 est3] using output, word
