
Advanced Econometrics
Dr. Uma Kollamparambil

Today's Agenda
A quick round-up of basic econometrics
Econometric modelling

Regression analysis
Theory specifies the functional relationship; regression analysis measures that relationship to arrive at values of a and b.

Y = a + bX + e

Components: dependent and independent variables, intercept (a), coefficient (b), and error term (e). A regression is simple or multivariate according to the number of independent variables.

Requirements
Model specification: the relationship between dependent and independent variables; use a scatter plot to specify the function that best fits the scatter.
Sufficient data for estimation: cross-sectional, time-series, or panel.

Some Important terminology

Least squares regression: Y = a + bX + e
Estimation: point estimates and interval estimates
Inference: t-statistic, R-square (coefficient of determination), F-statistic

Estimation -- OLS
[Scatter plot: sample data points with a fitted OLS regression line]

Ordinary Least Squares (OLS)


We have a set of data points and want to fit a line to the data. The most efficient estimator can be shown to be OLS, which minimizes the sum of squared distances between the line and the actual data points.

How do we estimate a and b in the linear equation? The OLS estimator solves:

\min_{a,b} \sum_i (Y_i - a - b X_i)^2

This minimization problem can be solved using calculus. The result is the OLS estimators of a and b

Regression Analysis -- OLS


The basic equation: Y_i = a + b X_i + \varepsilon_i

OLS estimator of b: \hat{b} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}

OLS estimator of a: \hat{a} = \bar{Y} - \hat{b}\,\bar{X}

Here a hat denotes an estimator, and a bar a sample mean.

Regression Analysis -- OLS


Period   Demand (Y)   Price (X)
1          410           1
2          370           5
3          240           8
4          230          10
5          160          15
6          150          23
7          100          25

Coefficients: Intercept = 384.98, Price (X) = -11.89

These are the estimated coefficients for the data above.
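As a check, these coefficients can be reproduced with the closed-form formulas from the previous slide. A minimal numpy sketch (illustrative, not part of the original slides):

```python
import numpy as np

# Demand (Y) and Price (X) from the table above
Y = np.array([410, 370, 240, 230, 160, 150, 100], dtype=float)
X = np.array([1, 5, 8, 10, 15, 23, 25], dtype=float)

# OLS estimators:
#   b_hat = sum((X - X_bar)(Y - Y_bar)) / sum((X - X_bar)^2)
#   a_hat = Y_bar - b_hat * X_bar
b_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a_hat = Y.mean() - b_hat * X.mean()
print(a_hat, b_hat)  # approximately 384.98 and -11.9, matching the slide
```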

Regression Analysis -- Inference


R^2 = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2}

S_b = \sqrt{\frac{\sum_i (Y_i - \hat{Y}_i)^2 / (n - k)}{\sum_i (X_i - \bar{X})^2}}
Here, R-squared is a measure of the goodness of fit of our model, while the standard error of b gives us a measure of confidence for our estimate of b.
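These formulas can be verified numerically on the demand data; a minimal numpy sketch (not from the slides):

```python
import numpy as np

Y = np.array([410, 370, 240, 230, 160, 150, 100], dtype=float)
X = np.array([1, 5, 8, 10, 15, 23, 25], dtype=float)
n, k = len(Y), 2  # k = number of estimated parameters (a and b)

b_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a_hat = Y.mean() - b_hat * X.mean()
Y_fit = a_hat + b_hat * X

# R-squared: explained variation over total variation
r2 = np.sum((Y_fit - Y.mean()) ** 2) / np.sum((Y - Y.mean()) ** 2)
# Standard error of b: residual variance over variation in X
s_b = np.sqrt(np.sum((Y - Y_fit) ** 2) / (n - k) / np.sum((X - X.mean()) ** 2))
print(r2, s_b)
```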

Regression Analysis -- Confidence


SUMMARY OUTPUT

Regression Statistics
Multiple R          0.976786811
R Square            0.954112475
Adjusted R Square   0.94493497
Standard Error      27.08645377
Observations        7

ANOVA
             df   SS            MS         F          Significance F
Regression   1    76274.47725   76274.48   103.9621   0.000155729
Residual     5    3668.379888   733.676
Total        6    79942.85714

             Coefficients   Standard Error   t Stat     P-value
Intercept    87.10614525    17.92601129      4.859204   0.004636
Q (X)        12.2122905     1.19773201       10.19618   0.000156

These are the goodness-of-fit measures reported by Excel for our example data.

Hypothesis testing
Hypothesis formulation, then test:
Confidence-interval method: construct an interval around the estimated b at the desired confidence level using the SE of b, and check whether the hypothesized value falls within it. If it does, do not reject the null hypothesis.
Test-of-significance method: compute the t-value of b and compare it with the critical (table) t-value. If the former is less than the latter, do not reject the null hypothesis.

Hypothesis testing
(Excel summary output repeated from the previous slide.)

t = \hat{b} / S_b is the t-ratio. Combined with critical values from a Student's t distribution, this ratio tells us how confident we can be that a coefficient is significantly different from zero.
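For the demand example, the t-ratio and its critical value can be computed directly; a minimal sketch with numpy and scipy (illustrative, not from the slides):

```python
import numpy as np
from scipy import stats

# Continuing the demand example: t-ratio for the slope estimate
Y = np.array([410, 370, 240, 230, 160, 150, 100], dtype=float)
X = np.array([1, 5, 8, 10, 15, 23, 25], dtype=float)
n, k = len(Y), 2
b_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
resid = Y - (Y.mean() - b_hat * X.mean()) - b_hat * X
s_b = np.sqrt(np.sum(resid ** 2) / (n - k) / np.sum((X - X.mean()) ** 2))

t_ratio = b_hat / s_b
t_crit = stats.t.ppf(0.975, df=n - k)           # 5% two-sided critical value
p_val = 2 * stats.t.sf(abs(t_ratio), df=n - k)
print(t_ratio, t_crit, p_val)                   # reject H0: b = 0 if |t_ratio| > t_crit
```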

Analysis of Variance: F ratio


F ratio tests the overall significance of the regression.
F = \frac{\text{Explained variation}/(k-1)}{\text{Unexplained variation}/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}
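The second form needs only the R-squared; a one-off check (values taken from the Excel output above):

```python
# Overall F from R-squared alone: F = (R^2/(k-1)) / ((1-R^2)/(n-k))
r2, n, k = 0.954112475, 7, 2     # values from the Excel summary output
F = (r2 / (k - 1)) / ((1 - r2) / (n - k))
print(F)                         # approx. 103.96, matching the ANOVA table
```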

The F ratio also tests the marginal contribution of a new variable:

F = \frac{(ESS_{new} - ESS_{old}) / (\text{no. of new } X\text{'s})}{RSS_{new} / (n - \text{no. of parameters in the new model})}

Tests for structural change in data


F = \frac{(RSS_R - RSS_{UR})/k}{RSS_{UR}/(n_1 + n_2 - 2k)}
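A sketch of this structural-change (Chow) test on synthetic data; the two regimes, coefficients, and sample sizes are invented for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Chow test: F = ((RSS_R - RSS_UR)/k) / (RSS_UR/(n1 + n2 - 2k))
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=40), rng.normal(size=40)
y1 = 1.0 + 2.0 * x1 + rng.normal(size=40)        # regime 1
y2 = 3.0 + 0.5 * x2 + rng.normal(size=40)        # regime 2 (structural break)

def rss(y, x):
    return sm.OLS(y, sm.add_constant(x)).fit().ssr

k = 2                                            # parameters per regime
rss_r = rss(np.concatenate([y1, y2]), np.concatenate([x1, x2]))  # pooled (restricted)
rss_ur = rss(y1, x1) + rss(y2, x2)               # separate fits (unrestricted)
F = ((rss_r - rss_ur) / k) / (rss_ur / (len(y1) + len(y2) - 2 * k))
print(F)                                         # large F -> evidence of structural change
```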

Multivariate regression
The model y = X\beta + u in matrix form:

\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} =
\begin{bmatrix} 1 & X_{21} & \cdots & X_{k1} \\ 1 & X_{22} & \cdots & X_{k2} \\ \vdots & \vdots & & \vdots \\ 1 & X_{2n} & \cdots & X_{kn} \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} +
\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}

OLS estimator: \hat{\beta} = (X'X)^{-1} X'y
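The matrix formula translates directly into code; a minimal numpy sketch with synthetic data (illustrative, not from the slides):

```python
import numpy as np

# Matrix form of OLS: beta_hat = (X'X)^{-1} X'y
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

# Solving the normal equations is numerically safer than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                # close to beta_true
```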

Assumptions of OLS regression


The model is correctly specified and is linear in parameters.
X values are fixed in repeated sampling; y values are continuous and stochastic.
Each u_i is normally distributed with mean E(u_i) = 0.
The u_i have equal variance (homoscedasticity).
No autocorrelation: no correlation between u_i and u_j.
Zero covariance between X_i and u_i.
No multicollinearity: Cov(X_i, X_j) = 0 in multivariate regression.
Under the assumptions of the CNLRM, the estimates are BLUE.

Regression Analysis: Some Problems

Autocorrelation: covariance between error terms.
- Identification: Durbin-Watson d test; d ranges from 0 to 4, with values near 2 indicating no autocorrelation.
- Consequences: R2 is overestimated; t and F tests are misleading.
- Remedies: correctly specify any missed variable; consider an AR scheme.
Heteroscedasticity: non-constant variance.
- Detection: scatter plot of error terms, Park test, Goldfeld-Quandt test, White test, etc.
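Both diagnostics are available in statsmodels; a sketch on synthetic heteroscedastic data (the durbin_watson and het_white helpers are statsmodels functions; the data are invented):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_white

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200) * (1 + np.abs(x))  # heteroscedastic errors
exog = sm.add_constant(x)
res = sm.OLS(y, exog).fit()

print(durbin_watson(res.resid))         # near 2 suggests no first-order autocorrelation
lm_stat, lm_pval, f_stat, f_pval = het_white(res.resid, exog)
print(lm_pval)                          # small p-value -> reject homoscedasticity
```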

Regression Analysis: Some Problems (cont'd)

Heteroscedasticity (cont'd):
- Consequences: t and F tests are misleading.
- Remedial measures include transformation of variables through WLS.
Multicollinearity: covariance between the various X variables.
- Detection: high R2 but insignificant t tests; high pair-wise correlation between explanatory variables.
- Consequences: t and F tests are misleading.
- Remedies: remove model over-specification, use pooled data, transform variables.
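Multicollinearity can also be gauged with variance inflation factors; an illustrative sketch (the threshold of 10 is a common rule of thumb, not from the slides):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # x2 nearly collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))

# VIF for each slope regressor (skip the constant in column 0)
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)                                 # VIF > 10 is a common warning sign
```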

Model Specification

Sources of misspecification:
- Omission of a relevant variable
- Inclusion of unnecessary variables
- Wrong functional form
- Errors of measurement
- Incorrect specification of the stochastic error term

Model Specification Errors: Omitting Relevant Variables and Including Irrelevant Variables
To properly estimate a regression model, we need to specify the correct model. A typical specification error occurs when the estimated model does not include the correct set of explanatory variables. This specification error takes two forms:
- Omitting one or more relevant explanatory variables
- Including one or more irrelevant explanatory variables
Either form of specification error results in problems with OLS estimates.

Model Specification Errors: Omitting Relevant Variables


Example: a two-factor model of stock returns. Suppose that the true model explaining a particular stock's returns is a two-factor model with GDP growth and the inflation rate as factors:

r_t = \beta_0 + \beta_1 GDP_t + \beta_2 INF_t + \varepsilon_t

Suppose instead that we estimate the following model:

r_t = \beta_0 + \beta_1 GDP_t + \epsilon_t

Thus, the error term of this model is actually \epsilon_t = \beta_2 INF_t + \varepsilon_t. If there is any correlation between the omitted variable (INF) and the included explanatory variable (GDP), there is a violation of the classical assumption Cov(u_i, X_i) = 0.

Model Specification Errors: Omitting Relevant Variables


This means that the explanatory variable and the error term are correlated. In that case, the OLS estimate of \beta_1 (the coefficient of GDP) will be biased. As in the example above, it is highly likely that there will be some correlation between two financial (or economic) variables. If, however, the correlation is low or the true coefficient of the omitted variable is zero, the specification error is small.

Consequences of omitting a relevant variable:
- When Cov(X1, X2) ≠ 0, estimates of both the constant and the slope are biased, and the bias persists even in larger samples.
- When Cov(X1, X2) = 0, the constant is biased but the slope is unbiased.
- The variance of the error is incorrectly estimated; consequently, the variance of the slope estimator is biased.
- This leads to misleading conclusions from confidence intervals and hypothesis tests about the statistical significance of the estimated parameters.
- Forecasts based on the mis-specified model will therefore be unreliable.
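A small Monte Carlo illustrates the bias; this is a hedged sketch with invented coefficients, reusing the GDP/INF names from the example:

```python
import numpy as np
import statsmodels.api as sm

# True model: r = 1 + 2*gdp + 1.5*inf + noise, with inf correlated with gdp.
# We estimate a model that omits inf and watch the slope on gdp.
rng = np.random.default_rng(4)
slopes = []
for _ in range(1000):
    gdp = rng.normal(size=100)
    inf = 0.7 * gdp + rng.normal(size=100)          # correlated with gdp
    r = 1.0 + 2.0 * gdp + 1.5 * inf + rng.normal(size=100)
    res = sm.OLS(r, sm.add_constant(gdp)).fit()     # omits inf
    slopes.append(res.params[1])
print(np.mean(slopes))   # centered near 2 + 1.5*0.7 = 3.05, not the true 2.0
```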

Model Specification Errors: Omitting Relevant Variables

To avoid omitted-variable bias, a simple solution is to add the omitted variable back into the model; the difficulty is detecting which variable was omitted. Omitted-variable bias is hard to detect, but there can be obvious indications of this specification error. The best way to detect it is to rely on the theoretical arguments behind the model:
- Which variables does the theory suggest should be included?

Model Specification Errors: Omitting Relevant Variables


- What are the expected signs of the coefficients?
- Have we omitted a variable that most other similar studies include in the estimated model?

Note, though, that a significant coefficient with an unexpected sign can also occur due to a small sample size. However, most data sets used in empirical finance are large enough that this is most likely not the cause of the specification bias.

Model Specification Errors: Including Irrelevant Variables


Example: going back to the two-factor model, suppose that we include a third explanatory variable, for example the degree of wage inequality (INEQ). So we estimate the following model:

r_t = \beta_0 + \beta_1 GDP_t + \beta_2 INF_t + \beta_3 INEQ_t + \varepsilon_t

The estimated coefficients (both constant and slopes) are unbiased, and the variance of the error term is estimated accurately.

Model Specification Errors: Including Irrelevant Variables


However, the coefficient estimates are inefficient: the inclusion of an irrelevant variable (INEQ) in the model increases the standard errors of the estimated coefficients and thus decreases the t-statistics. This implies that it will be more difficult to reject a null hypothesis that a coefficient of one of the explanatory variables is equal to zero.

Model Specification Errors: Including Irrelevant Variables


Also, the inclusion of an irrelevant variable will usually decrease the adjusted R-squared (but not the R-squared). An overspecified model is considered a lesser evil than an underspecified one, but it brings other problems, such as multicollinearity and loss of degrees of freedom.

Model Specification Criteria


To decide whether an explanatory variable belongs in a regression model, we can check whether most of the following conditions hold:
- Theory: is the decision to include the explanatory variable theoretically sound?
- t-test: is the variable statistically significant, and does it have the expected coefficient sign?
- Adjusted R2: does the overall fit of the model improve when we add the explanatory variable?
- Bias: do the coefficients of the other variables change significantly (in sign or statistical significance) when we add the variable?

Problems with Specification Searches


In an attempt to find the "right" or desired model, a researcher may estimate numerous models until one with the desired properties is obtained. The wrong approach to model specification is data mining: estimating every possible model and reporting only those that produce the desired results. The researcher should minimize the number of estimated models and guide the selection of variables mainly by theory, not purely by statistical fit.

Sequential Model Specification Searches


In an effort to find the appropriate regression model, it is common to begin with a benchmark (or base) specification and then sequentially add or drop variables. The base specification can rely on theory; variables are then added or dropped based on adjusted R2 and t-statistics. In this effort, it is important to follow the principle of parsimony: try to find the simplest model that best fits the data. Make use of the F test for the incremental contribution of variables.

F test for incremental contribution of variables


A very useful test for deciding whether a new variable should be retained in the model. E.g., the return on a stock is a function of the country's GDP growth and inflation; the question is whether to include inflation in the model. Estimate the model without inflation and obtain R2(old).

F test for incremental contribution of variables


Re-estimate including inflation and obtain R2(new). Then compute:

F = \frac{(R^2_{new} - R^2_{old}) / (\text{no. of new parameters})}{(1 - R^2_{new}) / (n - k_{new})}

H0: the addition of the new variable does not improve the model.
H1: the addition of the new variable improves the model.
If the estimated F is higher than the critical F table value, reject the null hypothesis: inflation needs to be included in the example above.
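A sketch of this incremental F test on synthetic data (variable names and coefficients are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Does adding inflation improve the model?
rng = np.random.default_rng(5)
gdp = rng.normal(size=100)
inf = rng.normal(size=100)
r = 1.0 + 2.0 * gdp + 1.5 * inf + rng.normal(size=100)

r2_old = sm.OLS(r, sm.add_constant(gdp)).fit().rsquared
res_new = sm.OLS(r, sm.add_constant(np.column_stack([gdp, inf]))).fit()
r2_new, k_new, n, m = res_new.rsquared, 3, len(r), 1   # m = no. of new parameters

F = ((r2_new - r2_old) / m) / ((1 - r2_new) / (n - k_new))
p = stats.f.sf(F, m, n - k_new)
print(F, p)     # small p-value -> keep inflation in the model
```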

Nominal vs. True level of Significance


A model derived from data mining should not be assessed at conventional levels of significance (\alpha) such as 1%, 5%, or 10%. If there were c candidate regressors of which k were selected after data mining, the true level of significance (\alpha^*) is related to the nominal level (\alpha) by:

\alpha^* = 1 - (1 - \alpha)^{c/k}

For example, if c = 2, k = 1 and \alpha = 5%, then \alpha^* \approx 10%.

Model Specification: Choosing the Functional Form


One of the assumptions needed to derive the desirable properties of OLS estimates is that the estimated model is linear. What if the relationship between two variables is not linear? OLS retains its properties of unbiasedness and minimum variance if we can transform the non-linear relationship into a model that is linear in the coefficients. An interesting case is the double-log (log-log) form.

Model Specification: Choosing the Functional Form

Example: a well-known model of nominal exchange rate determination is the Purchasing Power Parity (PPP) model, s = P/P*, where s = nominal exchange rate (e.g. rand/$), P = price level in SA, and P* = price level in the US.

Taking natural logs, we can estimate the following model:

ln(s_t) = \beta_0 + \beta_1 ln(P_t) + \beta_2 ln(P^*_t) + \varepsilon_t

Model Specification: Choosing the Functional Form

Property of the double-log model: the estimated coefficients are elasticities between the dependent and explanatory variables.

Example: under PPP (\beta_1 = 1), a 1% change in P results in a 1% change in the nominal exchange rate (s).

How do we know if we've got the right functional form for our model? Check the expected coefficient signs, R2, t-statistics, and the DW d-statistic.
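The double-log model is estimated by simply taking logs before running OLS; a sketch with synthetic series satisfying PPP (all data invented; Pstar stands in for P*):

```python
import numpy as np
import statsmodels.api as sm

# Coefficients of a log-log model are elasticities.
rng = np.random.default_rng(6)
P = np.exp(rng.normal(size=200))            # SA price level (synthetic)
Pstar = np.exp(rng.normal(size=200))        # US price level (synthetic)
s = (P / Pstar) * np.exp(rng.normal(scale=0.1, size=200))  # PPP plus noise

X = sm.add_constant(np.column_stack([np.log(P), np.log(Pstar)]))
res = sm.OLS(np.log(s), X).fit()
print(res.params)    # elasticities: roughly +1 on ln(P) and -1 on ln(P*)
```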

Model Specification: Choosing the Functional Form


If the results are not satisfactory, examine the error terms and use economic theory as a guide. We have seen that a linear regression can fit non-linear relationships:
- logs on the RHS, LHS, or both
- quadratic forms of the x's
- interactions of the x's

How to choose Functional Form


Think about the interpretation. Does it make more sense for x to affect y in percentage terms (use logs) or in absolute terms? Does it make more sense for the derivative with respect to x1 to vary with x1 (quadratic), to vary with x2 (interactions), or to be fixed?

How to choose Functional Form (cont'd)


We already know how to test joint exclusion restrictions to see whether higher-order terms or interactions belong in the model. However, adding and testing extra terms can be tedious, and we may find that a squared term matters when logs would actually fit better. A general test of functional form is Ramsey's regression specification error test (RESET).

DW test for model misspecification


Suppose you suspect that a relevant variable Z (possibly a polynomial of an existing X) was omitted from the assumed model. Obtain the OLS residuals from the assumed model, order them according to increasing values of Z, and compute the d statistic from the ordered residuals. If autocorrelation is detected, the model is misspecified.

Ramseys RESET
Regression Specification Error Test: estimate the assumed model and obtain the fitted values \hat{y}. Then estimate

y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \delta_1 \hat{y}^2 + \delta_2 \hat{y}^3 + \text{error}

and test H0: \delta_1 = 0, \delta_2 = 0 with the incremental F test:

F = \frac{(R^2_{new} - R^2_{old}) / (\text{no. of new parameters})}{(1 - R^2_{new}) / (n - k_{new})}

If H0 is rejected, the model is mis-specified. Advantage: RESET does not require you to specify the correct alternative model. Disadvantage: it does not help in finding the right model.
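A manual RESET sketch on synthetic data (statsmodels also ships a ready-made linear_reset in statsmodels.stats.diagnostic; the data and coefficients here are invented):

```python
import numpy as np
import statsmodels.api as sm

# Add powers of fitted values and F-test their joint significance.
rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + 1.0 * x**2 + rng.normal(size=200)   # true model is quadratic

base = sm.OLS(y, sm.add_constant(x)).fit()              # mis-specified linear fit
yhat = base.fittedvalues
aug = sm.OLS(y, sm.add_constant(np.column_stack([x, yhat**2, yhat**3]))).fit()

n, k_new, m = len(y), 4, 2
F = ((aug.rsquared - base.rsquared) / m) / ((1 - aug.rsquared) / (n - k_new))
print(F)    # large F -> reject H0 -> evidence of mis-specification
```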

Lagrange Multiplier Test for Adding variable


Model 1 (restricted):   Y = b0 + b1X1
Model 2 (unrestricted): Y = b0 + b1X1 + b2X2 + b3X3

Obtain the residuals from model 1 and regress them on all the X's in model 2, including those in model 1:

\hat{u}_i = a_0 + a_1 X_{1i} + a_2 X_{2i} + a_3 X_{3i} + v_i

Then nR^2 \sim \chi^2 with degrees of freedom equal to the number of restrictions. If the estimated chi-square exceeds the critical chi-square, reject the restricted regression.
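A sketch of the LM test on synthetic data (names X1-X3 follow the slide; data and coefficients are invented):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Regress restricted-model residuals on ALL regressors; n*R^2 ~ chi-square.
rng = np.random.default_rng(8)
x1, x2, x3 = rng.normal(size=(3, 200))
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=200)    # x3 truly irrelevant here

resid = sm.OLS(y, sm.add_constant(x1)).fit().resid      # restricted: x1 only
aux = sm.OLS(resid, sm.add_constant(np.column_stack([x1, x2, x3]))).fit()

lm = len(y) * aux.rsquared
p = stats.chi2.sf(lm, df=2)          # 2 restrictions (x2 and x3 excluded)
print(lm, p)                         # small p -> reject the restricted model
```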

Nested vs. Non-nested Models


Nested:
Y = a + b1X1 + b2X2 + b3X3 + b4X4
Y = a + b1X1 + b2X2
Specification tests and the restricted F test can be used to test for model specification errors.
Non-nested:
Y = a + b1X1 + b2X2
Y = c0 + c1Z1 + c2Z2

Tests for Non-nested Models


1) Discrimination approach: simply select the better model based on goodness of fit (R2, adjusted R2, AIC, SIC/SBC):

R^2 = \frac{ESS}{TSS}

\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k}

AIC = e^{2k/n}\,\frac{RSS}{n}

SIC = n^{k/n}\,\frac{RSS}{n}

2)Discerning approach: make use of information provided by other models as well along with the initial model

Non-nested Discerning Tests


If the models have the same dependent variable but non-nested x's, we can form an encompassing model with the x's from both and test the joint exclusion restrictions that lead to one model or the other:

Model A: Y = a + b1X1 + b2X2
Model B: Y = c0 + c1Z1 + c2Z2
Encompassing: Y = a + b1X1 + b2X2 + c1Z1 + c2Z2

Use the incremental F test with each equation in turn as the reference model:

F = \frac{(R^2_{new} - R^2_{old}) / (\text{no. of new parameters})}{(1 - R^2_{new}) / (n - k_{new})}

Davidson-MacKinnon J test

An alternative, the Davidson-MacKinnon J test, uses the fitted values from one model as a regressor in the second model and tests their significance:
Model A: Y = a + b1X1 + b2X2
Model B: Y = c0 + c1Z1 + c2Z2
Estimate B and obtain the fitted values \hat{Y}_B, then estimate Y = a + b1X1 + b2X2 + b3\hat{Y}_B.

Davidson-MacKinnon J test (cont'd)

Use a t-test: if b3 = 0 is not rejected, we accept model A. Then reverse the models and repeat the steps. The test is more difficult if one model uses y and the other ln(y); the same basic logic can be followed by transforming the predicted ln(y) for the second step. In any case, the Davidson-MacKinnon test may reject neither or both models rather than clearly preferring one specification.
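A J-test sketch on synthetic data where model A is the true one (all names and numbers invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Fitted values from model B enter model A as an extra regressor.
rng = np.random.default_rng(9)
x, z = rng.normal(size=(2, 200))
y = 1.0 + 2.0 * x + rng.normal(size=200)        # data generated by model A

yhat_b = sm.OLS(y, sm.add_constant(z)).fit().fittedvalues   # model B fit
res = sm.OLS(y, sm.add_constant(np.column_stack([x, yhat_b]))).fit()
# Insignificant coefficient on yhat_b -> model A is not rejected
print(res.tvalues[-1], res.pvalues[-1])
```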

Measurement Error
Sometimes we have the variable we want, but we think it is measured with error. Examples: a survey asks how many hours you worked over the last year, or how many weeks you used child care when your child was young. The consequences of measurement error in y differ from those of measurement error in x.

Measurement Error: Dependent Variable


y* is not directly measurable; it is measured with error as y = y* + e. Thus we are really estimating y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + (u + e). When will OLS produce unbiased results? Only if E(e) = E(u) = 0 and e is uncorrelated with the x_j and with u will \hat{\beta} be unbiased; even then it has larger variances than with no measurement error.

Measurement Error: Explanatory Variable


x* is not directly measurable; it is measured with error as x_1 = x_1^* + e_1, i.e. the measurement error is e_1 = x_1 - x_1^*. With y = \beta_0 + \beta_1 x_1^* + u = \beta_0 + \beta_1 (x_1 - e_1) + u, we are really estimating y = \beta_0 + \beta_1 x_1 + (u - \beta_1 e_1).

Measurement Error: Explanatory Variable

Assume E(e1) = 0, Cov(ei, ej) = 0, and Cov(ei, ui) = 0.

The effect of measurement error on OLS estimates depends on our assumption about the correlation between e1 and x1

If Cov(x1, e1) ≠ 0, OLS estimates are biased and their variances are larger. Use proxy or instrumental variables (IV).

Proxy Variables
What if the model is mis-specified because no data are available on an important x variable? It may be possible to avoid omitted-variable bias by using a proxy variable. A proxy variable must be related to the unobservable variable but uncorrelated with the error term (this can be checked with a Sargan test).

Lagged Dependent Variables


What if there are unobserved variables and you can't find reasonable proxy variables? It may be possible to include a lagged dependent variable to account for omitted factors that contribute to both past and current levels of y. Obviously, you must think past and current y are related for this to make sense.

Missing Data Is it a Problem?


If any observation is missing data on one of the variables in the model, it can't be used. If data are missing at random, using a sample restricted to observations with no missing values is fine. A problem arises if the data are missing systematically, say high-income individuals refuse to provide income data.

Non-random Samples
If the sample is chosen on the basis of an x variable, estimates are unbiased. If the sample is chosen on the basis of the y variable, we have sample selection bias. Sample selection can be more subtle: say we look at wages for workers; since people choose to work, this is not the same as wage offers.

Outliers
Sometimes an individual observation can be very different from the others and can have a large effect on the outcome. Sometimes this outlier is simply due to errors in data entry, which is one reason why looking at summary statistics is important. Sometimes the observation is just genuinely very different from the others.

Outliers (cont'd)
It is not unreasonable to fix observations where it's clear there was just an extra zero entered or left off, etc. Nor is it unreasonable to drop observations that appear to be extreme outliers, although readers may prefer to see estimates with and without them. Stata can be used to investigate outliers.

Model Selection Criteria


A chosen model should:
- Be data admissible: predictions must be realistic
- Be consistent with theory
- Have weakly exogenous regressors
- Exhibit parameter constancy: values and signs must be consistent
- Exhibit data coherency: white-noise residuals
- Be encompassing

Matrix Approach to OLS

y = X\beta + u, where y is n×1, X is n×(k+1), \beta is (k+1)×1, and u is n×1.

OLS estimator: \hat{\beta} = (X'X)^{-1} X'y

Assumptions
E(u) = 0, where u and 0 are n×1 column vectors, 0 being a null vector.
E(uu') = \sigma^2 I, where I is an n×n identity matrix (homoscedasticity and no autocorrelation).
The n×k matrix X is non-stochastic.

Assumptions

The rank of X is \rho(X) = k, where k is the number of columns of X and k is less than the number of observations n (no perfect multicollinearity): \lambda x = 0, where \lambda is a 1×k row vector and x is a k×1 column vector, only if \lambda = 0.
The u vector has a multivariate normal distribution, i.e. u \sim N(0, \sigma^2 I).

Inference in matrix form:

R^2 = \frac{\hat{\beta}'X'y - n\bar{Y}^2}{y'y - n\bar{Y}^2}

\text{var-cov}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}

\hat{\sigma}^2 = \frac{\hat{u}'\hat{u}}{n-k} = \frac{\sum \hat{u}_i^2}{n-k}
