
Problem Set 9 Heteroskedasticity Answers

/* INVESTIGATION OF HETEROSKEDASTICITY */
First graph the data
. u hetdat2
. gra manuf gdp, s([country].) xlab ylab

[Scatter plot: manufacturing output (US$ million) against GDP (US$ million), points labelled by country. Most countries cluster at low GDP and low manufacturing output; the UK, France and Italy lie far to the right at much higher levels of GDP.]
Data are tightly packed with exception of a few outliers (Italy, UK and France)
Sometimes outliers can lead to heteroskedasticity. (Similar levels of GDP, big
variation in level of manufacturing output).
. reg manuf gdp

      Source |       SS       df       MS              Number of obs =      28
-------------+------------------------------           F(  1,    26) =  210.73
       Model |  1.1600e+11     1  1.1600e+11           Prob > F      =  0.0000
    Residual |  1.4312e+10    26   550464875           R-squared     =  0.8902
-------------+------------------------------           Adj R-squared =  0.8859
       Total |  1.3031e+11    27  4.8264e+09           Root MSE      =   23462

------------------------------------------------------------------------------
       manuf |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |   .1936932    .0133428   14.517   0.000      .1662666   .2211197
       _cons |   603.8754    5699.688    0.106   0.916        -11112   12319.75
------------------------------------------------------------------------------
As a first check for heteroskedasticity, inspect the pattern of the residuals. If heteroskedasticity is present you should see the residual variance increasing (or decreasing) with the value of gdp.

. predict res, resid    /* this command gets the residuals from the previous regression and saves them under the name res */

. gra res gdp, s([country]) yline(0) xlab ylab


[Scatter plot: residuals against GDP (US$ million), points labelled by country, with a horizontal line at zero. Korea has the largest positive residual and Italy the largest negative one, and the spread of the residuals tends to widen with GDP.]

The graph is a little unclear: the residual spread does increase with GDP, but with the exception of France. A more formal test is needed to get round the ambiguity.
1) Goldfeld-Quandt Test
. sort gdp

/* sort data in ascending order */

Omit the middle c observations (approximately 20% of the full sample; in this example c = 4).
So (N - c)/2 = (28 - 4)/2 = 12
. reg manuf gdp if _n<=12    /* regression on the first 12 observations. These should have the smallest residual variance if the residual variance increases with the level of GDP */
      Source |       SS       df       MS              Number of obs =      12
-------------+------------------------------           F(  1,    10) =   27.95
       Model |   438802283     1   438802283           Prob > F      =  0.0004
    Residual |   157002655    10  15700265.5           R-squared     =  0.7365
-------------+------------------------------           Adj R-squared =  0.7101
       Total |   595804938    11  54164085.2           Root MSE      =  3962.4

------------------------------------------------------------------------------
       manuf |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |   .2293109    .0433754    5.287   0.000      .1326644   .3259573
       _cons |  -607.1009    2598.401   -0.234   0.820     -6396.699   5182.497
------------------------------------------------------------------------------
Now repeat the regression for the top 12 observations in the data set

. reg manuf gdp if _n>=17


      Source |       SS       df       MS              Number of obs =      12
-------------+------------------------------           F(  1,    10) =   41.88
       Model |  5.6947e+10     1  5.6947e+10           Prob > F      =  0.0001
    Residual |  1.3597e+10    10  1.3597e+09           R-squared     =  0.8073
-------------+------------------------------           Adj R-squared =  0.7880
       Total |  7.0544e+10    11  6.4131e+09           Root MSE      =   36874

------------------------------------------------------------------------------
       manuf |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |   .1879586     .029043    6.472   0.000      .1232468   .2526705
       _cons |   5556.995    18758.07    0.296   0.773     -36238.59   47352.58
------------------------------------------------------------------------------

Goldfeld-Quandt Test = RSShigh/RSSlow ~ F((N-c-2k)/2, (N-c-2k)/2)

where N = full sample size
      c = number of observations dropped from the middle
      k = number of parameters in the estimated equation

= 1.3597e+10/1.5700e+08 = 86.6 ~ F((28-4-2(2))/2, (28-4-2(2))/2) = F(10,10)

From tables, the critical value at the 5% level for F(10,10) is approximately 2.98.

So estimated F > Fcritical
So reject the null that the data are homoskedastic.
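The mechanics of the test can be sketched in Python (a hypothetical illustration rather than a reproduction of the Stata session; the helper function and the synthetic data are assumptions made for the example):

```python
import random

def simple_ols_rss(x, y):
    """RSS from an OLS regression of y on x with a constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def goldfeld_quandt(x, y, drop_frac=0.2):
    """GQ statistic: RSS from the high-x half over RSS from the low-x half."""
    pairs = sorted(zip(x, y))            # sort by the suspect regressor
    n = len(pairs)
    c = int(round(drop_frac * n))        # middle observations to omit
    half = (n - c) // 2
    low, high = pairs[:half], pairs[n - half:]
    rss_low = simple_ols_rss(*zip(*low))
    rss_high = simple_ols_rss(*zip(*high))
    return rss_high / rss_low            # compare with F(half-k, half-k), k = 2

random.seed(1)
x = [float(i) for i in range(1, 101)]
# error variance grows with x, so these data are heteroskedastic by design
y = [2 + 0.2 * xi + random.gauss(0, 0.05 * xi) for xi in x]
print(goldfeld_quandt(x, y))   # compare with the F critical value
```

The statistic is well above 1 for these data, mirroring the rejection above.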
2) Breusch-Pagan Test
The idea is to regress a proxy for the residual variance on the variables thought to be causing the heteroskedasticity problem (or, if you are not sure which variables, include all of them in this auxiliary regression). Hence this is a more general test for heteroskedasticity than the Goldfeld-Quandt test.

Since the residual variance s2 = RSS/(N-k) = (sum of u^2)/(N-k), the square of the estimated residual for each observation, u^2, is a good proxy for the residual variance. Hence run the auxiliary regression

u^2i = d0 + d1GDPi + ei

This will suggest heteroskedasticity if the coefficient on GDP in this auxiliary regression is significantly different from zero (residual variance correlated with the right-hand-side variable).
. predict reshat, resid    /* so first save the residuals */
. g res2=reshat^2          /* square them */

Then regress the squared residuals on all the right-hand-side variables from the original regression (in this case just GDP).

. reg res2 gdp

      Source |       SS       df       MS              Number of obs =      28
-------------+------------------------------           F(  1,    26) =    3.12
       Model |  6.6744e+18     1  6.6744e+18           Prob > F      =  0.0893
    Residual |  5.5687e+19    26  2.1418e+18           R-squared     =  0.1070
-------------+------------------------------           Adj R-squared =  0.0727
       Total |  6.2361e+19    27  2.3097e+18           Root MSE      =  1.5e+09

------------------------------------------------------------------------------
        res2 |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |   1469.225    832.2865     1.77   0.089     -241.5643   3180.015
       _cons |   1.17e+08    3.56e+08     0.33   0.745     -6.14e+08   8.48e+08
------------------------------------------------------------------------------
The Breusch-Pagan Test statistic is N*R2auxiliary = 28*0.107 = 3.00
This has a chi-squared distribution with degrees of freedom equal to the number of right-hand-side variables in the auxiliary regression excluding the constant (in this case 1).
From tables, the critical value at the 5% level is 3.84.
So estimated chi2 < chi2critical, so do not reject the null that the residuals are homoskedastic.
(Note that an equivalent test is the F test of the goodness of fit of the model as a whole from the auxiliary regression.)
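The auxiliary-regression logic can be sketched in Python (a hypothetical single-regressor illustration; the helper functions and the made-up data are assumptions, not part of the original exercise):

```python
def ols_fit(x, y):
    """Intercept and slope from an OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def r_squared(x, y):
    """R^2 from an OLS regression of y on x."""
    a, b = ols_fit(x, y)
    my = sum(y) / len(y)
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    return 1 - rss / tss

def breusch_pagan(x, y):
    """LM = N * R^2 from regressing the squared residuals on x."""
    a, b = ols_fit(x, y)
    res2 = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    return len(x) * r_squared(x, res2)   # compare with chi2(1): 3.84 at 5%

x = [float(i) for i in range(1, 21)]
# error amplitude grows with x: heteroskedastic by construction
y_het = [2 + 0.5 * xi + 0.1 * xi * (-1) ** i for i, xi in enumerate(x, 1)]
# constant error amplitude: homoskedastic by construction
y_hom = [2 + 0.5 * xi + 0.1 * (-1) ** i for i, xi in enumerate(x, 1)]
print(breusch_pagan(x, y_het), breusch_pagan(x, y_hom))
```

For these data the first LM value comfortably exceeds the 5% critical value of 3.84 and the second does not, matching the logic of the test above.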

If the exact form of the heteroskedasticity is not known (and this is likely to be the case for most estimations: the heteroskedasticity will vary across more than one variable, so the tests and methods above are not valid, and it is unlikely that you can say with certainty what the true functional form of the heteroskedasticity looks like), then use White-adjusted standard errors (see lecture notes), which fix up the biased OLS standard errors. (Note that these are not the same as if you knew the true form of the heteroskedasticity, but they are better than the unadjusted OLS ones.)
Original regression
. reg manuf gdp

      Source |       SS       df       MS              Number of obs =      28
-------------+------------------------------           F(  1,    26) =  210.73
       Model |  1.1600e+11     1  1.1600e+11           Prob > F      =  0.0000
    Residual |  1.4312e+10    26   550464875           R-squared     =  0.8902
-------------+------------------------------           Adj R-squared =  0.8859
       Total |  1.3031e+11    27  4.8264e+09           Root MSE      =   23462

------------------------------------------------------------------------------
       manuf |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |   .1936932    .0133428   14.517   0.000      .1662666   .2211197
       _cons |   603.8754    5699.688    0.106   0.916        -11112   12319.75
------------------------------------------------------------------------------

Now with adjusted standard errors (you should know what the adjustment does; see lecture notes).

. reg manuf gdp, robust

Regression with robust standard errors                 Number of obs =      28
                                                       F(  1,    26) =  116.39
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.8902
                                                       Root MSE      =   23462

------------------------------------------------------------------------------
             |               Robust
       manuf |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |   .1936932    .0179542   10.788   0.000      .1567879   .2305985
       _cons |   603.8754    3542.399    0.170   0.866     -6677.629    7885.38
------------------------------------------------------------------------------

The heteroskedasticity-consistent standard error on gdp is now larger (.018 compared with .013), and hence the t value is lower and the confidence interval wider than in the unadjusted regression.
The reason you might worry is that this adjustment is only valid asymptotically
(as the sample size gets very large). With only 28 observations this is far from
being valid. So the adjusted standard errors and t stats. may be as wrong as the
original OLS ones.
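For a simple regression the White (HC0) adjustment can be sketched in Python (a schematic illustration using the standard textbook formulas; the data are invented for the example):

```python
def ols_with_robust_se(x, y):
    """Slope plus classical and White (HC0) standard errors for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # classical SE assumes one common residual variance s^2
    se_classical = (sum(r * r for r in res) / (n - 2) / sxx) ** 0.5
    # HC0: each squared residual keeps its own weight (x_i - xbar)^2
    se_robust = (sum(((xi - mx) ** 2) * r * r
                     for xi, r in zip(x, res)) / sxx ** 2) ** 0.5
    return b, se_classical, se_robust

x = [float(i) for i in range(1, 21)]
# residual amplitude is largest at the extremes of x, where leverage is highest
y = [3 + 0.5 * xi + 0.1 * (xi - 10.5) * (-1) ** i for i, xi in enumerate(x, 1)]
b, se_c, se_r = ols_with_robust_se(x, y)
print(b, se_c, se_r)
```

With this pattern of heteroskedasticity the robust standard error exceeds the classical one, as it does for gdp in the output above.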

2. Sometimes heteroskedasticity exists within sub-samples of your data, amongst variables not included in your regression. In this question you test whether the residual variances differ across the male and female sub-samples of the data. In this case there are equal numbers of men and women in the data set, so it is OK to use the Goldfeld-Quandt test.

First do the Breusch-Pagan test again.

. u hetwage    /* read in the data set hetwage.dta */
. reg logpay exper exper2

      Source |       SS       df       MS              Number of obs =     970
-------------+------------------------------           F(  2,   967) =   30.04
       Model |  18.7214801     2  9.36074005           Prob > F      =  0.0000
    Residual |  301.357698   967   .31164188           R-squared     =  0.0585
-------------+------------------------------           Adj R-squared =  0.0565
       Total |  320.079178   969  .330319069           Root MSE      =  .55825

------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |   .0408645    .0052725     7.75   0.000      .0305177   .0512114
      exper2 |  -.0008537    .0001144    -7.46   0.000     -.0010782  -.0006291
       _cons |   1.626793    .0529785    30.71   0.000      1.522827   1.730759
------------------------------------------------------------------------------
. predict res, resid
. g res2=res^2        /* save the residuals and square them */

. reg res2 exper exper2

      Source |       SS       df       MS              Number of obs =     970
-------------+------------------------------           F(  2,   967) =    1.63
       Model |  .945424049     2  .472712024           Prob > F      =  0.1974
    Residual |  281.289645   967  .290888981           R-squared     =  0.0033
-------------+------------------------------           Adj R-squared =  0.0013
       Total |  282.235069   969  .291264261           Root MSE      =  .53934

------------------------------------------------------------------------------
        res2 |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |   .0034881    .0050939     0.68   0.494     -.0065083   .0134846
      exper2 |  -.0000213    .0001106    -0.19   0.847     -.0002383   .0001956
       _cons |   .2494289    .0511841     4.87   0.000      .1489842   .3498737
------------------------------------------------------------------------------
The Breusch-Pagan Test statistic is N*R2auxiliary = 970*0.0033 = 3.20
This has a chi-squared distribution with degrees of freedom equal to the number of right-hand-side variables in the auxiliary regression excluding the constant (in this case 2).
From tables, the critical value at the 5% level is 5.99.
So estimated chi2 < chi2critical, so do not reject the null that the residuals are homoskedastic.

Sometimes heteroskedasticity can be caused by variables that do not appear in the original regression (but perhaps should).
. reg logpay exper exper2    /* pooled regression on men and women together */

      Source |       SS       df       MS              Number of obs =     970
-------------+------------------------------           F(  2,   967) =   30.04
       Model |  18.7214801     2  9.36074005           Prob > F      =  0.0000
    Residual |  301.357698   967   .31164188           R-squared     =  0.0585
-------------+------------------------------           Adj R-squared =  0.0565
       Total |  320.079178   969  .330319069           Root MSE      =  .55825

------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |   .0408645    .0052725    7.750   0.000      .0305177   .0512114
      exper2 |  -.0008537    .0001144   -7.460   0.000     -.0010782  -.0006291
       _cons |   1.626793    .0529785   30.707   0.000      1.522827   1.730759
------------------------------------------------------------------------------
The regression suggests that pay rises with work experience but at a decreasing rate (the coefficient on the quadratic term is negative).
Now do separate regressions for
a) women
. reg logpay exper exper2 if female==1

      Source |       SS       df       MS              Number of obs =     485
-------------+------------------------------           F(  2,   482) =   13.75
       Model |  7.70631111     2  3.85315556           Prob > F      =  0.0000
    Residual |  135.033678   482   .28015286           R-squared     =  0.0540
-------------+------------------------------           Adj R-squared =  0.0501
       Total |  142.739989   484  .294917334           Root MSE      =  .52929

------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |    .035764    .0074775    4.783   0.000      .0210716   .0504564
      exper2 |  -.0008903    .0001719   -5.180   0.000     -.0012281  -.0005526
       _cons |   1.587002    .0694657   22.846   0.000      1.450509   1.723495
------------------------------------------------------------------------------
b) men
. reg logpay exper exper2 if female==0

      Source |       SS       df       MS              Number of obs =     485
-------------+------------------------------           F(  2,   482) =   32.95
       Model |  17.9038006     2  8.95190032           Prob > F      =  0.0000
    Residual |  130.932656   482  .271644515           R-squared     =  0.1203
-------------+------------------------------           Adj R-squared =  0.1166
       Total |  148.836457   484   .30751334           Root MSE      =   .5212

------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |   .0539586    .0068722    7.852   0.000      .0404554   .0674618
      exper2 |  -.0009829    .0001415   -6.944   0.000      -.001261  -.0007048
       _cons |   1.598233    .0727244   21.977   0.000      1.455337   1.741129
------------------------------------------------------------------------------

In this case the Goldfeld-Quandt Test = RSShigh/RSSlow ~ F((N-c-2k)/2, (N-c-2k)/2)

becomes RSSwomen/RSSmen
(since women are the higher-variance sub-sample; look at the RSS in the output above)

where N = full sample size = 970
      c = number of observations dropped from the middle (zero in this case)
      k = number of parameters in the estimated equation (3)

= 135.0/130.9 = 1.03 ~ F((970-0-2(3))/2, (970-0-2(3))/2) = F(482,482)

From tables, the critical value at the 5% level for F(482,482) is approximately 1.16.

So estimated F < Fcritical, so do not reject the null that the residual variances are the same across the two sub-samples. (The statistic is not far from the critical value, but in most cases it will not make much difference if you pool the data.)
Fix up combined regression using White-adjusted standard errors.
. reg logpay exper exper2, robust

Regression with robust standard errors                 Number of obs =     970
                                                       F(  2,   967) =   31.44
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.0585
                                                       Root MSE      =  .55825

------------------------------------------------------------------------------
             |               Robust
      logpay |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |   .0408645    .0052659    7.760   0.000      .0305307   .0511984
      exper2 |  -.0008537    .0001216   -7.020   0.000     -.0010923   -.000615
       _cons |   1.626793    .0482309   33.729   0.000      1.532144   1.721442
------------------------------------------------------------------------------

Note that the sample size is 970, so we do not have to worry too much about small-sample effects on the White-adjusted standard errors.

3. You have data on firm size, N, and the level of profits, PROF, measured in millions, for 118 firms. You estimate the following regression:

         ^
(1)   PROF = 18,000 + 1,000 N  -  1.0 N2        R2 = 0.35
             (12,000)  (100)      (0.25)        RSS = 20,000

where the numbers in brackets are estimated standard errors.

You then run the regression twice more, first with the 50 firms with the lowest profits, then with the 50 firms with the highest profits. The residual sums of squares in the two regressions are 10,000 and 35,000 respectively.
i) Interpret the results. At what firm size are profits maximised?

You should recognise that the firm-size variables are statistically significant (t values > 2) while the constant is not. The constant gives the notional amount of profit when firm size is zero. The R-squared suggests that 35% of the variation in profits is explained by the right-hand-side variables. The coefficients suggest that profits first rise and then fall with firm size.

Profits are maximised when dProf/dN = 0    (1st order condition for a max)

where (from the question) dProf/dN = 1000 - 2*1.0*N = 0

so 2N = 1000
   N = 500
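The arithmetic can be checked numerically with a quick grid search (the coefficients are taken from the question):

```python
def profit(n):
    # fitted equation from the question: PROF = 18,000 + 1,000*N - 1.0*N^2
    return 18000 + 1000 * n - 1.0 * n ** 2

# maximise fitted profit over a grid of firm sizes
best_n = max(range(0, 1001), key=profit)
print(best_n, profit(best_n))   # 500 268000.0
```

The grid maximum agrees with the first-order condition: profits peak at N = 500.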
ii) What do you understand by the term heteroskedasticity? What causes heteroskedasticity? What are the implications for OLS estimation if heteroskedasticity exists?

Heteroskedasticity means that the residual variance is not constant. It can be caused by an incorrect functional form in the dependent or independent variables (try taking logs). Otherwise, the nature of behavioural agents (firms, individuals) means that the variation in sample populations often depends on the values of the right-hand-side variables under study.

Consequences: the OLS estimates are still unbiased (see lecture notes for the proof), but the standard errors are biased (and therefore the t and F values and confidence intervals are also biased).
iii) Perform a test for heteroskedasticity in the model. Comment on your results.

Given the information in the question, the only test you could do is the Goldfeld-Quandt test for heteroskedasticity:

= RSShigh variance sample/RSSlow variance sample ~ F((N-c-2k)/2, (N-c-2k)/2)

where N = full sample size (118)
      c = number of observations dropped from the middle (118 - 100 = 18)
      k = number of parameters in the estimated equation (3)

= 35000/10000 = 3.5 ~ F((118-18-2(3))/2, (118-18-2(3))/2) = F(47,47)

From tables, the critical value at the 5% level for F(47,47) is approximately 1.6.

So estimated F > Fcritical
So reject the null hypothesis of no heteroskedasticity.
