
STAT-S-301 PROBLEM SET 1 SOLUTIONS

TA: Elise Petit


November 2016

1. Construct a scatterplot of birthweight on smoker. Is there a relationship between the variables? Explain.

When plotting the infant's birthweight against the smoker variable, one can see that two main lines appear: this is because the smoker variable is binary. However, we can still try to see whether there is a relationship between these two variables by looking at the distribution of each line: birthweight is more dispersed and reaches higher values for non-smoking mothers than for smoking mothers, hence we suspect a negative relationship between smoking and birthweight.
The fitted-values line represents the values of birthweight that would be predicted for different values of smoker using a simple regression model. Since smoker is a binary variable, only the two extremities can be interpreted and the line can be seen as a connector of these two points: it shows a negative impact of smoking on birthweight.

However, no strong conclusions can be drawn from that plot: we will need to conduct further analyses to see if the suspected negative relationship between smoking and birthweight is confirmed.

Conclusion: suspicion of a negative relationship, but no clear evidence.
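A minimal Stata sketch of how such a plot could be produced (assuming the dataset is already loaded and the variables are named birthweight and smoker, as in the text):

    * scatterplot of birthweight against smoker, with the OLS fitted line overlaid
    twoway (scatter birthweight smoker) (lfit birthweight smoker)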

2. What is the average birth weight for all mothers? For mothers who smoke? For mothers who do not smoke? Estimate the difference in averages between these two groups and construct a 95% confidence interval for that difference. Does smoking seem to affect birth weight? Explain.
Average birth weight (grams)
  All mothers:               3382.93
  Mothers who smoked:        3178.83
  Mothers who didn't smoke:  3432.06

The difference in average birthweight between mothers who didn't smoke and mothers who smoked is estimated to be 253.23 grams, with the true difference lying in the confidence interval [200.38, 306.07] with 95% confidence. Since zero is not included in this confidence interval, we can reject the hypothesis that there is no difference between the two subsample averages. Hence, smoking seems to significantly affect birth weight at the 95% confidence level.
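These figures could be reproduced with a short Stata sketch (assuming the data are loaded and the variables are named as in the text; the unequal option of ttest would additionally allow the two groups to have different variances):

    * overall mean, group means, and the difference-in-means test with its 95% CI
    summarize birthweight
    ttest birthweight, by(smoker)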

3. Using all observations, estimate the following regression:

$$birthweight_i = \beta_1 + \beta_2\, smoker_i + \varepsilon_i$$

(a) What is the estimated slope $\hat\beta_2$ and the estimated intercept $\hat\beta_1$? What do they mean?
After running a regular regression, we first check for heteroscedasticity using the Breusch-Pagan test. Since we have a p-value of 0.79, we cannot reject the null hypothesis of homoscedasticity at any reasonable significance level. We cannot conclude that there is heteroscedasticity in our model, but it is still safer to use heteroskedasticity-robust standard errors rather than homoskedasticity-only ones, since the robust errors remain valid whether or not the errors are homoskedastic, whereas the homoskedasticity-only ones can be misleading if heteroscedasticity is in fact present.
Our OLS estimates are $\hat\beta_2 = -253.23$, meaning that an infant whose mother smoked is expected to weigh 253.23 grams less than an infant whose mother did not smoke, and $\hat\beta_1 = 3432.06$, which is the expected birthweight for mothers who didn't smoke.
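A minimal Stata sketch of the commands behind these numbers (estat hettest runs the Breusch-Pagan/Cook-Weisberg test after a plain regression):

    * simple regression, heteroscedasticity test, then robust standard errors
    regress birthweight smoker
    estat hettest
    regress birthweight smoker, robust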

(b) How are they linked with what you did in point 2? Can you conclude that smoking has a significant effect on birth weight?

$\hat\beta_1$ corresponds to the average birthweight computed among mothers who didn't smoke, while $\hat\beta_2$ corresponds to the estimated difference in average birthweight between mothers who smoked and mothers who didn't smoke. We had already concluded that this difference was significantly different from zero, which means that smoking has an effect. We can now also use our regression results to check that smoker is a significant variable for birth weight by looking at the result of the single-coefficient significance test. Since the p-value associated with the significance test on smoker is essentially zero, we can reject the null hypothesis that the coefficient is equal to zero and again conclude that smoker has a significant impact on birthweight.

4. Estimate the following new regression:

$$birthweight_i = \beta_1 + \beta_2\, smoker_i + \beta_3\, alcohol_i + \beta_4\, nprevist_i + \varepsilon_i$$

(a) Explain why the exclusion of alcohol and nprevist could lead to omitted variable bias in the regression estimated in point 3. Does the estimated effect of smoking on birth weight differ from the estimation in point 3?

There is omitted variable bias if the omitted variable is correlated with the included regressor and is also a determinant of the dependent variable. Let's first check whether alcohol and nprevist are correlated with smoker (the regressor of the regression estimated in point 3):
Correlation with smoker
  alcohol:    12.09%
  nprevist:  -10.86%

Both variables are significantly correlated with smoker, but the correlations are quite low. We can now check whether alcohol and nprevist are determinants of birthweight by running two simple regressions: birthweight on alcohol, then birthweight on nprevist. Looking at their p-values, we see that alcohol is significant in a simple regression with birthweight only at the 6.5% significance level, while nprevist is significant at any significance level. We can therefore conclude that there is omitted variable bias, but mainly due to the exclusion of nprevist.
To look at the estimated effect of smoking on birth weight, we run the new multiple regression. However, just as before, we first check for heteroscedasticity using the Breusch-Pagan test. Since we have a p-value of 0, we reject the null hypothesis of homoscedasticity at every significance level. For the rest of the exercise, we will thus need to use heteroscedasticity-robust standard errors.
We see that the estimated effect of smoking on birth weight is now -217.58, which is smaller in absolute value than the -253.23 found in point 3. This reinforces the idea that the first regression suffered from omitted variable bias.
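A sketch of the corresponding Stata steps, assuming the same variable names (the sig option of pwcorr reports the p-value of each correlation):

    * correlations of the candidate omitted variables with smoker
    pwcorr smoker alcohol nprevist, sig
    * simple regressions to check whether they are determinants of birthweight
    regress birthweight alcohol, robust
    regress birthweight nprevist, robust
    * multiple regression: heteroscedasticity test, then robust standard errors
    regress birthweight smoker alcohol nprevist
    estat hettest
    regress birthweight smoker alcohol nprevist, robust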

(b) Use your estimated regression to predict the birth weight of all the infants in your sample. Knowing their real birth weight, does the prediction seem accurate? Explain with a graph.

When comparing the real birthweight with its linear prediction in the first graph, we see that the predicted values range from about 2750 to 4500 while the real birthweight goes from 500 to almost 6000. This is not a good sign, but it also makes the graph particularly difficult to interpret since the scale is different on each axis. Thus, we created a new graph with equal scales on both axes and drew an equality line that corresponds to a perfect prediction. On this graph it is quite clear that the linear prediction is not accurate: the fitted values are far too condensed around 3500 and the prediction errors are large.
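One way to build such a graph in Stata (a sketch; bw_hat is a hypothetical name for the fitted values, and the equality line is drawn with the function plot type):

    * fitted values from the multiple regression
    regress birthweight smoker alcohol nprevist, robust
    predict bw_hat, xb
    * actual versus predicted, with the 45-degree line marking a perfect prediction
    twoway (scatter bw_hat birthweight) (function y = x, range(500 6000)), aspectratio(1)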

(c) Use the regression to predict the birth weight of Marie's child, taking into account that Marie smoked during her pregnancy, did not drink alcohol and had 8 prenatal care visits. You can use STATA or compute it manually.

Using the estimated coefficients, we can expect Marie's child to weigh 3106 grams.
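In equation form, the prediction simply plugs Marie's characteristics (smoker = 1, alcohol = 0, nprevist = 8) into the estimated regression; the values of $\hat\beta_1$ and $\hat\beta_4$ are read off the regression output, which is not reproduced here:

$$\widehat{birthweight}_{Marie} = \hat\beta_1 + \hat\beta_2 \cdot 1 + \hat\beta_3 \cdot 0 + \hat\beta_4 \cdot 8 \approx 3106 \text{ grams}$$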

(d) Compute R² and adjusted R². What do they mean? Why are they so similar?
R²            7.29%
Adjusted R²   7.19%

$$R^2 = \frac{\sum_{i=1}^{n}(\hat Y_i - \bar Y)^2}{\sum_{i=1}^{n}(Y_i - \bar Y)^2}$$

represents the percentage of the variance of birthweight
that is explained by its linear relationship with smoker, alcohol and nprevist (as expressed in the model). It is a good tool to assess how well the regression line fits the real data. Here we can see that our regression model only explains around 7% of the variation in birth weight, so our model is not very good at explaining birth weight.
The problem with $R^2$ is that it always increases when explanatory variables are added to the model, hence supposing that every independent variable is useful in explaining the variation of the dependent variable.

$$\bar R^2 = 1 - \frac{(1 - R^2)(n-1)}{n-k-1}$$

solves this issue by only accounting for the share of variation explained by independent variables that actually affect the dependent variable. As a result, the fact that the two percentages are so similar implies that all three explanatory variables are relevant in explaining the variation of birthweight. Mathematically, we can see that when $n$ is very large in comparison to $k$ (which is the case here with $n = 3000$ and $k = 3$):

$$n - 1 \approx n - k - 1 \;\Rightarrow\; \frac{n-1}{n-k-1} \approx 1 \;\Rightarrow\; \bar R^2 \approx R^2$$

In conclusion, the model does not do a very good job of explaining birthweight, but the explanatory variables used are all relevant, so the issue is more likely one of omitted variables, for example.
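As a quick consistency check, plugging the reported values into the adjusted R² formula (with n = 3000 and k = 3 as stated above) reproduces the second figure up to rounding:

$$\bar R^2 = 1 - (1 - 0.0729)\,\frac{2999}{2996} \approx 1 - 0.9271 \times 1.001 \approx 0.072$$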

(e) Construct a 95% confidence interval for $\beta_3$ (!!! there was a typo in this exercise: it should have been $\beta_3$ instead of $\hat\beta_3$, hence related mistakes will not be penalized !!!). Is the coefficient statistically significant? Explain what the coefficient means and the test used to determine its importance.
We are 95% confident that $\beta_3 \in [-172.84;\, 111.85]$. The coefficient $\beta_3$ is the impact on birth weight of the mother drinking during her pregnancy (compared to a case where she did not). To test its importance we use a single-coefficient t-test with $H_0: \beta_3 = 0$ against $H_1: \beta_3 \neq 0$. Since the confidence interval is comprised of all the values of $\beta_3$ that are not rejected by the test, and zero is part of the confidence interval for $\beta_3$, we cannot reject the hypothesis that $\beta_3 = 0$. Another option is to look at the p-value of the t-test, which is given in the regression results: we find a p-value of 67.5%, which is way above any reasonable significance level, so we again cannot reject the null hypothesis; the coefficient is not statistically significant.
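For reference, the interval above is obtained in the standard way from the point estimate and its heteroscedasticity-robust standard error, both read off the regression output:

$$\hat\beta_3 \pm 1.96 \times SE(\hat\beta_3)$$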

5. An alternative way to control for prenatal visits is to use the binary variables tripre0, tripre1, tripre2, tripre3. Estimate the following regression:

$$birthweight_i = \beta_1 + \beta_2\, smoker_i + \beta_3\, alcohol_i + \beta_4\, tripre0_i + \beta_5\, tripre2_i + \beta_6\, tripre3_i + \varepsilon_i$$

(a) Why is tripre1 excluded from the regression? What would happen if you included it in the regression?

If tripre1 were included in the regression, we would fall into the dummy variable trap. The dummy variable trap arises when there are G binary variables that are all included as regressors, each observation falls into one and only one category, and there is an intercept in the regression. This applies to our situation since every mother has exactly one of the tripre variables equal to 1. The issue is that such a situation leads to perfect multicollinearity, where one regressor is an exact function of the other regressors (here each tripre variable would be a function of the three others, since knowing that one of them equals one tells us that all the others equal zero), thus violating one of the least squares assumptions required for OLS estimation. To solve this problem, it is sufficient to remove one of the dummy variables from the regression, in this case tripre1.
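The perfect multicollinearity can be written out explicitly: because every mother falls into exactly one category, the four dummies sum to the constant regressor associated with the intercept,

$$tripre0_i + tripre1_i + tripre2_i + tripre3_i = 1 \quad \text{for every } i,$$

so keeping all four together with the intercept makes one regressor an exact linear combination of the others; dropping tripre1 breaks this identity.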

(b) Interpret the value of the estimated coefficient for tripre0.

As before, we first check for heteroscedasticity using the Breusch-Pagan test. Since we have a p-value of 0.49%, we reject the null hypothesis of homoscedasticity at the 1% significance level. For the rest of the exercise, we will thus need to use heteroscedasticity-robust standard errors.
The estimated coefficient for tripre0 is -697.97. This implies that the birth weight of infants whose mothers had no prenatal visits is expected to be about 698 grams lower than that of infants whose mothers had their first prenatal visit in the first trimester (the omitted reference category, tripre1).
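A sketch of the corresponding Stata commands (tripre1 is left out as the reference category):

    * regression with the trimester dummies, heteroscedasticity test, robust errors
    regress birthweight smoker alcohol tripre0 tripre2 tripre3
    estat hettest
    regress birthweight smoker alcohol tripre0 tripre2 tripre3, robust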

(c) Does this regression better explain birthweight than the one you ran in point 4? Explain.

Comparing the adjusted R² of the two regressions, we have:

  Model with nprevist:  adjusted R² = 7.19%
  Model with tripre:    adjusted R² = 4.49%

It is clear that the number of prenatal visits is a stronger determinant of birth weight than the trimester in which the first visit happened, since the second regression has less explanatory power than the first one. So no, this regression does not explain birthweight better.

6. Write down the most informative regressions you can create using the variables in your dataset and summarize them in table form, with the standard error below the respective regression coefficient, the adjusted R², the sample size, etc. (Hint: you should create a table similar to Table 5.2, slide 63 from the lecture on Multiple regression.)

First, using the results above, we know that including the number of prenatal visits is better than including dummies for the trimester in which the first visit happened: we already know that the most informative regressions should include nprevist instead of tripre0, tripre1, tripre2, tripre3.¹ ²


Hence, we will start by running the following regression:

$$birthweight_i = \beta_0 + \beta_1\, nprevist_i + \beta_2\, alcohol_i + \beta_3\, smoker_i + \beta_4\, unmarried_i + \beta_5\, educ_i + \beta_6\, age_i + \beta_7\, drinks_i + u_i$$


We will also use the following correlation table to help us in our decision:

¹ You don't want to include both nprevist and the tripre dummies since they are too highly correlated (they carry the same type of information).
² If you choose to include the dummy variables instead of nprevist, you have to include three of the four of them, as including only one on its own doesn't make much sense.

Using the regression results, we find that drinks is the least significant variable. Looking at the correlations, it is clear that drinks and alcohol are highly correlated. For that reason, we remove drinks from our next regression and we obtain a higher adjusted R². It is now educ that has become the least significant variable. Since it is correlated with unmarried, smoker and nprevist, we decide to remove it as well, and again we manage to improve our adjusted R². Next, it is alcohol that isn't significant enough. However, alcohol isn't strongly correlated with the other variables of the model: we choose to remove it, once again increasing our adjusted R², but we will keep it in our informative regressions. Now it is the age variable that is the least significant. Since it is strongly correlated with unmarried and correlated with nprevist and smoker, it makes sense to remove it from the regression. However, in this case, removing the age variable leads to a slightly lower adjusted R². Hence we will keep these two regressions as useful ones.³
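A sketch of the sequence of regressions described above (Stata, robust standard errors throughout; the order of removals follows the text, and the exact specifications are only an illustration):

    * full specification
    regress birthweight nprevist alcohol smoker unmarried educ age drinks, robust
    * drop drinks (least significant, highly correlated with alcohol)
    regress birthweight nprevist alcohol smoker unmarried educ age, robust
    * drop educ (correlated with unmarried, smoker and nprevist)
    regress birthweight nprevist alcohol smoker unmarried age, robust
    * drop alcohol (not significant, but kept in some of the reported regressions)
    regress birthweight nprevist smoker unmarried age, robust
    * drop age: the adjusted R-squared falls slightly, so both versions are kept
    regress birthweight nprevist smoker unmarried, robust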

Here is the table summarizing what we considered to be our most informative regressions to explain birthweight:

³ We could have directly removed these four variables since none of them was significant enough in the first regression, but it is nice to see that each removal actually improved the regression.
