
STAT-S-301 PROBLEM SET 1 SOLUTIONS

TA: Elise Petit


November 2016

1. Construct a scatterplot of birthweight on smoker. Is there a relationship between the variables? Explain.

When plotting the infant's birthweight against the smoker variable, one can see that two main lines appear: this is because the smoker variable is binary. However, we can still try to see whether there is a relationship between these two variables by looking at the distribution of each line: birthweight is more dispersed and reaches higher values for non-smoking mothers than for smoking mothers, hence we suspect a negative relationship between smoking and birthweight.
The fitted-values line represents the values of birthweight that would be predicted for different values of smoker using a simple regression model. Since smoker is a binary variable, only the two extremities can be interpreted and the line can be seen as a connector of these two points: it shows a negative impact of smoking on birthweight.

However, no strong conclusions can be drawn from that plot: we will need to conduct further analyses to see if the suspected negative relationship between smoking and birthweight is confirmed.

Conclusion: suspicion of a negative relationship, but no clear evidence.
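A minimal Stata sketch of how such a plot could be produced (assuming the dataset is already loaded and the variables are named birthweight and smoker, as in the text):

    * scatterplot of birthweight against smoker, with the OLS fitted line overlaid
    twoway (scatter birthweight smoker) (lfit birthweight smoker)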

2. What is the average birth weight for all mothers? For mothers who smoke? For mothers who do not smoke? Estimate the difference in averages between these two groups and construct a 95% confidence interval for that difference. Does smoking seem to affect birth weight? Explain.
Average birth weight (grams)
  All mothers:               3382.93
  Mothers who smoked:        3178.83
  Mothers who didn't smoke:  3432.06

The difference in average birthweight between mothers who didn't smoke and mothers who smoked is estimated to be 253.23 grams, with the true difference lying in the confidence interval [200.38, 306.07] with 95% confidence. Since zero is not included in this confidence interval, we can reject the hypothesis that there is no difference between the two subsample averages. Hence, smoking seems to significantly affect birth weight at the 95% confidence level.
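These figures could be reproduced with a short Stata sketch (assuming the data are loaded and the variables are named as in the text; the unequal option of ttest would additionally allow the two groups to have different variances):

    * overall mean, group means, and the difference-in-means test with its 95% CI
    summarize birthweight
    ttest birthweight, by(smoker)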

3. Using all observations, estimate the following regression:

$$birthweight_i = \beta_1 + \beta_2\, smoker_i + \varepsilon_i$$

(a) What is the estimated slope $\hat\beta_2$ and the estimated intercept $\hat\beta_1$? What do they mean?
After running a regular regression, we first check for heteroscedasticity using the Breusch-Pagan test. Since we have a p-value of 0.79, we cannot reject the null hypothesis of homoscedasticity at any reasonable significance level. We cannot conclude that there is heteroscedasticity in our model, but it is still safer to use heteroskedasticity-robust standard errors rather than homoskedasticity-only ones, since the robust errors remain valid whether or not the errors are homoskedastic, whereas the homoskedasticity-only ones can be misleading if heteroscedasticity is in fact present.
Our OLS estimates are $\hat\beta_2 = -253.23$, meaning that an infant whose mother smoked is expected to weigh 253.23 grams less than an infant whose mother did not smoke, and $\hat\beta_1 = 3432.06$, which is the expected birthweight for mothers who didn't smoke.
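A minimal Stata sketch of the commands behind these numbers (estat hettest runs the Breusch-Pagan/Cook-Weisberg test after a plain regression):

    * simple regression, heteroscedasticity test, then robust standard errors
    regress birthweight smoker
    estat hettest
    regress birthweight smoker, robust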

(b) How are they linked with what you did in point 2? Can you conclude that smoking has a significant effect on birth weight?

$\hat\beta_1$ corresponds to the average birthweight computed among mothers who didn't smoke, while $\hat\beta_2$ corresponds to the estimated difference in average birthweight between mothers who smoked and mothers who didn't smoke. We had already concluded that this difference was significantly different from zero, which means that smoking has an effect. We can now also use our regression results to check that smoker is a significant variable for birth weight by looking at the result of the single-coefficient significance test. Since the p-value associated with the significance test on smoker is essentially zero, we can reject the null hypothesis that the coefficient is equal to zero and again conclude that smoker has a significant impact on birthweight.

4. Estimate the following new regression:

$$birthweight_i = \beta_1 + \beta_2\, smoker_i + \beta_3\, alcohol_i + \beta_4\, nprevist_i + \varepsilon_i$$

(a) Explain why the exclusion of alcohol and nprevist could lead to omitted variable bias in the regression estimated in point 3. Does the estimated effect of smoking on birth weight differ from the estimation in point 3?

There is omitted variable bias if the omitted variable is correlated with the included regressor and is also a determinant of the dependent variable. Let's first check whether alcohol and nprevist are correlated with smoker (the regressor of the regression estimated in point 3):
Correlation with smoker
  alcohol:    12.09%
  nprevist:  -10.86%

Both variables are significantly correlated with smoker, but the correlations are quite low. We can now check whether alcohol and nprevist are determinants of birthweight by running two simple regressions: birthweight on alcohol, then birthweight on nprevist. Looking at their p-values, we see that alcohol is significant in a simple regression with birthweight only at the 6.5% significance level, while nprevist is significant at any significance level. We can therefore conclude that there is omitted variable bias, but mainly due to the exclusion of nprevist.
To look at the estimated effect of smoking on birth weight, we run the new multiple regression. However, just as before, we first check for heteroscedasticity using the Breusch-Pagan test. Since we have a p-value of 0, we reject the null hypothesis of homoscedasticity at every significance level. For the rest of the exercise, we will thus need to use heteroscedasticity-robust standard errors.
We see that the estimated effect of smoking on birth weight is now -217.58, which is smaller in absolute value than the -253.23 found in point 3. This reinforces the idea that the first regression suffered from omitted variable bias.
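A sketch of the corresponding Stata steps, assuming the same variable names (the sig option of pwcorr reports the p-value of each correlation):

    * correlations of the candidate omitted variables with smoker
    pwcorr smoker alcohol nprevist, sig
    * simple regressions to check whether they are determinants of birthweight
    regress birthweight alcohol, robust
    regress birthweight nprevist, robust
    * multiple regression: heteroscedasticity test, then robust standard errors
    regress birthweight smoker alcohol nprevist
    estat hettest
    regress birthweight smoker alcohol nprevist, robust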

(b) Use your estimated regression to predict the birth weight of all the infants in your sample. Knowing their real birth weight, does the prediction seem accurate? Explain with a graph.

When comparing the real birthweight with its linear prediction in the first graph, we see that the predicted values range from about 2750 to 4500 while the real birthweight goes from 500 to almost 6000. This is not a good sign, but it also makes the graph particularly difficult to interpret since the scale is different on each axis. Thus, we created a new graph with equal scales on both axes and drew an equality line that corresponds to a perfect prediction. On this graph it is quite clear that the linear prediction is not accurate: the fitted values are far too condensed around 3500 and the prediction errors are large.
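One way to build such a graph in Stata (a sketch; bw_hat is a hypothetical name for the fitted values, and the equality line is drawn with the function plot type):

    * fitted values from the multiple regression
    regress birthweight smoker alcohol nprevist, robust
    predict bw_hat, xb
    * actual versus predicted, with the 45-degree line marking a perfect prediction
    twoway (scatter bw_hat birthweight) (function y = x, range(500 6000)), aspectratio(1)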

(c) Use the regression to predict the birth weight of Marie's child, taking into account that Marie smoked during her pregnancy, did not drink alcohol and had 8 prenatal care visits. You can use STATA or compute it manually.

Using the estimated coefficients, we can expect Marie's child to weigh 3106 grams.
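In equation form, the prediction simply plugs Marie's characteristics (smoker = 1, alcohol = 0, nprevist = 8) into the estimated regression; the values of $\hat\beta_1$ and $\hat\beta_4$ are read off the regression output, which is not reproduced here:

$$\widehat{birthweight}_{Marie} = \hat\beta_1 + \hat\beta_2 \cdot 1 + \hat\beta_3 \cdot 0 + \hat\beta_4 \cdot 8 \approx 3106 \text{ grams}$$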

(d) Compute R² and adjusted R². What do they mean? Why are they so similar?
R²            7.29%
Adjusted R²   7.19%

$$R^2 = \frac{\sum_{i=1}^{n}(\hat Y_i - \bar Y)^2}{\sum_{i=1}^{n}(Y_i - \bar Y)^2}$$

represents the percentage of the variance of birthweight
that is explained by its linear relationship with smoker, alcohol and nprevist (as expressed in the model). It is a good tool to assess how well the regression line fits the real data. Here we can see that our regression model only explains around 7% of the variation in birth weight, so our model is not very good at explaining birth weight.
The problem with $R^2$ is that it always increases when explanatory variables are added to the model, hence supposing that every independent variable is useful in explaining the variation of the dependent variable.

$$\bar R^2 = 1 - \frac{(1 - R^2)(n-1)}{n-k-1}$$

solves this issue by only accounting for the share of variation explained by independent variables that actually affect the dependent variable. As a result, the fact that the two percentages are so similar implies that all three explanatory variables are relevant in explaining the variation of birthweight. Mathematically, we can see that when $n$ is very large in comparison to $k$ (which is the case here with $n = 3000$ and $k = 3$):

$$n - 1 \approx n - k - 1 \;\Rightarrow\; \frac{n-1}{n-k-1} \approx 1 \;\Rightarrow\; \bar R^2 \approx R^2$$

In conclusion, the model does not do a very good job of explaining birthweight, but the explanatory variables used are all relevant, so the issue is more likely one of omitted variables, for example.
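As a quick consistency check, plugging the reported values into the adjusted R² formula (with n = 3000 and k = 3 as stated above) reproduces the second figure up to rounding:

$$\bar R^2 = 1 - (1 - 0.0729)\,\frac{2999}{2996} \approx 1 - 0.9271 \times 1.001 \approx 0.072$$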

(e) Construct a 95% confidence interval for $\beta_3$ (!!! there was a typo in this exercise: it should have been $\beta_3$ instead of $\hat\beta_3$, hence related mistakes will not be penalized !!!). Is the coefficient statistically significant? Explain what the coefficient means and the test used to determine its importance.
We are 95% confident that $\beta_3 \in [-172.84;\, 111.85]$. The coefficient $\beta_3$ is the impact on birth weight of the mother drinking during her pregnancy (compared to a case where she did not). To test its importance we use a single-coefficient t-test with $H_0: \beta_3 = 0$ against $H_1: \beta_3 \neq 0$. Since the confidence interval is comprised of all the values of $\beta_3$ that are not rejected by the test, and zero is part of the confidence interval for $\beta_3$, we cannot reject the hypothesis that $\beta_3 = 0$. Another option is to look at the p-value of the t-test, which is given in the regression results: we find a p-value of 67.5%, which is way above any reasonable significance level, so we again cannot reject the null hypothesis; the coefficient is not statistically significant.
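For reference, the interval above is obtained in the standard way from the point estimate and its heteroscedasticity-robust standard error, both read off the regression output:

$$\hat\beta_3 \pm 1.96 \times SE(\hat\beta_3)$$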

5. An alternative way to control for prenatal visits is to use the binary variables tripre0, tripre1, tripre2, tripre3. Estimate the following regression:

$$birthweight_i = \beta_1 + \beta_2\, smoker_i + \beta_3\, alcohol_i + \beta_4\, tripre0_i + \beta_5\, tripre2_i + \beta_6\, tripre3_i + \varepsilon_i$$

(a) Why is tripre1 excluded from the regression? What would happen if you included it in the regression?

If tripre1 were included in the regression, we would fall into the dummy variable trap. The dummy variable trap arises when there are G binary variables that are all included as regressors, each observation falls into one and only one category, and there is an intercept in the regression. This applies to our situation since every mother has exactly one of the tripre variables equal to 1. The issue is that such a situation leads to perfect multicollinearity, where one regressor is an exact function of the other regressors (here each tripre variable would be a function of the three others, since knowing that one of them equals one tells us that all the others equal zero), thus violating one of the least squares assumptions required for OLS estimation. To solve this problem, it is sufficient to remove one of the dummy variables from the regression, in this case tripre1.
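The perfect multicollinearity can be written out explicitly: because every mother falls into exactly one category, the four dummies sum to the constant regressor associated with the intercept,

$$tripre0_i + tripre1_i + tripre2_i + tripre3_i = 1 \quad \text{for every } i,$$

so keeping all four together with the intercept makes one regressor an exact linear combination of the others; dropping tripre1 breaks this identity.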

(b) Interpret the value of the estimated coefficient for tripre0.

As before, we first check for heteroscedasticity using the Breusch-Pagan test. Since we have a p-value of 0.49%, we reject the null hypothesis of homoscedasticity at the 1% significance level. For the rest of the exercise, we will thus need to use heteroscedasticity-robust standard errors.
The estimated coefficient for tripre0 is -697.97. This implies that the birth weight of infants whose mothers had no prenatal visits is expected to be about 698 grams lower than that of infants whose mothers had their first prenatal visit in the first trimester (the omitted reference category, tripre1).
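A sketch of the corresponding Stata commands (tripre1 is left out as the reference category):

    * regression with the trimester dummies, heteroscedasticity test, robust errors
    regress birthweight smoker alcohol tripre0 tripre2 tripre3
    estat hettest
    regress birthweight smoker alcohol tripre0 tripre2 tripre3, robust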

(c) Does this regression better explain birthweight than the one you ran in point 4? Explain.

Comparing the adjusted R² of the two regressions, we have:

  Model with nprevist:  adjusted R² = 7.19%
  Model with tripre:    adjusted R² = 4.49%

It is clear that the number of prenatal visits is a stronger determinant of birth weight than the trimester in which the first visit happened, since the second regression has less explanatory power than the first one. So no, this regression does not explain birthweight better.

6. Write down the most informative regressions you can create using the variables in your dataset and summarize them in table form, with the standard error below the respective regression coefficient, the adjusted R², the sample size, etc. (Hint: you should create a table similar to Table 5.2, slide 63 from the lecture on Multiple regression.)

First, using the results above, we know that including the number of prenatal visits is better than including dummies for the trimester in which the first visit happened: we already know that the most informative regressions should include nprevist instead of tripre0, tripre1, tripre2, tripre3.¹ ²


Hence, we will start by running the following regression:

$$birthweight_i = \beta_0 + \beta_1\, nprevist_i + \beta_2\, alcohol_i + \beta_3\, smoker_i + \beta_4\, unmarried_i + \beta_5\, educ_i + \beta_6\, age_i + \beta_7\, drinks_i + u_i$$


We will also use the following correlation table to help us in our decision:

¹ You don't want to include both nprevist and the tripre dummies since they are too highly correlated (they carry the same type of information).
² If you choose to include the dummy variables instead of nprevist, you have to include three of the four of them, as including only one on its own doesn't make much sense.

Using the regression results, we find that drinks is the least significant variable. Looking at the correlations, it is clear that drinks and alcohol are highly correlated. For that reason, we remove drinks from our next regression and we obtain a higher adjusted R². It is now educ that has become the least significant variable. Since it is correlated with unmarried, smoker and nprevist, we decide to remove it as well, and again we manage to improve our adjusted R². Next, it is alcohol that isn't significant enough. However, alcohol isn't strongly correlated with the other variables of the model: we choose to remove it, once again increasing our adjusted R², but we will keep it in our informative regressions. Now it is the age variable that is the least significant. Since it is strongly correlated with unmarried and correlated with nprevist and smoker, it makes sense to remove it from the regression. However, in this case, removing the age variable leads to a slightly lower adjusted R². Hence we will keep these two regressions as useful ones.³
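A sketch of the sequence of regressions described above (Stata, robust standard errors throughout; the order of removals follows the text, and the exact specifications are only an illustration):

    * full specification
    regress birthweight nprevist alcohol smoker unmarried educ age drinks, robust
    * drop drinks (least significant, highly correlated with alcohol)
    regress birthweight nprevist alcohol smoker unmarried educ age, robust
    * drop educ (correlated with unmarried, smoker and nprevist)
    regress birthweight nprevist alcohol smoker unmarried age, robust
    * drop alcohol (not significant, but kept in some of the reported regressions)
    regress birthweight nprevist smoker unmarried age, robust
    * drop age: the adjusted R-squared falls slightly, so both versions are kept
    regress birthweight nprevist smoker unmarried, robust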

Here is the table summarizing what we considered to be our most informative regressions to explain birthweight:

³ We could have directly removed these four variables since none of them was significant enough in the first regression, but it is nice to see that each removal actually improved the regression.
