Construct a scatterplot of birthweight on smoker. Is there a relationship?
When plotting the infant's birthweight against the smoker variable, one
can see that two main lines appear: this is due to the fact that the smoker
variable is binary. However, we can still try to see if there is a relationship
between these two variables by looking at the distribution of each line:
we see that birthweight is more dispersed and reaches higher values for
non-smoking mothers compared to smoking mothers, hence we suspect a
negative relationship between smoking and birthweight.
The fitted values line represents the values of birthweight that would be
predicted for different values of smoker using a simple regression model.
Since smoker is a binary variable, only the two end points can be
interpreted, and the line can be seen as a connector of these two points: it
shows a negative impact of smoking on birthweight.
However, no strong conclusions can be drawn from that plot: we will need
to conduct further analyses to see if the suspected negative relationship
between smoking and birthweight is confirmed.
Conclusion: suspicion of negative relationship but no clear evidence.
What is the average birth weight for all mothers? For mothers who smoke? For
mothers who do not smoke? Estimate the difference in average between these
two types and construct a 95% confidence interval for that difference. Does
smoking seem to affect birth weight? Explain.
Average birth weight
All mothers                 3382.93
Mothers who smoke           3178.83
Mothers who do not smoke    3432.06
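The difference in averages and its confidence interval can be sketched as follows. This is only an illustration under stated assumptions: the helper name `diff_in_means_ci` and the toy samples are hypothetical (the exercise data are not reproduced here), the interval uses the large-sample normal critical value 1.96, and the point estimate at the end assumes that 3178.83 is the smokers' average and 3432.06 the non-smokers' average.

```python
import math
import statistics

def diff_in_means_ci(group_a, group_b, z=1.96):
    """Difference in sample means (a - b) with a large-sample confidence interval."""
    mean_a = statistics.mean(group_a)
    mean_b = statistics.mean(group_b)
    # Standard error that allows a different variance in each group
    se = math.sqrt(statistics.variance(group_a) / len(group_a)
                   + statistics.variance(group_b) / len(group_b))
    diff = mean_a - mean_b
    return diff, se, (diff - z * se, diff + z * se)

# Hypothetical toy samples, NOT the exercise data
smokers = [3000.0, 3200.0, 3100.0, 3300.0]
non_smokers = [3400.0, 3500.0, 3300.0, 3600.0]
diff, se, ci = diff_in_means_ci(smokers, non_smokers)

# Point estimate implied by the reported averages (smokers minus non-smokers)
reported_diff = round(3178.83 - 3432.06, 2)
```

With the real samples, an interval that excludes zero would indicate a statistically significant difference in average birth weight between the two groups.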
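Hence, in the regression
$birthweight_i = \beta_1 + \beta_2 \, smoker_i + \varepsilon_i$,
$b_2$ is the estimated difference in average birthweight when the mother
smoked compared to a mother who did not smoke, and $b_1 = 3432.06$ is the
average birthweight for mothers who did not smoke.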
b) How are they linked with what you did in point 2? Can you
conclude that smoking has a significant effect on birth weight?
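$b_1$ corresponds to the average birthweight of mothers who didn't smoke, and
$b_2$ corresponds to the difference in average
birthweight between mothers who didn't smoke and mothers who smoked.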
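The link between the regression coefficients and the group averages can be checked numerically. The sketch below is a minimal illustration on hypothetical toy data (the function name `simple_ols` and the numbers are not from the exercise): with a binary regressor, the OLS intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means.

```python
import statistics

def simple_ols(x, y):
    """OLS intercept b1 and slope b2 for y = b1 + b2 * x."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    return b1, b2

# Hypothetical toy data: smoker is 0/1, birthweight is the outcome
smoker = [0, 0, 0, 1, 1]
birthweight = [3400.0, 3500.0, 3600.0, 3100.0, 3300.0]

b1, b2 = simple_ols(smoker, birthweight)
mean_non_smokers = statistics.mean(w for s, w in zip(smoker, birthweight) if s == 0)
mean_smokers = statistics.mean(w for s, w in zip(smoker, birthweight) if s == 1)
```

Here `b1` matches `mean_non_smokers` and `b2` matches `mean_smokers - mean_non_smokers`, which is why the regression reproduces the comparison of averages.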
(a) Explain why the exclusion of alcohol and nprevist could lead
to omitted variable bias in the regression estimated in point 3.
Does the estimated effect of smoking on birth weight differ from
the estimation in point 3?
There is omitted variable bias if the omitted variable is correlated with
the included regressor and is also a determinant of the dependent variable.
Let's first check if alcohol and nprevist are correlated with smoker (the
regressor of the regression estimated in point 3):
Correlation with smoker
alcohol      12.09%
nprevist    -10.86%
Both variables are significantly correlated with smoker, but the correlation
is quite low.
We can now check if alcohol and nprevist are determinants of birthweight
by running two simple regression models: birthweight against alcohol,
then birthweight against nprevist. Looking at their p-values, we see that
alcohol is significant in a simple regression with birthweight only at the
6.5% significance level, while nprevist is significant at any significance
level. We can thus conclude that there is omitted variable bias, but mainly
due to the exclusion of nprevist.
To look at the estimated effect of smoking on birth weight, we run the
new multiple regression. However, just as before, we first check for
heteroscedasticity using the Breusch-Pagan test. Since we obtain a p-value of 0,
we reject the null hypothesis of homoscedasticity at every significance level.
For the rest of the exercise, we will thus need to use
heteroscedasticity-robust standard errors.
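To illustrate what heteroscedasticity-robust standard errors change, here is a minimal sketch for the one-regressor case only (the exercise itself uses a multiple regression). The function name and the toy data are hypothetical, and the robust formula shown is the common HC1 variant with the n/(n-2) degrees-of-freedom correction, which is an assumption about the variant used.

```python
import math

def slope_standard_errors(x, y):
    """Slope of y on x with homoskedasticity-only and HC1 robust standard errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b1 = y_bar - b2 * x_bar
    resid = [yi - b1 - b2 * xi for xi, yi in zip(x, y)]

    # Homoskedasticity-only: one common error variance s2 for all observations
    s2 = sum(u ** 2 for u in resid) / (n - 2)
    se_classic = math.sqrt(s2 / sxx)

    # HC1: each squared residual is weighted by its own (x - x_bar)^2
    weighted = sum(((xi - x_bar) ** 2) * (u ** 2) for xi, u in zip(x, resid))
    se_robust = math.sqrt(n / (n - 2) * weighted / sxx ** 2)
    return b2, se_classic, se_robust

# Hypothetical toy data in which the x = 1 group is much noisier
x = [0, 0, 0, 1, 1]
y = [1.0, 3.0, 2.0, 4.0, 10.0]
b2, se_classic, se_robust = slope_standard_errors(x, y)
```

The point estimate is identical under both formulas; only the standard error, and therefore the t-statistics and confidence intervals, changes.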
We see that the estimated effect of smoking on birth weight is now -217.58,
which is smaller in magnitude than in the biased regression of point 3. This
reinforces the idea that the first regression had omitted variable bias.
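One way to read this change is through the usual large-sample expression for omitted variable bias in a simple regression. The symbols below are not in the original and are introduced only for this sketch: $X$ stands for smoker, $u$ for the error term of the regression without controls (which absorbs the omitted variables), and $\rho_{Xu}$ for the correlation between the two.

```latex
\hat{\beta}_2 \xrightarrow{\;p\;} \beta_2 + \rho_{Xu}\,\frac{\sigma_u}{\sigma_X}
```

If more prenatal visits go with a higher birthweight (an assumption here, not a result reported above), the negative correlation of -10.86% between smoker and nprevist makes $\rho_{Xu}$ negative, so the regression without controls overstates the harm of smoking. That matches the direction of the change observed once alcohol and nprevist are included.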
When comparing the real birthweight with its linear prediction in the first
graph, we see that the predicted values range from 2750 to 4500 while
the real birthweight goes from 500 to almost 6000.
The scale is therefore different on each axis, which makes the graph
particularly difficult to interpret. Thus, we created a new graph with equal
scale on each axis and drew an equality line that corresponds to a perfect
prediction. On this graph it is quite clear that the linear prediction is not
accurate, since the fitted values are far too concentrated around 3500 and
the prediction errors are large.
$R^2 = 7.29\%$, $\bar{R}^2 = 7.19\%$
$$R^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$
The $R^2$ is the fraction of the variation of birthweight
that is explained by its linear relationship with smoker, alcohol and
nprevist (as expressed in the model). It is a good tool to assess how well the
regression line fits the real data. Here we can see that our regression
model only explains around 7% of the variation in birth weight, so our
model is not very good at explaining birth weight.
The problem with $R^2$ is that it never decreases when variables are added
to the model, hence supposing that every independent
variable is useful in explaining the variation of the dependent variable.
The adjusted $\bar{R}^2$,
$$\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1},$$
corrects for this by penalizing the number $k$ of independent variables.
As a result, the fact that the two percentages are so
similar implies that all three explanatory variables are relevant in
explaining the variation of birthweight. Mathematically, we can see that when
$n$ is very high in comparison to $k$ (here $n = 3000$ and $k = 3$):
$n - 1 \approx n - k - 1$, so $\frac{n-1}{n-k-1} \approx 1$
and $\bar{R}^2 \approx R^2$.
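As a quick numerical check of the adjustment, the sketch below plugs the reported $R^2 = 7.29\%$, $n = 3000$ and $k = 3$ into the adjusted $R^2$ formula. The function name is hypothetical, and because the reported $R^2$ is rounded, the result can differ from the reported 7.19% in the last digit.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a regression with n observations and k regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Values taken from the text: R2 = 7.29%, n = 3000, k = 3
correction = (3000 - 1) / (3000 - 3 - 1)   # very close to 1 when n >> k
value = adjusted_r2(0.0729, n=3000, k=3)   # roughly 0.072
```

The correction factor is about 1.001, which shows how little the penalty matters with three regressors and three thousand observations.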
In conclusion, the model does not do a very good job of explaining
birthweight, but the explanatory variables used are all relevant, so the
issue might rather be one of omitted variables, for example.
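Confidence interval: $\beta_3 \in [-172.84;\ 111.85]$.
$\beta_3$ represents the
impact on the birth weight if the mother drank during her pregnancy
(compared to a case where she did not). To test its importance we use a
two-sided test:
$H_0 : \beta_3 = 0$
$H_1 : \beta_3 \neq 0$
The confidence interval contains the values of $\beta_3$ that
cannot be rejected at the corresponding significance level. Since it includes
$\beta_3 = 0$, we cannot reject $H_0$.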
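The reasoning can be retraced from the interval alone. The sketch below reads the interval as [-172.84; 111.85] (the lower bound has to be negative for the interval to be ordered) and assumes it is a 95% interval built with the normal critical value 1.96; that level is an assumption made here for illustration.

```python
# Interval bounds for the coefficient on alcohol, as read from the text
lower, upper = -172.84, 111.85

estimate = (lower + upper) / 2      # midpoint = point estimate
half_width = (upper - lower) / 2

z = 1.96                            # ASSUMPTION: 95% interval, normal critical value
std_error = half_width / z          # implied standard error
t_stat = estimate / std_error       # t-statistic for H0: beta_3 = 0

contains_zero = lower <= 0 <= upper
reject_h0 = abs(t_stat) > z
```

Checking whether zero lies inside the interval and comparing the absolute t-statistic to the critical value are two views of the same test, so `contains_zero` and `reject_h0` always disagree.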
An alternative way to control for prenatal visits is to use the binary variables
tripre0, tripre1, tripre2, tripre3. Estimate the following regression:
With all four dummies included, there is always at
least one of the tripre variables that is equal to 1. The issue is that such a
situation leads to a problem of perfect multicollinearity, where one of the
regressors is a function of other regressors (here each tripre variable would
be a function of the three others, since knowing the value of one of
them directly gives us the value of the others: if one of them is equal to
one, all others have to be equal to zero), thus violating one of the least
squares assumptions that are required for the OLS estimation. To solve
this, one of the tripre variables has to be left out of the regression.
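The dependence between the four dummies is easy to see on a few rows. The sketch below uses hypothetical toy observations (not the exercise data) and assumes, as described above, that exactly one of the four variables equals 1 for each mother.

```python
# Hypothetical toy rows: (tripre0, tripre1, tripre2, tripre3)
rows = [
    (1, 0, 0, 0),
    (0, 1, 0, 0),
    (0, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 0, 0, 1),
]

# The four dummies always add up to 1, i.e. to the constant regressor
sums_to_constant = all(sum(r) == 1 for r in rows)

# So any one dummy is an exact linear function of the three others
tripre0_recovered = [1 - r[1] - r[2] - r[3] for r in rows]
tripre0_actual = [r[0] for r in rows]
```

Because `tripre0_recovered` equals `tripre0_actual` on every row, including all four dummies together with an intercept gives OLS no way to separate their effects.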
(c) Does this regression better explain birthweight than the one
you ran in point 4? Explain.
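By comparing the adjusted $\bar{R}^2$ of the two regressions:
$\bar{R}^2 = 7.19\%$ for the regression of point 4
$\bar{R}^2 = 4.49\%$ for this regression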
It is clear that the number of prenatal visits is more determinant for the
birth weight than the trimester in which the first visit happened, since the
second regression has less explanatory power than the first one. So no,
this regression does not better explain birthweight.
Write down the most informative regressions you can create using the
variables in your dataset and summarize them in table form, with the
standard error below the respective regression coefficient, the adjusted
$\bar{R}^2$, sample size, etc. (Hint: you should create a table similar to
Table 5.2., slide 63 from the lecture on Multiple regression).
First, using the results above, we know that including the number of
prenatal visits is better than including dummies for the trimester in which the
first visit happened: we already know that the most informative regression
will include nprevist rather than the tripre variables (see notes 1 and 2).
1 You don't want to include both since they are too highly correlated (they carry the same
type of information)
2 If you chose to include the dummy variables instead of nprevist, you have to include three
out of four of them as including only one on its own doesn't make much sense
Using the regression results, we have that drinks is the least significant
variable. [...]
are highly correlated. For that reason, we remove drinks from our next
regression and we obtain a higher adjusted $\bar{R}^2$. [...] is then the
least significant variable. [...]
and nprevist, we decide to remove it also and again we manage to improve our
adjusted $\bar{R}^2$. [...]
alcohol isn't strongly correlated with the other variables of the model: we
choose to remove it, once again increasing our adjusted $\bar{R}^2$. [...]
our informative regressions (see note 3).
3 We could have directly removed these four variables since none of them was
significant enough in the first regression, but it is nice to see that each
removal actually improved the regression.