(AS13)
EPM304 Advanced Statistical Methods in Epidemiology
This document contains a copy of the study material located within the computer
assisted learning (CAL) session.
If you have any questions regarding this document or your course, please contact
DLsupport via DLsupport@lshtm.ac.uk.
Important note: this document does not replace the CAL material found on your
module CDROM. When studying this session, please ensure you work through the
CDROM material first. This document can then be used for revision purposes to
refer back to specific sessions.
These study materials have been prepared by the London School of Hygiene & Tropical Medicine as part of
the PG Diploma/MSc Epidemiology distance learning course. This material is not licensed either for resale
or further copying.
London School of Hygiene & Tropical Medicine September 2013 v1.1
SC14
SC15
This study was set up to compare babies conceived following in-vitro fertilization to
those from the general population. The data used here refer to the records of 641
singleton births following in-vitro fertilization (IVF).
[Figure: scatter plot against x illustrating r = 0.9]
Interaction: Pulldown: r = 0.7:
[Figure: scatter plot against x illustrating r = 0.7]
Interaction: Pulldown: r = 0.3:
[Figure: scatter plot against x illustrating r = 0.3]
Interaction: Pulldown: r = -0.5:
[Figure: scatter plot against x illustrating r = -0.5]
The correlation coefficient tells us the strength and direction of the linear
relationship, but it does not quantify that relationship: for example, by how much
one variable changes, on average, for a unit change in the other. Nor does it allow
us to quantify the relationship between two variables while adjusting for
confounding by a third variable. It also assumes a linear relationship between the
two variables, which may not be the case. For such situations we need linear
regression.
Note that a is the predicted value of y when x is zero and b is the gradient (slope)
of the line, i.e. a one unit change in x is predicted to lead to a change in y of b
units. In our example, a would be the predicted birth weight for a gestational age of
zero and b would be the increase in birth weight in grams given by a one week
increase in gestational age.
[Figure: scatter plot of birthweight (grams) against gestational age in weeks, with the fitted line]
The vertical distance from each point to the line is called a residual. Press swap to
see these illustrated.
Interaction: Button:
Swap: Output (changes to figure below):
[Figure: birthweight (grams) against gestational age in weeks, with the residuals marked; labelled residuals: -133.6, 555.2, -666.0, -800.3, 26.6, -305.8, 623.3]
$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$
$$a = \bar{y} - b\bar{x}$$
Note: these equations are given for information and you are not expected to
memorise them.
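As an illustration only, the two formulas can be evaluated directly. The sketch below uses a small set of made-up values (not the IVF dataset); the function name `least_squares_line` and the variables `weeks` and `grams` are hypothetical.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (a, b) for the line y = a + b*x that minimises the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # b = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # a = y_bar - b * x_bar
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, invented for illustration (NOT the IVF dataset)
weeks = [30, 34, 36, 38, 39, 40, 41]                # gestational age in weeks
grams = [1400, 2200, 2600, 3000, 3300, 3400, 3600]  # birth weight in grams

a, b = least_squares_line(weeks, grams)
residuals = np.asarray(grams) - (a + b * np.asarray(weeks))
print(f"a = {a:.1f}, b = {b:.1f}")
print("sum of residuals is (numerically) zero:", abs(residuals.sum()) < 1e-6)
```

The residuals from a least squares line with an intercept always sum to zero, which is a quick check that the fit has been computed correctly.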
[Figure: scatter plot of birthweight (grams) against gestational age in weeks]
In this example, a will be the predicted value of the birth weight for a baby with a
gestational age of zero weeks. This does not make much sense, so we first centre
the gestational age around a central point, in this case the mean gestational age in
our sample, which was 38.7 weeks, i.e. we subtract 38.7 from each gestational age in
our dataset. This is called mean-centring. Now a represents the predicted birth
weight in grams at the average gestational age in our sample of 38.7 weeks.
           Coef.       Std. Err.   t        P>|t|    [95% Conf. Interval]
mgest      206.6412    7.484572    27.61    0.000    191.9439    221.3386
_cons      3129.137    17.42493    179.58   0.000    3094.92     3163.354
Note: the mean birth weight in our dataset is 3129.137, i.e. the same as a, because
the regression line always goes through the point $(\bar{x}, \bar{y})$.
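A minimal sketch of the effect of mean-centring, using made-up values rather than the IVF data (the arrays `weeks` and `grams` are hypothetical): the slope is unchanged, and the intercept becomes the mean of the outcome.

```python
import numpy as np

# Hypothetical data, invented for illustration (NOT the IVF dataset)
weeks = np.array([30, 34, 36, 38, 39, 40, 41], dtype=float)
grams = np.array([1400, 2200, 2600, 3000, 3300, 3400, 3600], dtype=float)

# Mean-centre the explanatory variable
mgest = weeks - weeks.mean()

# np.polyfit returns coefficients from the highest degree down: slope, then intercept
b_raw, a_raw = np.polyfit(weeks, grams, 1)
b_cen, a_cen = np.polyfit(mgest, grams, 1)

print(f"uncentred: a = {a_raw:.1f}, b = {b_raw:.1f}")
print(f"centred:   a = {a_cen:.1f}, b = {b_cen:.1f}")
print(f"mean of y: {grams.mean():.1f}")
```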
How would you interpret the coefficient of mgest (the mean-centred gestational
age)?
Interaction: thought bubble:
Output (appears below):
For every increase in gestational age of one week, the predicted birth weight
increases by 206.6 grams.
The estimate of the slope, b, is the expected change in the outcome (i.e. birth
weight) for a unit increase in the exposure variable (i.e. gestational age). Here it is
estimated to be 206.6 grams. The output gives the 95% confidence interval for this
parameter to be 191.9g to 221.3g.
If there were no association between gestational age and birth weight, the true value
of the parameter would be zero and the points on the scatter plot would be randomly
scattered about the mean value of birth weight. However, based on this analysis the
lower limit of the 95% confidence interval is substantially above zero, indicating
that there is strong evidence for an association.
We can confirm this by looking at the Wald test, which compares the ratio of the
parameter estimate to its standard error with a t distribution. The larger the value of
b compared to its standard error, the larger the test statistic and the smaller the
P-value (stronger evidence of an association). The test statistic t is given as 27.6
under the column labelled t in the output.
The null hypothesis of the test is that b is zero, or in other words that there is no
association between birth weight and gestational age. The P-value is reported as
P<0.001 under P>|t|, confirming that there is very strong evidence of an association
between birth weight and gestational age, when the relationship is modelled as
linear.
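The test statistic and confidence interval can be reproduced from the coefficient and standard error in the output. This is a sketch under one assumption: that all 641 births were used in the fit, giving 641 - 2 = 639 residual degrees of freedom. It uses `scipy.stats.t` for the t distribution.

```python
from scipy import stats

coef = 206.6412   # slope for mgest, from the output
se = 7.484572     # its standard error, from the output
n = 641           # assumption: all 641 births were used in the fit
df = n - 2        # residual degrees of freedom for a model with intercept and slope

t_stat = coef / se                          # Wald test statistic
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-sided P-value
t_crit = stats.t.ppf(0.975, df)             # multiplier for a 95% confidence interval
lower = coef - t_crit * se
upper = coef + t_crit * se

print(f"t = {t_stat:.2f}, P < 0.001: {p_value < 0.001}")
print(f"95% CI: {lower:.1f} to {upper:.1f}")
```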
We can see from the output that the best prediction of birth weight will be given
by the equation:
Birth weight = 3129.1 + 206.6*mgest
= 3129.1 + 206.6*(gestational age - 38.7)
What is the best prediction of the birth weight for a gestational age of 30 weeks
(to the nearest gram)?
Interaction: Calculation: (calc)
Output(appears in new window):
Incorrect answer:
No. The best prediction is:
= 3129.1 + 206.6*(gestational age - 38.7)
= 3129.1 + 206.6*(30 - 38.7)
= 3129.1 - 206.6*8.7
= 1331.68
= 1332 to the nearest gram
Correct answer:
Correct
Yes. The best prediction is:
= 3129.1 + 206.6*(gestational age - 38.7)
= 3129.1 + 206.6*(30 - 38.7)
= 3129.1 - 206.6*8.7
= 1331.68
= 1332 to the nearest gram
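The same calculation can be written as a small function. The coefficients are the rounded values from the fitted equation; the function and constant names are hypothetical.

```python
MEAN_GEST_WEEKS = 38.7     # sample mean gestational age used for centring
INTERCEPT_G = 3129.1       # predicted birth weight (grams) at the mean gestational age
SLOPE_G_PER_WEEK = 206.6   # predicted change in birth weight per extra week

def predict_birth_weight(gestational_age_weeks):
    """Predicted birth weight in grams from the fitted linear equation."""
    return INTERCEPT_G + SLOPE_G_PER_WEEK * (gestational_age_weeks - MEAN_GEST_WEEKS)

print(round(predict_birth_weight(30)))  # 1332
```

Predictions far from the bulk of the data, such as at very low gestational ages, rely on the linear relationship holding there.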
[Figure: histogram of standardized residuals (vertical axis: density)]
Constancy of the variance of the residuals can be checked from a scatter plot of the
residuals versus the predicted values $\hat{y}$ (also called fitted values), i.e. for
each point on the graph below, the fitted value is plotted against the residual.
[Figure: birthweight (grams) against gestational age in weeks, with a residual marked]
[Figure: standardized residuals plotted against fitted values]
The residuals should be randomly scattered about zero with constant variance
over the predicted values.
y , and no evidence of a
y .
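A rough sketch of how the quantities in a residuals-versus-fitted plot could be computed, using made-up data. Two assumptions are worth flagging: the data are hypothetical, and the residuals are scaled simply by the Root MSE, which ignores the leverage adjustment that statistical packages usually apply when standardising.

```python
import numpy as np

# Hypothetical data, invented for illustration (NOT the IVF dataset)
x = np.array([30, 34, 36, 38, 39, 40, 41], dtype=float)
y = np.array([1400, 2200, 2600, 3000, 3300, 3400, 3600], dtype=float)

b, a = np.polyfit(x, y, 1)       # slope, then intercept
fitted = a + b * x               # predicted (fitted) values
residuals = y - fitted           # observed minus fitted

# Root MSE: square root of residual sum of squares / (n - 2)
root_mse = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))
scaled = residuals / root_mse    # simplified standardisation (no leverage term)

for f, r in zip(fitted, scaled):
    print(f"fitted = {f:7.1f}   scaled residual = {r:6.2f}")
```

Plotting `scaled` against `fitted` and looking for a funnel shape or a trend is the visual check described above.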
This scatter plot shows a roughly linear relationship between y and x and so linear
regression is appropriate.
Interaction: Pulldown: remote point:
There is a remote point, an observation which is far from the range of the other
data.
Interaction: Pulldown: outlier:
There is an outlier, a point which is not well fitted by the model. Remote points
and outliers can change the regression parameters substantially, and in this case
they are known as influential points. In order to identify influential points the
model can be re-fitted with one observation left out each time, or they can be
spotted by eye in a scatter plot and then the regression parameters estimated with
and without the observation in the model. The first step once an influential point
has been identified would be to check whether there has been a data-entry error, if
this check is possible. If there is no data entry error, the observation should not
be removed from the data unless there is very good reason to do so. One option is
to report the results of the analysis with the observation included and with it
removed in order to demonstrate how sensitive the results are to this observation.
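The leave-one-out approach can be sketched as follows. The data are made up so that one observation is both remote and poorly fitted; the function name `leave_one_out_slopes` is hypothetical.

```python
import numpy as np

def leave_one_out_slopes(x, y):
    """Re-fit the line with each observation omitted in turn; return the slopes."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slopes = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i          # mask that drops observation i
        b, _a = np.polyfit(x[keep], y[keep], 1)
        slopes.append(b)
    return np.array(slopes)

# Hypothetical data: the first six points lie exactly on y = 2x,
# the last point is far from the others and off that line
x = [1, 2, 3, 4, 5, 6, 20]
y = [2, 4, 6, 8, 10, 12, 5]

full_slope, _ = np.polyfit(x, y, 1)
loo = leave_one_out_slopes(x, y)
change = loo - full_slope
most_influential = int(np.argmax(np.abs(change)))

print(f"slope with all points: {full_slope:.3f}")
print(f"largest change when omitting observation {most_influential}: "
      f"slope becomes {loo[most_influential]:.3f}")
```

Reporting the estimates with and without the flagged observation, as suggested above, shows how sensitive the results are to it.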
Interaction: Pulldown: non-linear:
In this example, the values of y seem to initially decrease with increasing values of
x, but then start to increase with even greater values of x. Hence, a linear regression
will not give a good fit to such data and would be inappropriate.
Interaction: Pulldown: two clusters:
There are two clusters of points, with what appears to be random scatter in each
cluster. This may suggest some kind of threshold in x, below which y takes one
average value and above which y takes another average value.
Interaction: Pulldown: two lines:
There appear to be two different lines shown here. It may well be that the value of y
depends on x and on another binary variable.
Categories of       Freq.    Mean of birth    Standard deviation
gestational age              weights          of birth weights
<38 wks             157      2,445.567        705.815
38-39 wks           130      3,186.023        408.357
39-40 wks           166      3,365.193        456.661
>40 wks             188      3,452.223        441.366
The means of birth weight for the four groups increase sequentially from around 2.5
kg to 3.5 kg.
           Coef.       Std. Err.   t        P>|t|    [95% Conf. Interval]
gestcat
  2        740.4562    61.27115    12.08    0.000    620.1383    860.7741
  3        919.6259    57.522      15.99    0.000    806.6702    1032.582
  4        1006.657    55.86212    18.02    0.000    896.9604    1116.353
_cons      2445.567    41.23697    59.31    0.000    2364.59     2526.544
The intercept, estimated as 2445.567, is the predicted mean value of birth weight in
the baseline category of gestational age (i.e. the <38 week category).
The value 740.4562 for gestcat 2 is the predicted increase in birth weight among
babies in the next gestational age category (38-39 weeks) compared to the baseline
category (<38 weeks).
Similarly 919.6259 and 1006.657 are the predicted increases for the next two
categories (39-40 weeks and >40 weeks), respectively, each compared to the
baseline category (<38 weeks).
The estimated increases can be added to the estimated mean in the baseline
category to get the predicted mean value of birth weight in the four categories.
The resulting equations for the predicted values are
$\hat{y}$ = 2445.567,
$\hat{y}$ = 2445.567 + 740.4562 = 3186.023,
$\hat{y}$ = 2445.567 + 919.6259 = 3365.193 and
$\hat{y}$ = 2445.567 + 1006.657 = 3452.223 respectively for babies in the first, second,
third and fourth categories.
These values correspond exactly to those shown in the column headed Mean of birth
weights in the previous table.
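The additions can be checked with a short script. The numbers are taken from the regression output and the table of means; the dictionary names are hypothetical.

```python
baseline_mean = 2445.567   # _cons: predicted mean birth weight in the <38 wks category

# Coefficients for gestcat 2, 3 and 4: predicted increases relative to the baseline
increase_vs_baseline = {
    "38-39 wks": 740.4562,
    "39-40 wks": 919.6259,
    ">40 wks": 1006.657,
}

predicted_means = {"<38 wks": baseline_mean}
for category, increase in increase_vs_baseline.items():
    predicted_means[category] = baseline_mean + increase

for category, mean in predicted_means.items():
    print(f"{category:>10}: {mean:.3f} g")
```

With a single categorical explanatory variable, the fitted values are simply the observed group means, which is why the two sets of numbers agree (up to rounding in the last decimal place).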
Interaction: Button: Show:
Output (appears below):
Categories of       Freq.    Mean of birth    Standard deviation
gestational age              weights          of birth weights
<38 wks             157      2445.567         705.8151
38-39 wks           130      3186.023         408.3571
39-40 wks           166      3365.193         456.661
>40 wks             188      3452.223         441.3662
[Figure: birthweight (grams) against gestational age in weeks]
You can click swap to add the original line to the same graph.
Interaction: Button: Swap:
Output (figure changes to this and text appears below):
[Figure: birthweight (grams) against gestational age in weeks, with the original regression line added]
We can see that the categorical variable gives a reasonable fit to the data in the
middle values of gestational age, but does particularly poorly at younger gestational
ages.
           Coef.       Std. Err.   t        P>|t|    [95% Conf. Interval]
mgest      193.2323    10.73898    17.99    0.000    172.1443    214.3203
mgest2     -2.796836   1.608682    -1.74    0.083    -5.955789   .3621158
_cons      3144.296    19.46009    161.58   0.000    3106.083    3182.51
The graph below shows both the fitted linear and quadratic relationships. We can see
that between 32 and 42 weeks gestation the lines are virtually identical, and it is
only for babies born before 32 weeks that there appears to be any curvature. There
are very few babies in the dataset born before about 32 weeks. It is unsurprising
that there is little statistical evidence of departure from a linear relationship (P=0.08
for mgest2 in the output).
[Figure: birthweight (grams) against gestational age in weeks, showing the fitted linear and quadratic relationships]
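The two fitted curves can be compared numerically from the reported coefficients. This sketch assumes that mgest2 is the square of the mean-centred gestational age (mgest), centred at 38.7 weeks as before; the function names are hypothetical.

```python
MEAN_GEST_WEEKS = 38.7  # centring value used for mgest

def linear_fit(weeks):
    """Predicted birth weight (grams) from the linear model."""
    m = weeks - MEAN_GEST_WEEKS
    return 3129.137 + 206.6412 * m

def quadratic_fit(weeks):
    """Predicted birth weight (grams) from the quadratic model.

    Assumption: mgest2 = mgest squared.
    """
    m = weeks - MEAN_GEST_WEEKS
    return 3144.296 + 193.2323 * m - 2.796836 * m ** 2

for weeks in (28, 32, 36, 40):
    lin, quad = linear_fit(weeks), quadratic_fit(weeks)
    print(f"{weeks} weeks: linear = {lin:7.1f}, quadratic = {quad:7.1f}, "
          f"difference = {lin - quad:6.1f}")
```

The gap between the two predictions grows as gestational age moves below the range where most of the data lie.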
           Coef.       Std. Err.   t        P>|t|    [95% Conf. Interval]
mgest      206.4446    7.363321    28.04    0.000    191.9853    220.9039
gender     -161.7075   34.29034    -4.72    0.000    -229.0431   -94.37194
_cons      3208.604    24.03779    133.48   0.000    3161.401    3255.806
The value for mgest of 206.4 gives us the estimated increase in birth weight for each
additional week of gestation, after adjusting for gender. The value -161.7 shows
that females (coded 1 in the data) are estimated to be 161.7 g lighter than males
(coded 0 in the data), after adjusting for week of gestation. Note that gender is not
a confounder between gestational age and birth weight here, because adjusting for it
barely changes the mgest estimate (from 206.6412 to 206.4446). However, we can
see from the p-value and confidence interval that gender is independently associated
with birth weight.
What is the regression equation fitted here?
Interaction: thought bubble:
Output (appears below):
Birth weight = 3208.6 + 206.4*mgest - 161.7*gender
= 3208.6 + 206.4*(gestational age - 38.7) - 161.7*gender
where gender takes the value 0 for males and 1 for females
[Figure: birthweight (grams) against gestational age in weeks, with fitted lines for males and females; the difference between the lines is 161.7g]
This makes clear that we are fitting two parallel lines to the data, with equal
gradients but different intercepts.
What are the regression equations for males and females separately?
Interaction: thought bubble:
Output (appears below):
Males
Birth weight = 3208.6 + 206.4*(gestational age - 38.7) - 161.7*0
= 3208.6 + 206.4*(gestational age - 38.7)
Females
Birth weight = 3208.6 + 206.4*(gestational age - 38.7) - 161.7*1
= 3208.6 - 161.7 + 206.4*(gestational age - 38.7)
= 3046.9 + 206.4*(gestational age - 38.7)
So we can see that the gradients are the same, but the intercepts are different.
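A short sketch of the parallel-lines model, using the rounded coefficients from the fitted equation (the function name is hypothetical). The difference between the male and female predictions is the same at every gestational age.

```python
def predict_birth_weight(gestational_age_weeks, female):
    """Predicted birth weight (grams); female is 1 for females and 0 for males."""
    return 3208.6 + 206.4 * (gestational_age_weeks - 38.7) - 161.7 * female

for weeks in (30, 35, 40):
    males = predict_birth_weight(weeks, 0)
    females = predict_birth_weight(weeks, 1)
    print(f"{weeks} weeks: males = {males:.1f}, females = {females:.1f}, "
          f"difference = {males - females:.1f}")
```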
y is determined by x.
$$\sum_i (y_i - \hat{y}_i)^2$$
This is obtained from an analysis of variance table. For example, the analysis of
variance table produced when fitting the model with gestational age categories
was:
Source      SS           df     MS
Model       102656030    3      34218676.7
Residual    170064092    637    266976.596
Total       272720122    640    426125.19
If the residual sum of squares were zero, the regression line would fit the data
perfectly, that is, every observed point would lie on the fitted line. By contrast,
larger values indicate worse fits, since the deviations of the observed points
from the regression line will be larger. Two possible factors contribute to a high
residual sum of squares: either there is a lot of variation in the data, i.e. $\sigma^2$
is large, or the model does not explain much of the variation observed.
Residual SS / residual df = 170064092/637 = 266976.596
This value is known as the residual mean square (MS, shown in the last column of
the table), and its square root is called the Root MSE (root mean square error),
here 516.7.
The larger the Model MS compared to the Residual MS, the better the model fits the
data. Under the null hypothesis of no effect of gestational age, the Model MS and
the Residual MS are two independent estimates of $\sigma^2$ and their ratio is
expected to be 1. We can then test the significance of the current model by dividing
these two terms.
$$F = \frac{\text{Model MS}}{\text{Residual MS}}$$
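As a worked illustration, the mean squares, Root MSE and F statistic can be computed from the analysis of variance table for the model with gestational age categories. The P-value uses `scipy.stats.f`; the variable names are hypothetical.

```python
from math import sqrt
from scipy import stats

# Values from the analysis of variance table (gestational age categories model)
model_ss, model_df = 102656030, 3
resid_ss, resid_df = 170064092, 637

model_ms = model_ss / model_df   # Model mean square
resid_ms = resid_ss / resid_df   # Residual mean square
root_mse = sqrt(resid_ms)        # Root MSE

f_stat = model_ms / resid_ms
# Upper-tail probability of the F distribution with (model_df, resid_df) degrees of freedom
p_value = stats.f.sf(f_stat, model_df, resid_df)

print(f"Residual MS = {resid_ms:.3f}, Root MSE = {root_mse:.1f}")
print(f"F({model_df}, {resid_df}) = {f_stat:.2f}, P < 0.001: {p_value < 0.001}")
```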
Source      SS           df     MS
Model       106953800    4      26738450
Residual    165766322    636    260638.871
Total       272720122    640    426125.19
bweight    Coef.       Std. Err.   t        P>|t|    [95% Conf. Interval]
gestcat
  2        736.1429    60.54885    12.16    0.000    617.243     855.0427
  3        909.2086    56.89301    15.98    0.000    797.4877    1020.929
  4        1012.249    55.21226    18.33    0.000    903.8286    1120.669
gender     -164.2448   40.44732    -4.06    0.000    -243.6713   -84.81839
_cons      2528.212    45.54496    55.51    0.000    2438.776    2617.649
The analysis of variance table for this model is at the top. For this model, our null
hypothesis is that there is no effect of gestational age (considered in categories) or
gender. A test of whether the true effects of both explanatory variables are zero is
made by examining F(4, 636) = 102.59 and referring this to the F distribution. The
P-value for this hypothesis is again <0.001, indicating strong evidence that the two
exposure variables, gestational age category and gender, taken together help to
explain the variability in the data.
Section 9: Summary
This is the end of AS13.
The main points of this session will appear below as you click on the relevant title.
Fitting regression equation
Fitting a linear regression equation to data from two quantitative variables uses
one of two approaches, both of which give the same result: least squares
minimises the sum of squares of the residuals, while maximum likelihood assumes
the residuals follow a Normal distribution with mean 0 and constant variance $\sigma^2$.
Checking assumptions
Linearity can be checked by a scatter-plot of the outcome and explanatory
variables, Normality of residuals can be checked with a histogram of standardised
residuals and constancy of the variance of the residuals can be checked by
examining a scatter-plot of the residuals plotted against the fitted values.
Allowing for a non-linear relationship
Two common ways of allowing for a non-linear relationship are either to
categorise the explanatory variable or to allow for a quadratic relationship by
including both the explanatory variable and its square in the regression.