As 13

Analysis of quantitative outcomes
(AS13)
EPM304 Advanced Statistical Methods in Epidemiology
Course: PG Diploma/ MSc Epidemiology
This document contains a copy of the study material located within the computer
assisted learning (CAL) session.
If you have any questions regarding this document or your course, please contact
DLsupport via DLsupport@lshtm.ac.uk.
Important note: this document does not replace the CAL material found on your
module CDROM. When studying this session, please ensure you work through the
CDROM material first. This document can then be used for revision purposes to
refer back to specific sessions.
These study materials have been prepared by the London School of Hygiene & Tropical Medicine as part of
the PG Diploma/MSc Epidemiology distance learning course. This material is not licensed either for resale
or further copying.
London School of Hygiene & Tropical Medicine September 2013 v1.1
Section 1: Analysis of quantitative outcomes

Aims
To give an introduction to analysing quantitative outcomes in regression
Objectives
By the end of this session you will be able to:
model the relationship between a quantitative outcome and explanatory variable(s)
using linear regression;
interpret the parameters of the regression model and use significance tests to
assess the strength of evidence of an association;
use regression diagnostics to check the model assumptions;
use regression modelling to adjust for confounding of an explanatory variable by
another variable.
Section 2: Planning your study

In SC14 and SC15 you were introduced to assessment of correlation between
two quantitative variables, as well as linear regression of a quantitative outcome
(note: quantitative variables are also referred to as continuous variables).
The aim of this session is to recap and extend this work.
If you need to review any materials before you continue, refer to the appropriate
sessions below.
Correlation: SC14
Linear regression: SC15
2.1: Planning your study

To illustrate the concepts and method of quantitative regression we will use a
simple example and data from one study:
The In-Vitro Fertilization study
This study was set up to compare babies conceived following in-vitro fertilization to
those from the general population. The data used here refer to the records of 641
singleton births following in-vitro fertilization (IVF).
Section 3: Background - correlation

We want to examine the association between birth weight and gestational age in
our dataset. Firstly we use a scatter plot:
What statistic could we calculate to determine the strength of association
between these two variables?
Pearson's correlation coefficient, r, is a measure between -1 and +1 governed by
the direction and strength of the relationship between two quantitative variables.
If one increases as the other increases r is positive, and if one decreases as the
other increases r is negative. If there is no relationship between the two
variables then r is 0. In a scatter plot of one variable against the other, the closer
the points are to a straight line the closer the value of r will be to +1 or -1, i.e.
the stronger the linear relationship between the variables.
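As a minimal sketch of how r is calculated, the Python below computes Pearson's correlation coefficient from its usual definition (the sum of cross-products divided by the square root of the product of the two sums of squares). The function name and the data are hypothetical and are not taken from the IVF study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data: y rises exactly in line with x, then falls exactly in line.
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # points on a rising straight line
print(pearson_r(x, [10, 8, 6, 4, 2]))   # points on a falling straight line
```

Points lying exactly on a rising line give r = +1, and points lying exactly on a falling line give r = -1, matching the description above.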
3.1: Background - correlation

Below are some examples of different correlation coefficients for hypothetical
data, showing what different values of r might look like graphically.
[Figure: scatter plots of hypothetical data with r = 0.9, r = 0.7, r = 0.3 and r = -0.5]
3.2: Background - correlation

Returning to the association between birth weight and gestational age, displayed
in the graph below, Pearson's correlation coefficient was 0.74.
What does this indicate?

Weak positive association:
No. Although the correlation coefficient is positive, indicating a positive relationship,
we would generally consider such a large correlation coefficient to indicate a strong
association.
Note however, that there are no fixed rules on what defines a strong versus weak
association.
Strong negative association:
No. Although this is quite a large correlation coefficient, indicating a strong
association, it is positive, indicating a positive association.
Strong positive association:
Yes. This is quite a large correlation coefficient, indicating a strong association, and it
is positive, indicating a positive association.
3.3: Background - why do we need linear regression?
The correlation coefficient tells us the strength and direction of the linear
relationship, but it does not allow us to quantify the relationship of the two variables,
for example, by how much does one variable change, on average, with a unit change
in the other. It also does not allow us to quantify the relationship of two variables,
while adjusting for confounding with a third variable. It also assumes a linear
relationship between the two variables, which may not be the case. For such
situations we need linear regression.
3.4: Background - linear regression

Simple linear regression between two continuous variables uses a method known as
least squares to derive an equation y = a + bx, with a and b chosen to ensure the
best fit of the line to the data.
Note: y are the values of the dependent or outcome variable plotted on the y-axis
(the vertical axis) and x are the values of the independent or exposure variable
plotted on the x-axis (the horizontal axis).
For example:
Note that a is the predicted value of y when x is zero and b is the gradient of the
slope i.e. a one unit change in x is predicted to lead to a change in y of b units. In
our example, a would be the predicted birth weight for a gestational age of zero and
b would be the increase in birth weight in grams given by a one week increase in
gestational age.
3.5: Background - linear regression

The least squares method works by reducing the distances between each point and
the line. We will now illustrate this using a subset of the data of only seven points:
[Figure: birthweight (grams) against gestational age in weeks for the seven points, with the fitted line]
The vertical distance from each point to the line is called a residual.
[Figure: the same plot with each residual drawn as a vertical line from the point to the fitted line]
3.6: Background - least squares

The least squares estimates of a and b are derived by minimising the sum of the
square of each residual.
By squaring the residuals, we penalise larger residuals, so for example, two residuals
of 0 and 10 would have a sum of squares of 0+100=100, whereas two residuals of 5
and 5 would have a sum of squares of 25+25=50 and so we would prefer the two
residuals of 5 and 5.
Calculate the sum of squares of these residuals to one decimal place:
[Figure: birthweight (grams) against gestational age for the seven points, with the residuals labelled -133.6, 555.2, -666.0, -800.3, 26.6, -305.8 and 623.3]
Answer: the sum of squares of the residuals is 623.3^2 + 305.8^2 + 26.6^2 +
555.2^2 + 800.3^2 + 133.6^2 + 666.0^2 = 1892856.2
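The calculation can be checked with a few lines of Python. This is only a sketch; the variable names are ours and are not part of the course material.

```python
# Residuals (in grams) for the seven points, as labelled in the figure.
residuals = [-133.6, 555.2, -666.0, -800.3, 26.6, -305.8, 623.3]

# Sum of squared residuals: the quantity that least squares minimises.
sum_of_squares = sum(r ** 2 for r in residuals)

print(round(sum_of_squares, 1))  # 1892856.2
```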
3.7: Background - least squares

The estimates of a and b that minimise the sum of squares of the residuals are
given by:
$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$
$$a = \bar{y} - b\bar{x}$$
Note: these equations are given for information and you are not expected to
memorise them.
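For readers who find code easier to follow than formulas, here is a minimal Python sketch of the least squares estimates of a and b. The function name and the small dataset are hypothetical and chosen only for illustration; they are not the IVF data.

```python
def least_squares(x, y):
    """Return (a, b) for the line y = a + b*x minimising the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares(x, y)
print(a, b)
```

Because a is defined as the mean of y minus b times the mean of x, the fitted line always passes through the point of means.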
3.8: Background - likelihood theory

An alternative way to estimate a and b is by likelihood theory.
Firstly we rewrite the equation to explicitly include the residuals $e_i$ as follows:
$$y_i = a + b x_i + e_i$$
Then we assume these residuals $e_i$ are Normally distributed with mean zero and
variance $\sigma^2$, i.e. $N(0, \sigma^2)$. So we can now re-write the log likelihood for the $e_i$ using:
$$e_i = y_i - a - b x_i$$
If we now maximise the log likelihood, which is written in terms of the data $y_i$ and $x_i$
and the parameters a and b, we obtain the maximum likelihood estimates of a and b.
These turn out to be exactly the same as the least squares estimates i.e.
$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$
$$a = \bar{y} - b\bar{x}$$
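To make the link with least squares explicit, the log likelihood under the Normal assumption can be written out. This is the standard result, given here as a sketch; n, the number of observations, is a symbol we introduce and is not used in the text above.

```latex
\log L(a, b, \sigma^2)
  = -\frac{n}{2}\log\left(2\pi\sigma^2\right)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - a - b x_i\right)^2
```

For any fixed value of the variance, only the final term depends on a and b, so maximising the log likelihood over a and b is the same as minimising the sum of squared residuals.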
3.9: Background - interpreting output

Returning to our original example, we fit a linear regression line to the data.
[Figure: birthweight (grams) against gestational age in weeks, with the fitted regression line]
In this example, a will be the predicted value of the birth weight for a baby with a
gestational age of zero weeks. This does not make much sense and so we first centre
the gestational age around a central point, in this case the mean gestational age in
our sample, which was 38.7 weeks, i.e. we subtract 38.7 from each gestational age in
our dataset. This is called mean-centring. Now a represents the predicted birth
weight in grams at the average gestational age in our sample of 38.7 weeks.

If we fit a linear regression line to the data (with bweight representing the birth
weight in grams and mgest representing the mean-centred gestational age in
weeks) we get the following output.
bweight    Coef.       Std. Err.    t         P>|t|    [95% Conf. Interval]
mgest      206.6412    7.484572     27.61     0.000    191.9439    221.3386
_cons      3129.137    17.42493     179.58    0.000    3094.92     3163.354
How would you interpret the constant coefficient of 3129.137?

This is the estimate of a and so it is the predicted birth weight when the
mean-centred gestational age is zero, i.e. the predicted birth weight is 3129.137 grams when the
gestational age is 38.7 weeks.
Note: the mean birth weight in our dataset is 3129.137 grams, i.e. the same as a, because the
regression line will always go through the point $(\bar{x}, \bar{y})$.
How would you interpret the coefficient of mgest (the mean-centred gestational
age)?
For every increase in gestational age of one week, the predicted birth weight
increases by 206.6 grams.

The estimate of the slope, b, is the expected change in the outcome (i.e. birth
weight) for a unit increase in the exposure variable (i.e. gestational age). Here it is
estimated to be 206.6 grams. The output gives the 95% confidence interval for this
parameter to be 191.9g to 221.3g.
If there was no association between gestational age and birth weight the true value
of the parameter would be zero and the points on the scatter plot would be randomly
scattered about the mean values of birth weight. However, based on this analysis the
lower limit of the 95% confidence interval is substantially above zero indicating there
is strong evidence for an association.
We can confirm this by looking at the Wald test which compares the ratio of the
parameter estimate to its standard error with a t distribution. The larger the value of
b compared to its standard error the larger the test statistic, and the smaller the
P-value (stronger evidence of an association). The test statistic t is given as 27.6 under
the column labelled t in the output.
The null hypothesis of the test is that b is zero, or in other words that there is no
association between birth weight and gestational age. The P-value is reported as
P<0.001 under P>|t|, confirming that there is very strong evidence of an association
between birth weight and gestational age, when the relationship is modelled as
linear.
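The test statistic and confidence interval can be reproduced approximately from the coefficient and its standard error. The sketch below is our own illustration: it uses the Normal quantile (about 1.96) in place of the t distribution used by the output, an approximation we are assuming is adequate because the sample has 641 observations.

```python
from statistics import NormalDist

coef = 206.6412      # estimated slope for mgest, from the output
std_err = 7.484572   # its standard error, from the output

# Wald test statistic: the estimate divided by its standard error.
t_stat = coef / std_err

# Approximate 95% confidence interval using the Normal quantile;
# the output itself uses a t distribution, which is very similar here.
z = NormalDist().inv_cdf(0.975)
lower = coef - z * std_err
upper = coef + z * std_err

print(round(t_stat, 2))  # 27.61
print(lower, upper)
```

The limits agree with the reported interval of 191.9 to 221.3 to within a fraction of a gram; the small difference comes from using the Normal rather than the t quantile.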
3.12: Background - regression equation

We can see from the output that the best prediction of birth weight will be given
by the equation:
Birth weight = 3129.1 + 206.6*mgest
= 3129.1 + 206.6*(gestational age - 38.7)
What is the best prediction of the birth weight for a gestational age of 30 weeks
(to the nearest gram)?
Answer: the best prediction is:
= 3129.1 + 206.6*(gestational age - 38.7)
= 3129.1 + 206.6*(30 - 38.7)
= 3129.1 - 206.6*8.7
= 1331.68
= 1332 to the nearest gram
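The same prediction can be written as a small Python function. This is a sketch using the rounded coefficients from the equation above; the function and constant names are hypothetical.

```python
MEAN_GEST = 38.7    # mean gestational age in the sample (weeks)
INTERCEPT = 3129.1  # predicted birth weight (grams) at the mean gestational age
SLOPE = 206.6       # predicted change in birth weight (grams) per extra week

def predict_birth_weight(gestational_age):
    """Predicted birth weight in grams for a gestational age in weeks."""
    return INTERCEPT + SLOPE * (gestational_age - MEAN_GEST)

print(round(predict_birth_weight(30)))  # 1332
```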
Section 4: Checking model assumptions

The assumptions of a linear regression model are
1. The residuals come from a Normal distribution and are independent from
each other (i.e. no correlation in the residuals between observations).
2. The variance of the residuals is constant across y and x.
3. The correct relationship between y and x has been modelled.
We can check whether these assumptions seem reasonable using plots of the
residuals. This is most easily done using standardised residuals (i.e. with mean=0 and
standard deviation=1). These are obtained by dividing each residual by the
standard deviation of all the residuals.
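A minimal Python sketch of standardising residuals follows. The data are hypothetical, and we assume the sample standard deviation (n - 1 denominator) is used; statistical packages may use a slightly different scaling.

```python
from statistics import mean, stdev

# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [2.3, 3.8, 6.4, 7.9, 10.2, 11.6]

# Least squares fit of y = a + b*x.
x_bar, y_bar = mean(x), mean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

# Residuals: observed value minus fitted value.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Standardise by dividing each residual by the SD of all the residuals.
sd = stdev(residuals)
standardised = [r / sd for r in residuals]
print(standardised)
```

Residuals from a least squares line with an intercept already have mean zero, so dividing by their standard deviation is enough to give mean 0 and standard deviation 1.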
4.1: Normality assumption

The Normality assumption can be checked by producing a histogram of the
residuals.
Note: because these are standardised residuals they should come from a Standard
Normal distribution, e.g. we expect about 95% of values to lie between -2 and +2.
[Figure: histogram of the standardised residuals (density against standardised residual)]
The histogram looks symmetrical and reasonably bell-shaped, therefore the
assumption that the residuals come from a Normal distribution is reasonable. The
larger the sample size the less the shape of the distribution of the residuals will
affect the model estimation.
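The statement that about 95% of standardised residuals should lie between -2 and +2 can be checked against the Standard Normal distribution. The short sketch below uses Python's standard library; the variable names are ours.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability that a Standard Normal value lies between -2 and +2.
prob_within_2 = standard_normal.cdf(2) - standard_normal.cdf(-2)
print(round(prob_within_2, 4))  # 0.9545
```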
4.2: Constant variance

The second assumption of constant variance of the residuals across y and x can be
checked from a scatter plot of the residuals versus the predicted values $\hat{y}$ (also
called fitted values), i.e. for each point on the graph below, the fitted value is
plotted against the residual.
[Figure: birthweight (grams) against gestational age with the fitted line, marking the predicted (or fitted) value and the residual for one point]
The residuals should be randomly scattered about zero with constant variance
over the predicted values.
[Figure: standardised residuals plotted against fitted values]
There is no obvious relationship between the residuals and $\hat{y}$, and no evidence of a
change in the variance of the residuals with $\hat{y}$.
4.3: Linear relationship

We started our analysis with a scatter plot of the data. The importance of this is
highlighted by the following scatter plots of hypothetical data. Note: the correlation
coefficient is the same in each example.
Linear relationship:
This scatter plot shows a roughly linear relationship between y and x and so linear
regression is appropriate.
Remote point:
There is a remote point, an observation which is far from the range of the other
data.
Outlier:
There is an outlier, a point which is not well fitted by the model. Remote points
and outliers can change the regression parameters substantially, and in this case
they are known as influential points. In order to identify influential points the
model can be re-fitted with one observation left out each time, or they can be
spotted by eye in a scatter plot and then the regression parameters estimated with
and without the observation in the model. The first step once an influential point
has been identified would be to check whether there has been a data-entry error, if
this check is possible. If there is no data-entry error, the observation should not
be removed from the data unless there is very good reason to do so. One option is
to report the results of the analysis with the observation included and with it
removed in order to demonstrate how sensitive the results are to this observation.
Non-linear:
In this example, the values of y seem to initially decrease with increasing values of
x, but then start to increase with even greater values of x. Hence, a linear regression
will not give a good fit to such data and would be inappropriate.
Two clusters:
There are two clusters of points, with what appears to be random scatter in each
cluster. This may suggest some kind of threshold in x, below which y takes one
average value and above which y takes another average value.
Two lines:
There appear to be two different lines shown here. It may well be that the value of y
depends on x and on another binary variable.
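The idea of re-fitting the model with one observation left out each time can be sketched in Python as below. The data are hypothetical: seven points lying on a straight line plus one remote point that does not follow it. The function and variable names are ours.

```python
def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

# Hypothetical data: the last observation is remote and off the line.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2, 4, 6, 8, 10, 12, 14, 10]

full_slope = slope(x, y)

# Re-fit with each observation left out in turn and record the change in slope.
changes = []
for i in range(len(x)):
    x_loo = x[:i] + x[i + 1:]
    y_loo = y[:i] + y[i + 1:]
    changes.append(slope(x_loo, y_loo) - full_slope)

# The observation whose removal changes the slope most is the most influential.
most_influential = max(range(len(x)), key=lambda i: abs(changes[i]))
print(most_influential, full_slope, changes[most_influential])
```

Here removing the remote point changes the slope far more than removing any other observation, which is what marks it as influential.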
Section 5: Categorical variables

The model we have fitted assumes that birth weight is linearly associated with
gestational age within the range of gestational age in the data. Other models we
have fitted in this course have used categories or grouped data to examine the
relationship between an outcome variable, e.g. log odds and an explanatory variable.
We can do this with a quantitative outcome too.
5.1: Categorical variables

For our birth weight data we can group the data according to gestational age.
Gestational age category    Freq.    Mean of birth weights    Standard deviation of birth weights
<38 wks                     157      2,445.567                705.815
38-39 wks                   130      3,186.023                408.357
39-40 wks                   166      3,365.193                456.661
>40 wks                     188      3,452.223                441.366
The means of birth weight for the four groups increase sequentially from around 2.5
kg to 3.5 kg.

We can fit a simple linear regression model of birth weight with three indicators for
the highest three groups of gestational age and estimate the mean differences
between each of these three groups and the first (i.e. the <38 week category).
bweight    Coef.       Std. Err.    t        P>|t|    [95% Conf. Interval]
gestcat
  2        740.4562    61.27115     12.08    0.000    620.1383    860.7741
  3        919.6259    57.522       15.99    0.000    806.6702    1032.582
  4        1006.657    55.86212     18.02    0.000    896.9604    1116.353
_cons      2445.567    41.23697     59.31    0.000    2364.59     2526.544

The intercept, estimated as 2445.567, is the predicted mean value of birth weight in
the baseline category of gestational age (i.e. the <38 week category).
The value 740.4562 for gestcat 2 is the predicted increase in birth weight among
babies in the next gestational age category (38-39 weeks) compared to the baseline
category (<38 weeks).
Similarly 919.6259 and 1006.657 are the predicted increases for the next two
categories (39-40 weeks and >40 weeks), respectively, each compared to the
baseline category (<38 weeks).
The estimated increases can be added to the estimated mean in the baseline
category to get the predicted mean value of birth weight in the four categories.
The resulting equations for the predicted values are
$\hat{y}$ = 2445.567,
$\hat{y}$ = 2445.567 + 740.4562 = 3186.023,
$\hat{y}$ = 2445.567 + 919.6259 = 3365.193 and
$\hat{y}$ = 2445.567 + 1006.657 = 3452.223 respectively for babies in the first, second,
third and fourth categories.
These values correspond exactly to those shown in the column headed Mean of birth
weights in the previous table.
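The addition of each coefficient to the intercept can be laid out in Python as follows. This is a sketch using the coefficients from the output; the dictionary layout and names are our own.

```python
intercept = 2445.567  # predicted mean birth weight (grams) in the baseline category

# Estimated difference from the baseline (<38 wks) for each category.
increments = {
    "<38 wks": 0.0,
    "38-39 wks": 740.4562,
    "39-40 wks": 919.6259,
    ">40 wks": 1006.657,
}

# Predicted mean in each category = intercept + that category's coefficient.
predicted_means = {cat: intercept + inc for cat, inc in increments.items()}

for category, value in predicted_means.items():
    print(category, round(value, 2))
```

The results reproduce the observed category means to within rounding of the reported coefficients.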

The predicted values for this regression model can be seen on the scatter plot below.
[Figure: birthweight (grams) against gestational age, showing the predicted values from the categorical model; a second version of the figure adds the original fitted straight line]
We can see that the categorical variable gives a reasonable fit to the data in the
middle values of gestational age, but does particularly poorly at younger gestational
ages.
Section 6: Quadratic relationships

The first model we fitted assumed a linear relationship between birth weight and
gestational age, while the model using gestational age categories is more flexible
about the shape of the relationship between the two variables, though clearly it
makes some strong assumptions about what that relationship is, e.g. it assumes that
all babies born before 38 weeks have the same mean birth weight. Another option is
to model a quadratic relationship between birth weight and gestational age, which
would allow some departure from a linear relationship.
6.1: Quadratic relationships

Consider the (hypothetical) example below, in which it is possible that there is some
curvature.
[Figure: three panels showing the same hypothetical data with different fitted relationships]
A straight line ignores any curvature there may be between y and x.
Categorising x allows a non-linear relationship between y and x.
Fitting a quadratic allows some departure from a linear relationship.

To fit a quadratic relationship the model becomes
y = a + bx + cx^2
where a is still the estimated birth weight of a baby when x=0 (i.e. with mean
gestational age) but now the shape of the curve is described by two coefficients, b and
c. In this case, if b and c were both positive this would mean that the baby's growth
accelerates over the period of gestation that we are examining. If b is positive and c
is negative then the baby's growth rate would be slowing down over the period of
gestation that we are examining. Note that interpretation of the two parameter
estimates, b and c, is less straightforward than interpretation of the parameter
estimates when x was categorical.

We can fit this model by first calculating a new variable that is the square of the
mean-centred gestational age e.g. mgest2=mgest^2 and then fitting a linear
regression model on both the mean-centred gestational age and its square:
bweight    Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]
mgest      193.2323     10.73898     17.99     0.000    172.1443     214.3203
mgest2     -2.796836    1.608682     -1.74     0.083    -5.955789    .3621158
_cons      3144.296     19.46009     161.58    0.000    3106.083     3182.51
So the regression equation is now:

Birth weight = 3144.3 + 193.2*mgest - 2.8*mgest2
= 3144.3 + 193.2*(age - 38.7) - 2.8*(age - 38.7)^2
The graph below shows both the fitted linear and quadratic relationships. We can see
that between 32 and 42 weeks gestation the lines are virtually identical, and it is
only for babies born before 32 weeks that there appears to be any curvature. There
are very few babies in the dataset born before about 32 weeks. It is unsurprising
that there is little statistical evidence of departure from a linear relationship (P=0.08
for mgest2 in the output).
[Figure: birthweight (grams) against gestational age in weeks, showing the fitted linear and quadratic relationships]
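To see numerically where the two fitted relationships diverge, the sketch below evaluates both equations at a few gestational ages. It uses the rounded coefficients from the two regression equations in the text; the function names and the chosen ages are ours.

```python
MEAN_GEST = 38.7  # mean gestational age in the sample (weeks)

def linear_prediction(age):
    """Predicted birth weight (grams) from the linear model."""
    return 3129.1 + 206.6 * (age - MEAN_GEST)

def quadratic_prediction(age):
    """Predicted birth weight (grams) from the quadratic model."""
    m = age - MEAN_GEST
    return 3144.3 + 193.2 * m - 2.8 * m ** 2

# Compare the two predictions at several gestational ages.
for age in (28, 32, 36, 40):
    print(age, round(linear_prediction(age)), round(quadratic_prediction(age)))
```

The squared term is multiplied by a small coefficient, so the gap between the two predictions is largest at the youngest gestational ages, far from the mean of 38.7 weeks.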
Section 7: Multivariable regression

Just as we have done with Poisson and logistic regression models, we can include
several covariates in a linear regression model. For example, we can fit a model of
birth weight to gestational age and gender (0=male, 1=female):
bweight    Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]
mgest      206.4446     7.363321     28.04     0.000    191.9853     220.9039
gender     -161.7075    34.29034     -4.72     0.000    -229.0431    -94.37194
_cons      3208.604     24.03779     133.48    0.000    3161.401     3255.806
7.1: Multivariable regression

The value for mgest of 206.4 gives us the estimated increase in birth weight for each
additional week of gestation, after adjusting for gender. The value -161.7 shows
that females (coded 1 in the data) are estimated to be 161.7 g lighter than males
(coded 0 in the data), after adjusting for week of gestation. Note that gender is not
a confounder between gestational age and birth weight here, because adjusting for it
barely changes the mgest estimate (from 206.6412 to 206.4446). However, we can
see from the p-value and confidence interval that gender is independently associated
with birth weight.
What is the regression equation fitted here?
Birth weight = 3208.6 + 206.4*mgest - 161.7*gender
= 3208.6 + 206.4*(gestational age - 38.7) - 161.7*gender
where gender takes the value 0 for males and 1 for females

We can plot two separate lines of predicted birth weight by gestational age for
males and females on the scatter plot.
[Figure: birthweight (grams) against gestational age with two parallel fitted lines, one for males and one for females; the vertical difference between the lines is 161.7g]
This makes clear that we are fitting two parallel lines to the data, with equal
gradients but different intercepts.
What are the regression equations for males and females separately?
Males
Birth weight = 3208.6 + 206.4*(gestational age - 38.7) - 161.7*0
= 3208.6 + 206.4*(gestational age - 38.7)
Females
Birth weight = 3208.6 + 206.4*(gestational age - 38.7) - 161.7*1
= 3208.6 - 161.7 + 206.4*(gestational age - 38.7)
= 3046.9 + 206.4*(gestational age - 38.7)
So we can see that the gradients are the same, but the intercepts are different.
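The two parallel lines can be checked with a short Python sketch based on the rounded coefficients in the equation above. The function name is hypothetical.

```python
MEAN_GEST = 38.7  # mean gestational age in the sample (weeks)

def predict_birth_weight(gestational_age, gender):
    """Predicted birth weight (grams); gender is 0 for males, 1 for females."""
    return 3208.6 + 206.4 * (gestational_age - MEAN_GEST) - 161.7 * gender

# The male-female difference is the same at every gestational age.
for age in (30, 35, 40):
    males = predict_birth_weight(age, 0)
    females = predict_birth_weight(age, 1)
    print(age, round(males, 1), round(females, 1), round(males - females, 1))
```

The constant difference of 161.7 grams at every age is what it means for the lines to have equal gradients but different intercepts.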

When there is more than one covariate in the model, the goodness-of-fit checking
should also include plots of the residuals against each covariate. With one
covariate this is not necessary, as $\hat{y}$ is determined by x.

Note that the size of the effect estimate depends on the units of the covariate. For
example, here gestational age is measured in weeks, but if it had been measured in
days the parameter for gestational age would be 7 times smaller (since it is the
estimated increase per day not per week). It is useful to consider the clinical
impact of the covariate on the outcome in the wider population. For example, a
baby born at 32 weeks is estimated to be 1.65kg lighter than a baby born at 40
weeks (206.44 x -8 = -1652g).
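Both points in this paragraph are simple arithmetic, sketched below in Python. The variable names are ours; the coefficient values are those reported in the output and text.

```python
coef_per_week = 206.4446  # estimated increase in birth weight (grams) per week

# Measuring gestational age in days makes the coefficient 7 times smaller.
coef_per_day = coef_per_week / 7

# Estimated difference for a baby born at 32 weeks compared to 40 weeks,
# using the coefficient rounded to 206.44 as in the text.
difference_32_vs_40 = 206.44 * (32 - 40)

print(round(coef_per_day, 2))
print(round(difference_32_vs_40))  # -1652
```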
Section 8: Analysis of variance for goodness of fit

When testing for the significance of an association between the quantitative
outcome and a covariate that has a single parameter, we can use the p-value for
the parameter in the output table, as we did for the covariate gender above.
However, if we want to test the impact of several parameters simultaneously we
need to use an F test from the analysis of variance table. The F test in linear
regression is analogous to the likelihood ratio test in logistic or Poisson regression.
8.1: Analysis of variance for goodness of fit

A measure that allows us to evaluate how well a particular linear regression model
fits the data is the sum of squares of the differences between the observed values
of the outcome variable, and the values predicted by the model, i.e. the residual
sum of squares,
$$\sum_i (y_i - \hat{y}_i)^2$$
This is obtained from an analysis of variance table. For example, the analysis of
variance table produced when fitting the model with gestational age categories
was:
Source      SS           df     MS
Model       102656030    3      34218676.7
Residual    170064092    637    266976.596
Total       272720122    640    426125.19
Here the residual sum of squares (SS) is 170064092.

If the residual sum of squares is zero the regression line would fit the data
perfectly, that is every observed point would lie on the fitted line. By contrast
larger values would indicate worse fits, since the deviations of the observed points
from the regression line will be larger. Two possible factors contribute to a high
residual sum of squares; either there is a lot of variation in the data, i.e. $\sigma^2$ is large,
or the model does not explain much of the variation observed.

We can obtain an estimate of $\sigma^2$ as:
$\sigma^2$ = residual sum of squares/df
where df is the degrees of freedom.
This can be calculated as:
170064092/637 = 266976.596
This value is known as the residual mean square (MS shown in the last column in
the table), and its square root, 516.7, is called the Root MSE (root mean square error).
The larger the Model MS compared to the Residual MS, the better the model fits the
data. Under the Null hypothesis of no effect of gestational age, the Model MS and
the Residual MS are two independent estimates of $\sigma^2$ and their ratio is expected to
be 1. We can then test the significance of the current model, by dividing these two
terms.
F = Model MS / Residual MS
This is known as an F test. Under a null hypothesis of no effect of gestational age
on birth weight, this statistic would be expected to follow the F distribution with the
appropriate degrees of freedom. This is written as F(3, 637), as we have fitted
three parameters for the age categories and we are then left with 641 observations
minus these three parameters minus the constant, leaving 637 for the residual SS.
Calculate the F statistic for the Null hypothesis of no relationship between birth
weight and gestational age to one decimal place.
Answer: F = Model MS / Residual MS = 34218676.7 / 266976.596 = 128.2

If there is a strong relationship between the outcome and the exposure variables
used in the model, then the Model MS will be much larger than the Residual MS,
and F will be (substantially) greater than 1. If there is no relationship then, on
average, the Model MS will be equal to the Residual MS and F will be 1. In this
example the value of F is much larger than 1, so there is evidence that birth weight
changes with gestational age. The p-value for the null hypothesis that there is no
relationship, based on F = 128.2 is p<0.0001.

If we include gender in the model with gestational age as categorical, we get the
output below.
Source      SS           df     MS
Model       106953800    4      26738450
Residual    165766322    636    260638.871
Total       272720122    640    426125.19

bweight    Coef.        Std. Err.    t        P>|t|    [95% Conf. Interval]
gestcat
  2        736.1429     60.54885     12.16    0.000    617.243      855.0427
  3        909.2086     56.89301     15.98    0.000    797.4877     1020.929
  4        1012.249     55.21226     18.33    0.000    903.8286     1120.669
gender     -164.2448    40.44732     -4.06    0.000    -243.6713    -84.81839
_cons      2528.212     45.54496     55.51    0.000    2438.776     2617.649
The analysis of variance table for this model is at the top. For this model, our null
hypothesis is that there is no effect of gestational age (considered in categories) or
gender. A test of whether the true effects of both explanatory variables are zero is
made by examining F(4, 636) = 102.59 and referring this to the F distribution. The
p-value for this hypothesis is again <0.001, indicating strong evidence that the two
exposure variables, gestational age category and gender, taken together help
to explain the variability in the data.

Adjusting for gender has made very little difference to the parameter estimates for
gestational age so gender is not a confounder here. But for demonstration, in
order to assess whether gestational age (as categorical) is associated with birth
weight after adjusting for gender we need to do a partial F test.
The F statistic is F(3, 636) = 131.06, which gives a p-value of <0.001, so there is
very strong evidence of an association between gestational age category and birth
weight independently of gender.
This partial F test is analogous to the likelihood ratio test in logistic or Poisson
regression. With a likelihood ratio test you would need to fit one model with
gestational age and gender and compare it to a model with just gender, but you
don't need to do that with linear regression.
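For reference, one standard way of writing the partial F statistic is in terms of the residual sums of squares of two nested models. The notation is ours and is not taken from the session: RSS_0 is the residual sum of squares of the model without the variables being tested (here, gender only), RSS_1 is that of the full model, q is the number of parameters being tested (here 3) and df_1 is the residual degrees of freedom of the full model (here 636).

```latex
F(q, \mathrm{df}_1)
  = \frac{\left(\mathrm{RSS}_0 - \mathrm{RSS}_1\right) / q}
         {\mathrm{RSS}_1 / \mathrm{df}_1}
```

In linear regression, statistical software can produce an equivalent test from the full model alone, which is why the second model does not have to be fitted separately.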
Section 9: Summary
This is the end of AS13.
The main points of this session are summarised below.
Fitting regression equation
Fitting a linear regression equation to data from two quantitative variables uses
one of two approaches, both of which give the same result: least squares
minimises the sum of squares of the residuals, while maximum likelihood assumes
the residuals follow a Normal distribution with mean zero and variance $\sigma^2$.
Checking assumptions
Linearity can be checked by a scatter-plot of the outcome and explanatory
variables, Normality of residuals can be checked with a histogram of standardised
residuals and constancy of the variance of the residuals can be checked by
examining a scatter-plot of the residuals plotted against the fitted values.
Allowing for a non-linear relationship
Two common ways of allowing for a non-linear relationship are to categorise
the explanatory variable, or to allow for a quadratic relationship by
including both the explanatory variable and its square in the regression.
Multivariable linear regression

As for logistic, Poisson and Cox regression, multiple variables can be included in
the regression equation to produce estimates that are adjusted for the other
variables. The partial F test is used instead of the likelihood ratio test to examine
the evidence for some variables adjusted for other variable(s).
