STATS 330: Lecture 18

Models with many continuous and categorical explanatory variables

28.08.2014
Page 1/23
Today's agenda

- Models with many variables: general principles and strategies
  - General principles
  - Lack of data
  - Model selection?
- An example: low birthweight data

Page 2/23
General Principles
- If we are modeling to gain insight into the relationship of (some of) the explanatory variables to the response, we must include the important variables, and any likely confounders.
- We should allow the important categorical variables to interact with each other and the continuous variables, if we have enough data.
- It is vital to model the important variables well.
- If we have enough data, we could start with the full model.

Page 3/23
The full model

- For a problem with both continuous and categorical explanatory variables, the most general model is to fit separate regressions for each possible combination of the factor levels (like the non-parallel model in the lathe example).
- That is, we allow the categorical variables to interact with each other and the continuous variables.

Page 4/23
Illustration

- Two factors A and B, two continuous explanatory variables X and Z. The general model is
    y ~ A*B*X + A*B*Z
- Suppose A has a levels and B has b levels, so there are a × b factor level combinations.
- Each combination has a separate regression with 3 parameters:
  - Constant term
  - Coefficient of X
  - Coefficient of Z
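The counting argument above can be sketched in a few lines. Python is used here purely for illustration, and the level counts (a = 3, b = 2) and level labels are made-up values, not taken from the lecture.

```python
from itertools import product

# Hypothetical level labels: factor A with a = 3 levels, factor B with b = 2 levels.
levels_A = ["a1", "a2", "a3"]
levels_B = ["b1", "b2"]

# Each factor level combination gets its own regression on X and Z.
combinations = list(product(levels_A, levels_B))

# Parameters per regression: constant term, coefficient of X, coefficient of Z.
params_per_regression = 3

total_params = len(combinations) * params_per_regression
print(len(combinations), "combinations,", total_params, "parameters")
```

With these assumed level counts the general model has 3 × a × b parameters, so the parameter count grows quickly as factors are added.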

Page 5/23
Illustration (cont)

- There are a × b constant terms; we can arrange them in an a × b table.
- Can split the table up into main effects and interactions as in 2-way ANOVA.
- Listed in output as (Intercept), A, B and A:B

Page 6/23
Illustration (cont)
- There are a × b X-coefficients; we can also arrange them in an a × b table, with a rows and b columns.
- Can split the table up into main effects and interactions as we did with means in 2-way ANOVA.
- Listed in output as X, A:X, B:X and A:B:X
- If all the A:X, B:X, A:B:X interactions are zero, the coefficient of X is the same for all the a × b regressions.
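A minimal sketch of how the X-coefficient for one factor level combination is rebuilt from the listed terms. It assumes baseline-level (treatment) coding, in which terms involving the first level of a factor are zero; all coefficient values and level labels are hypothetical.

```python
# Hypothetical coefficient values for a = 3 levels of A and b = 2 levels of B.
# Assumption: treatment coding, so the baseline levels (a1, b1) have no entry
# and contribute zero.
x_main = 2.0                                      # X
a_x = {"a2": 0.5, "a3": -1.0}                     # A:X
b_x = {"b2": 0.25}                                # B:X
ab_x = {("a2", "b2"): 0.1, ("a3", "b2"): -0.3}    # A:B:X


def slope(a_level, b_level):
    """Coefficient of X in the regression for one factor level combination."""
    return (x_main
            + a_x.get(a_level, 0.0)
            + b_x.get(b_level, 0.0)
            + ab_x.get((a_level, b_level), 0.0))


# Print the a x b table of slopes, one row per level of A.
for a_level in ["a1", "a2", "a3"]:
    print(a_level, [slope(a_level, b_level) for b_level in ["b1", "b2"]])
```

If the three interaction dictionaries are emptied, every cell returns x_main, which is the common-slope case described in the last bullet.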

Page 7/23
Illustration (cont)

- There are a × b Z-coefficients; we can arrange them in an a × b table.
- Can split the table up into main effects and interactions as in 2-way ANOVA.
- Listed in output as Z, A:Z, B:Z and A:B:Z

Page 8/23
Caution!!

- Sometimes we don't have enough data to fit a separate regression to each factor level combination (need at least one more data point than the number of continuous variables per combination).
- In this case we drop out the higher level interactions, forcing coefficients to have common values.
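The data requirement above can be checked cell by cell. The sketch below uses invented cell counts for a hypothetical 3 × 2 layout; only the rule itself (at least one more data point than continuous variables) comes from the slide.

```python
def can_fit_separate_regression(n_points, n_continuous):
    """True if a cell has at least one more data point than continuous variables."""
    return n_points >= n_continuous + 1


# Hypothetical numbers of observations in six factor level combinations.
cell_counts = {"a1:b1": 12, "a1:b2": 2, "a2:b1": 5,
               "a2:b2": 0, "a3:b1": 3, "a3:b2": 1}

n_continuous = 2  # two continuous variables, X and Z

# Collect the cells that cannot support their own regression.
too_sparse = [cell for cell, n in cell_counts.items()
              if not can_fit_separate_regression(n, n_continuous)]
print("Cells too sparse for a separate regression:", too_sparse)
```

With two continuous variables the threshold is 3 points per combination.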

Page 9/23
Model selection: general advice

- With lots of variables, the number of possible models is large.
- Can use variable selection techniques with caution, but don't omit potential confounders.
- If including a factor, include all dummy variables for that factor.
- If model building for prediction, be guided by estimates of prediction error.
- Model the important variables well.

Page 10/23
Example: the low birthweight data
- These data were collected at Baystate Medical Center, Springfield, Mass. during 1986, as part of a study to identify risk factors for low-birthweight babies.
- There are 189 births in the study.
- The response variable was birthweight, and data were collected on a variety of continuous and categorical explanatory variables relating to the mother's health.
- The data set is in the R330 package (births.df). To keep things simple we use only 4 of the factors in the data set.
Page 11/23
The Variables
age:   mother's age in years, continuous
lwt:   mother's weight in pounds, continuous
race:  mother's race, a factor with 1 = white, 2 = black, 3 = other
smoke: smoking during pregnancy, a factor with 1 = smoked, 0 = didn't smoke
ht:    history of hypertension, a factor with 0 = No, 1 = Yes
ui:    presence of uterine irritability, a factor with 0 = No, 1 = Yes
bwt:   birth weight in grams, continuous, the response
Page 12/23
Preliminary plots
[Figure: two scatterplots of baby's weight (gms), one against mother's weight (lbs) and one against mother's age (yrs)]


Page 13/23
Preliminary plots (cont)
[Figure: four plots of baby's weight (gms), by mother's race (White, Black, Other), mother's hypertension (No, Yes), mother's smoking (No, Yes) and mother's uterine irritation (No, Yes)]

Page 14/23
Preliminary conclusions

- Some relationships between bwt and the covariates
  - Slight relationship with lwt (correlation is about 0.18)
  - Some effect of race, smoking, hypertension, uterine irritability
- On to model fitting!

Page 15/23
Model fitting
- There are 24 treatment combinations of race, smoke, ui, ht, and 189 observations.
- The data are not evenly spread over the 24 combinations. Eight combinations have no data, and only 13/24 have 3 or more points. Hence we can't hope to estimate a complicated model. Even the coefficients we can estimate will be very inaccurate, and there is no hope of finding many significant effects.
- We will consider fitting four simpler models, as well as the all-interactions model.

Page 16/23
The models

Model 1: age*race*ui*smoke*ht + lwt*race*ui*smoke*ht
         (full model)
Model 2: age*race + age*ui + age*smoke + age*ht + lwt*race + lwt*ui + lwt*smoke + lwt*ht
         (2-factor interactions only)
Model 3: age*smoke + age*ht + lwt + race + ui + smoke
         (retain significant interactions from Model 2)
Model 4: age + lwt + race + ui + smoke + ht
         (no interactions)
Model 5: ui + race + smoke + ht + lwt + ht:lwt + race:smoke
         (chosen by forward selection)

Page 17/23
Summary of results
          R2      Adj R2   Number of parameters   AIC
Model 1   0.428   0.263    43                     2471.116
Model 2   0.314   0.246    18                     2455.462
Model 3   0.276   0.239    10                     2449.704
Model 4   0.241   0.212     8                     2454.459
Model 5   0.269   0.233    10                     2451.339

In the full model (Model 1) we can estimate only 43 of the 72 (= 3 x 24) parameters. We can estimate all the parameters in Model 2 but the AIC is high. Hence we focus on the last three models. They all fit reasonably well with no overly influential points. In the next few slides we attempt to interpret the parameters.
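The arithmetic behind this summary can be reproduced as a rough check. The figures are copied from the table above; the adjusted R2 formula is the standard one, and using n = 189 for every model is an assumption (it presumes no observations were dropped).

```python
# Factor levels in the example: race has 3 levels; smoke, ui and ht have 2 each.
n_combinations = 3 * 2 * 2 * 2
# Each combination has an intercept, an age slope and an lwt slope.
n_full_params = 3 * n_combinations

n_obs = 189  # assumption: all 189 births are used in every fit

# (R2, number of parameters, AIC), copied from the summary table.
models = {
    "Model 1": (0.428, 43, 2471.116),
    "Model 2": (0.314, 18, 2455.462),
    "Model 3": (0.276, 10, 2449.704),
    "Model 4": (0.241, 8, 2454.459),
    "Model 5": (0.269, 10, 2451.339),
}


def adjusted_r2(r2, n, p):
    """Adjusted R2 for n observations and p estimated parameters (incl. intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p)


print("Parameters in the full model:", n_full_params)
for name, (r2, p, aic) in models.items():
    print(name, round(adjusted_r2(r2, n_obs, p), 3), aic)

# Smaller AIC is better.
best = min(models, key=lambda name: models[name][2])
print("Lowest AIC:", best)
```

Because the R2 inputs are rounded to three decimals, the recomputed adjusted R2 values agree with the table only to about 0.001.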
Page 18/23
Model 3
This model includes some significant effects from Model 2.

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  2479.960     351.600    7.053  3.68e-11 ***
age            13.874      11.699    1.186    0.2372
smoke         550.625     460.840    1.195    0.2337
lwt             4.240       1.678    2.527    0.0124 *
race2        -379.974     151.511   -2.508    0.0130 *
race3        -295.664     113.916   -2.595    0.0102 *
ui           -565.457     134.252   -4.212  4.01e-05 ***
ht           1303.532    1057.628    1.233    0.2194
age:smoke     -38.369      19.185   -2.000    0.0470 *
age:ht        -83.076      45.369   -1.831    0.0687 .

The terms smoke, race2, race3, ui, ht affect the intercept of the age-lwt regressions; smoke and ht also affect the age slope. The variables smoke and ht raise the intercept but lower the slope. The other factors (race and ui) lower the birthweight as we move away from the baselines.
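As a sketch of what "raise the intercept but lower the slope" means numerically, the intercept shift and age slope for each smoke/ht combination can be computed from the Model 3 estimates. The coefficient values are copied from the output; the helper names are illustrative only.

```python
# Estimates copied from the Model 3 output.
coef = {"age": 13.874, "smoke": 550.625, "ht": 1303.532,
        "age:smoke": -38.369, "age:ht": -83.076}


def intercept_shift(smoke, ht):
    """Change in intercept relative to smoke=0, ht=0."""
    return coef["smoke"] * smoke + coef["ht"] * ht


def age_slope(smoke, ht):
    """Coefficient of age for a given smoke/ht combination."""
    return coef["age"] + coef["age:smoke"] * smoke + coef["age:ht"] * ht


for smoke in (0, 1):
    for ht in (0, 1):
        print(f"smoke={smoke}, ht={ht}: "
              f"intercept shift {intercept_shift(smoke, ht):9.3f}, "
              f"age slope {age_slope(smoke, ht):9.3f}")

# Age at which the smoke=1 line meets the smoke=0 line (with ht=0).
crossing_age_smoke = -coef["smoke"] / coef["age:smoke"]
print("smoke lines cross at age", round(crossing_age_smoke, 1))
```

The age slope is positive only for the smoke=0, ht=0 group. On these estimates the smoke=1 line drops below the smoke=0 line at about age 14.4, so the raised intercept does not translate into higher fitted birthweights for smokers at older ages. The individual estimates have large standard errors, so this reading is tentative.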
Page 19/23
Model 3
Let's fix all variables except age, smoke and ht and look at the effect of smoke and ht on the relationship between age and birthweight.
[Figure: centered birthweight against age (15 to 45), with one line for each of smoke=0,ht=0; smoke=1,ht=0; smoke=0,ht=1; smoke=1,ht=1]
Page 20/23
Model 4
This model forces a common slope for age.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2935.606 310.712 9.448 < 2e-16 ***
age -4.758 9.335 -0.510 0.610875
lwt 4.396 1.707 2.576 0.010786 *
race2 -491.675 149.159 -3.296 0.001179 **
race3 -358.616 113.834 -3.150 0.001908 **
ui -527.505 135.061 -3.906 0.000133 ***
smoke -359.367 104.007 -3.455 0.000685 ***
ht -590.037 200.251 -2.946 0.003637 **

The factors all affect the intercept of the age-lwt regressions, lowering the birthweight as we move from No to Yes and from White to Black or Other. Increasing lwt seems associated with increasing birthweight. The effect of age seems very slight, if anything.
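Because Model 4 has no interactions, each factor simply shifts the fitted value by its coefficient. A minimal sketch, using the estimates from the output above and a made-up mother (age 25, weight 130 lbs) chosen only for illustration:

```python
# Estimates copied from the Model 4 output.
coef = {"(Intercept)": 2935.606, "age": -4.758, "lwt": 4.396,
        "race2": -491.675, "race3": -358.616,
        "ui": -527.505, "smoke": -359.367, "ht": -590.037}


def fitted_bwt(age, lwt, race=1, ui=0, smoke=0, ht=0):
    """Fitted birthweight in grams under Model 4 (race coded 1, 2 or 3)."""
    value = coef["(Intercept)"] + coef["age"] * age + coef["lwt"] * lwt
    if race == 2:
        value += coef["race2"]
    elif race == 3:
        value += coef["race3"]
    value += coef["ui"] * ui + coef["smoke"] * smoke + coef["ht"] * ht
    return value


# Hypothetical mother: age 25, 130 lbs, all factors at their baselines.
baseline = fitted_bwt(age=25, lwt=130)
smoker = fitted_bwt(age=25, lwt=130, smoke=1)
print(round(baseline, 3), round(smoker - baseline, 3))
```

The difference between the two fitted values equals the smoke coefficient whatever age and lwt are used, which is what a common slope implies.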
Page 21/23
Model 5
This model was the one selected by forward selection; it ignores age.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3149.946 270.774 11.633 < 2e-16 ***
ui -557.875 134.964 -4.134 5.48e-05 ***
race2 -538.352 189.439 -2.842 0.005006 **
race3 -495.021 134.440 -3.682 0.000306 ***
smoke -516.258 135.190 -3.819 0.000185 ***
ht -1812.387 715.490 -2.533 0.012166 *
lwt 2.506 1.807 1.387 0.167216
ht:lwt 8.335 4.503 1.851 0.065798 .
race2:smoke 76.638 292.516 0.262 0.793625
race3:smoke 493.675 246.585 2.002 0.046789 *

To interpret the terms in race and smoke, we need to make a table of the joint effects:

Page 22/23
Interpreting the joint effect of smoking and race
Let's hold all other variables fixed. What do the various race/smoking combinations add to the birthweight?

               Race=White   Race=Black   Race=Other
Smoking=No          0          -538         -495
Smoking=Yes      -516          -978         -518

NB: for example, -978 is approximately -538.352 - 516.258 + 76.638 = race2 + smoke + race2:smoke.
This makes the joint effect clear: the combination of Black and smoking is associated with the largest drop in birthweight.
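The whole table can be rebuilt from the Model 5 estimates in the same way as the -978 entry. This is an illustrative sketch; the function and dictionary names are not from the lecture.

```python
# Estimates copied from the Model 5 output.
coef = {"race2": -538.352, "race3": -495.021, "smoke": -516.258,
        "race2:smoke": 76.638, "race3:smoke": 493.675}


def joint_effect(race, smoke):
    """Amount added to birthweight relative to White non-smokers (race 1, smoke 0)."""
    effect = 0.0
    if race != 1:
        effect += coef[f"race{race}"]
    if smoke:
        effect += coef["smoke"]
        if race != 1:
            effect += coef[f"race{race}:smoke"]
    return effect


race_names = {1: "White", 2: "Black", 3: "Other"}
table = {(race_names[race], smoke): round(joint_effect(race, smoke))
         for race in (1, 2, 3) for smoke in (0, 1)}

for smoke in (0, 1):
    print("Smoking=Yes" if smoke else "Smoking=No ",
          [table[(name, smoke)] for name in race_names.values()])
```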

Page 23/23
Overall conclusions

1. Can't fit overly complex models with insufficient data.
2. Even if we can get estimates, they will be too variable to interpret.
3. The three models examined seemed very different but on reflection told the same story, i.e.
   - Age has little or no effect on birthweight.
   - Increasing mother's weight is associated with an increase in birthweight.
   - The factors are all associated with a decrease in birthweight relative to their baselines.

Page 24/23
