
STATS 330: Lecture 14

Variable Selection 2

20.08.2014
Aims of today's lecture

I To look at some examples of all possible regressions.

I To describe some further techniques for selecting
the explanatory variables for a regression.

I To compare the techniques and apply them to
several examples.
Example: The evaporation data

I This data set was discussed in Tutorial 2.

I The response variable was evap, the amount of moisture
evaporating from the soil in a 24-hour period.

I There are 10 explanatory variables: six measurements of
temperature, three measurements of humidity, and the
average wind speed. A scatterplot matrix of the data is
shown on the next slide; a sketch of the code is below.
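A plot like the one on the next slide can be drawn with base R (the course slides may use a helper function, but the following minimal sketch, assuming evap.df from the R330 package, shows the same information):

> library(R330)
> data(evap.df)
> pairs(evap.df)            # scatterplot matrix of all variables
> round(cor(evap.df), 2)    # the correlations printed in the panels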
Example: The evaporation data
[Figure: scatterplot matrix of the ten explanatory variables (avst, minst, maxst, avat, minat, maxat, avh, minh, maxh, wind) and the response evap, with pairwise correlations printed in the panels. The temperature variables are highly correlated with one another (mostly 0.78 to 0.95), and evap shows displayed correlations with individual predictors of up to 0.83.]
Example: The evaporation data

I There are strong relationships between the variables, so we
probably do not need them all. We can perform an
all-possible-regressions analysis using the following code.

> library(R330)
> data(evap.df)
> evap.lm <- lm(evap~.,data=evap.df)
> allpossregs(evap~.,data=evap.df)
Example: The evaporation data
Call:
lm(formula = evap ~ ., data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -54.074877 130.720826 -0.414 0.68164
avst 2.231782 1.003882 2.223 0.03276 *
minst 0.204854 1.104523 0.185 0.85393
maxst -0.742580 0.349609 -2.124 0.04081 *
avat 0.501055 0.568964 0.881 0.38452
minat 0.304126 0.788877 0.386 0.70219
maxat 0.092187 0.218054 0.423 0.67505
avh 1.109858 1.133126 0.979 0.33407
minh 0.751405 0.487749 1.541 0.13242
maxh -0.556292 0.161602 -3.442 0.00151 **
wind 0.008918 0.009167 0.973 0.33733
---
Residual standard error: 6.508 on 35 degrees of freedom
Multiple R-squared: 0.8463, Adjusted R-squared: 0.8023
F-statistic: 19.27 on 10 and 35 DF, p-value: 2.073e-11
Example: The evaporation data
rssp sigma2 adjRsq Cp AIC BIC CV
1 3071.255 69.801 0.674 30.519 76.519 80.177 301.047
2 2101.113 48.863 0.772 9.612 55.612 61.098 220.819
3 1879.949 44.761 0.791 6.390 52.390 59.705 206.121
4 1696.789 41.385 0.807 4.065 50.065 59.208 207.489
5 1599.138 39.978 0.813 3.759 49.759 60.731 209.986
6 1552.033 39.796 0.814 4.647 50.647 63.448 215.114
7 1521.227 40.032 0.813 5.920 51.920 66.549 231.108
8 1490.602 40.287 0.812 7.197 53.197 69.654 243.737
9 1483.733 41.215 0.808 9.034 55.034 73.321 257.513
10 1482.277 42.351 0.802 11.000 57.000 77.115 275.554
avst minst maxst avat minat maxat avh minh maxh wind
1 0 0 0 0 0 0 0 0 1 0
2 0 0 0 0 0 1 0 0 1 0
3 0 0 0 0 0 1 0 0 1 1
4 1 0 1 0 0 1 0 0 1 0
5 1 0 1 0 0 1 0 1 1 0
6 1 0 1 0 0 1 0 1 1 1
7 1 0 1 1 0 0 1 1 1 1
8 1 0 1 1 0 1 1 1 1 1
9 1 0 1 1 1 1 1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Example: The evaporation data
Cp Plot

[Figure: Cp plotted against the number of variables, with each candidate model labelled by the indices of its variables (1 = avst, ..., 10 = wind). The full model 1,2,...,10 sits at Cp = 11; the lowest points are the models 1,3,6,9 and 1,3,6,8,9, with Cp near 4.]
Example: The evaporation data
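This is the three-variable model (maxat, maxh, wind) that minimized the cross-validation error in the table above. A minimal sketch of the call that produces the summary below:

> best3.lm <- lm(evap ~ maxat + maxh + wind, data = evap.df)
> summary(best3.lm)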

Call:
lm(formula = evap ~ maxat + maxh + wind, data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 123.901800 24.624411 5.032 9.60e-06 ***
maxat 0.222768 0.059113 3.769 0.000506 ***
maxh -0.342915 0.042776 -8.016 5.31e-10 ***
wind 0.015998 0.007197 2.223 0.031664 *
---
Residual standard error: 6.69 on 42 degrees of freedom
Multiple R-squared: 0.805, Adjusted R-squared: 0.7911
F-statistic: 57.8 on 3 and 42 DF, p-value: 5.834e-15
Variable selection: Stepwise method

I In the previous lecture, we mentioned a second class of
methods for variable selection: stepwise methods.

I The idea here is to perform a sequence of steps to eliminate
variables from the regression, or add variables to the
regression (or perhaps both).

I Three variations:
I Backward Elimination (BE),
I Forward Selection (FS),
I Stepwise Regression (SR, a combination of BE and FS).

I Invented when computing power was limited (this does NOT
mean these methods are obsolete now).
Stepwise method: Backward elimination

1. Start with the full model with k variables.

2. Remove variables one at a time, recording the AIC each time.

3. Retain the best (k − 1)-variable model (the one with the
smallest AIC).

4. Repeat steps 2 and 3 until no removal improves the AIC or no
variables are left.
Backward elimination in R

I Use the R function step.

I We need to define an initial model and a scope.


> library(R330)
> data(fatty.df)
> fatty.lm <- lm(ffa~.,data=fatty.df)
> step(fatty.lm,scope=formula(fatty.lm),
direction="backward")
Backward elimination: Free Fatty Acids
Start: AIC=-56.6 ### full model AIC
ffa ~ age + weight + skinfold

Df Sum of Sq RSS AIC
- skinfold 1 0.00305 0.79417 -58.524
<none> 0.79113 -56.601
- age 1 0.11117 0.90230 -55.971
- weight 1 0.52985 1.32098 -48.347

Step: AIC=-58.52
ffa ~ age + weight

Df Sum of Sq RSS AIC
<none> 0.79417 -58.524
- age 1 0.11590 0.91007 -57.799
- weight 1 0.54993 1.34410 -50.000
Stepwise method: Forward selection

1. Start with the null model.

2. Fit all one-variable models in turn. Pick the model with the
smallest AIC.

3. Fit all two-variable models that contain the variable selected
in step 2. Pick the one for which the added variable gives the
best AIC.

4. Continue until adding further variables does not improve the
AIC.
Forward selection in R

I Use the R function step.

I We need to define an initial model and a scope.


> library(R330)
> data(fatty.df)
> fatty.lm <- lm(ffa~.,data=fatty.df)
> null.lm <- lm(ffa~1,data=fatty.df)
> step(null.lm,scope=formula(fatty.lm),
direction="forward")
Forward selection: Free Fatty Acids
Start: AIC=-49.16 ### null model AIC
ffa ~ 1

Df Sum of Sq RSS AIC
+ weight 1 0.63906 0.91007 -57.799
+ age 1 0.20503 1.34410 -50.000
<none> 1.54913 -49.161
+ skinfold 1 0.00145 1.54768 -47.179

Step: AIC=-57.8 ### added weight
ffa ~ weight

Df Sum of Sq RSS AIC
+ age 1 0.115900 0.79417 -58.524
<none> 0.91007 -57.799
+ skinfold 1 0.007778 0.90230 -55.971
Forward selection: Free Fatty Acids

Step: AIC=-58.52 ### added age
ffa ~ weight + age

Df Sum of Sq RSS AIC
<none> 0.79417 -58.524
+ skinfold 1 0.003046 0.79113 -56.601

Call: ### Final model
lm(formula = ffa ~ weight + age, data = fatty.df)

Coefficients:
(Intercept) weight age
3.78333 -0.02027 -0.01783
Stepwise method: Stepwise Regression

I A combination of BE and FS.

I Start with the null model.

I Repeat:
I one step of FS,
I one step of BE.

I Stop when no improvement in AIC is possible.
Stepwise Regression in R

> library(R330)
> data(fatty.df)
> fatty.lm <- lm(ffa~.,data=fatty.df)
> null.lm <- lm(ffa~1,data=fatty.df)
> step(null.lm,scope=formula(fatty.lm),
direction="both")
Stepwise Regression for Evaporation data

> library(R330)
> data(evap.df)
> evap.lm <- lm(evap~.,data=evap.df)
> null.lm <- lm(evap~1,data=evap.df)
> step(null.lm,scope=formula(evap.lm),
direction="both")

### Final output


Call:
lm(formula = evap ~ maxh + maxat + wind + maxst + avst,
data = evap.df)

Coefficients:
(Intercept) maxh maxat wind
70.53895 -0.32310 0.36375 0.01089
maxst avst
-0.48809 1.19629
Conclusions

I APR suggested models with the following variables:
I maxat, maxh, wind for cross-validation;
I avst, maxst, maxat, maxh for BIC;
I avst, maxst, maxat, minh, maxh for AIC.

I Stepwise regression proposed a model with the following variables:
I maxat, avst, maxst, maxh, and wind.


Caveats

I There is no guarantee that the stepwise algorithm will find the
best prediction model.

I The selected model usually has an inflated R², and standard
errors and p-values that are too low.

I Collinearity can make the model selected quite arbitrary;
collinearity is a data property, not a model property.

I For both methods of variable selection, do not trust p-values
from the final fitted model; resist the temptation to delete
variables that are not significant.
Schematic for selection methods

The lattice of all subsets of four variables, arranged by size:

4 1234
3 123 124 134 234
2 12 13 14 23 24 34
1 1 2 3 4
0 {}

APR visits every model in the lattice; BE starts at the top and
moves down one level at a time; FS starts at the bottom and moves
up; stepwise regression can move in either direction.
A cautionary example: Body fat

[Image credit: http://www.builtlean.com/]
A cautionary example: Body fat

I Body fat data: see Assignment 3, 2010.

I Response: percent body fat (PercentB).

I 14 explanatory variables:
Abdomen, Age, Ankle, Biceps, BMI, Chest, Forearm,
Height, Hip, Knee, Neck, Thigh, Weight, Wrist.
Body fat: Assume true model
Call:
lm(formula = PBF ~ Age + BMI + Neck + Chest + Abdomen
+ Hip + Thigh + Forearm + Wrist, data = body.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.58236 6.92503 0.517 0.6054
Age 0.06639 0.02783 2.385 0.0179 *
BMI 0.45691 0.23431 1.950 0.0524 .
Neck -0.38219 0.20718 -1.845 0.0663 .
Chest -0.16948 0.08977 -1.888 0.0603 .
Abdomen 0.79495 0.08233 9.656 < 2e-16 ***
Hip -0.26434 0.11641 -2.271 0.0241 *
Thigh 0.18358 0.12345 1.487 0.1383
Forearm 0.26670 0.18125 1.471 0.1425
Wrist -1.82934 0.45469 -4.023 7.73e-05 ***
---
Residual standard error: 3.92 on 236 degrees of freedom
Multiple R-squared: 0.7481, Adjusted R-squared: 0.7385
F-statistic: 77.87 on 9 and 236 DF, p-value: < 2.2e-16
Body fat: Simulation study

I Using R, generate 200 sets of data from this model, using the
same Xs but new random errors each time.

I For each set, choose a model by BE and record the coefficients.
If a variable is not in the chosen model, record its coefficient
as 0. (A sketch of this simulation is given below.)

I Results are summarized on the next slide.
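A minimal sketch of this simulation, assuming the "true" model and the body.df data frame from the previous slide; the object names (true.lm, coef.mat, sim.df) are illustrative:

# Fit the assumed true model; keep its mean vector and error sd
true.lm <- lm(PBF ~ Age + BMI + Neck + Chest + Abdomen + Hip +
              Thigh + Forearm + Wrist, data = body.df)
mu <- fitted(true.lm)              # true means (same Xs every time)
sigma <- summary(true.lm)$sigma    # true error sd
full.formula <- PBF ~ Age + Weight + Height + BMI + Neck + Chest +
  Abdomen + Hip + Knee + Ankle + Biceps + Forearm + Wrist + Thigh

vars <- all.vars(full.formula)[-1] # the 14 explanatory variables
coef.mat <- matrix(0, 200, length(vars), dimnames = list(NULL, vars))
sim.df <- body.df
for (j in 1:200) {
  sim.df$PBF <- mu + rnorm(length(mu), 0, sigma)  # new random errors
  full.lm <- lm(full.formula, data = sim.df)
  be.lm <- step(full.lm, direction = "backward", trace = 0)
  b <- coef(be.lm)[-1]             # drop the intercept
  coef.mat[j, names(b)] <- b       # unselected variables stay 0
}
colSums(coef.mat != 0)             # selection counts, as in the table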


Body fat: Variable frequency

Variable Coefficient Count
Age 0.06639 154
Weight 0.00000 57
Height 0.00000 83
BMI 0.45691 101
Neck -0.38219 122
Chest -0.16948 130
Abdomen 0.79495 200
Hip -0.26434 149
Thigh 0.18358 96
Knee 0.00000 30
Ankle 0.00000 38
Biceps 0.00000 41
Forearm 0.26670 98
Wrist -1.82934 194

True model selected only 6 times out of 200!


Body fat: Distribution of estimates

[Figure: histograms of the 200 BE coefficient estimates for each term (Intercept, Age, Weight, Height, BMI, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, Wrist). The spike at 0 in many panels corresponds to simulations in which the variable was not selected.]

Body fat: Bias in coefficient of abdomen

I Suppose we want to estimate the coefficient of abdomen.

I Various strategies:

1. Pick a model using BE; use the coefficient of abdomen in the chosen model.

2. Pick a model using BIC; use the coefficient of abdomen in the chosen model.

3. Pick a model using AIC; use the coefficient of abdomen in the chosen model.

4. Pick a model using adjusted R²; use the coefficient of abdomen in the chosen model.

5. Use the coefficient of abdomen in the full model.

I Which is best? Generate 200 data sets and compare (a sketch follows below).
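A minimal sketch comparing two of these strategies (BE and the full model), reusing mu, sigma, sim.df, full.formula, and true.lm from the simulation sketch above; the BIC, AIC, and adjusted-R² strategies would slot in the same way:

true.b <- coef(true.lm)["Abdomen"]
est <- matrix(NA, 200, 2, dimnames = list(NULL, c("BE", "Full")))
for (j in 1:200) {
  sim.df$PBF <- mu + rnorm(length(mu), 0, sigma)
  full.lm <- lm(full.formula, data = sim.df)
  be.lm <- step(full.lm, direction = "backward", trace = 0)
  b.be <- coef(be.lm)["Abdomen"]   # NA if abdomen was dropped by BE
  est[j, "BE"] <- ifelse(is.na(b.be), 0, b.be)
  est[j, "Full"] <- coef(full.lm)["Abdomen"]
}
1000 * colMeans((est - true.b)^2)  # 1000 x MSE, as on the next slide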


Body fat: Bias results

I The table gives 1000 × MSE, i.e. the average of the squared
differences between the estimate and the true value over the
200 simulations:

\mathrm{MSE} = \frac{1}{200} \sum_{j=1}^{200} (\text{estimate}_j - \text{true value})^2

Full BE BIC AIC AdjR²
6.43 7.27 4.80 2.91 8.03

I Thus, AIC is best!


Estimating the optimism in R²

I We noted (caveats slide) that the R² for the selected model is
usually higher than the R² for the same model fitted to new data.

I How can we adjust for this?

I If we have plenty of data, we can split the data into a training
set and a test set, select the model using the training set,
then calculate the R² on the test set.
Georgia voting data: Background

I In the 2000 US presidential election, some voters had their
ballots declared invalid for various reasons.

I In this data set, the response is the undercount (the
difference between the votes cast and the votes declared
valid).

I Each observation is a Georgia county, of which there were
159. We removed 4 outliers, leaving 155 counties.

I We will consider a model with 5 explanatory variables:

undercount ~ perAA + rural + gore + bush + other

I The dataset gavote is in the faraway package.

Georgia voting data: Calculating the optimism

I We split the data into two parts at random: a training set of
80 counties and a test set of 75.

I Using the training set, we selected a model using stepwise
regression and calculated its R².

I We then took the chosen model and recalculated the R² on the
test set. The difference is the optimism.

I We repeated this for 50 random 80/75 splits, giving 50
training-set R² values and 50 test-set R² values. (A sketch of
this experiment is given below.)
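A minimal sketch of this experiment, assuming the 155-county data are in a data frame ga.data containing undercount and the five explanatory variables; the helper R2 (an illustrative name) computes the R² of a fitted model on arbitrary data:

R2 <- function(fit, newdata) {
  # R-squared of a fitted model evaluated on (possibly new) data
  y <- newdata$undercount
  1 - sum((y - predict(fit, newdata))^2) / sum((y - mean(y))^2)
}
train.R2 <- test.R2 <- numeric(50)
for (j in 1:50) {
  train.idx <- sample(nrow(ga.data), 80)
  train.df <- ga.data[train.idx, ]
  test.df <- ga.data[-train.idx, ]
  null.lm <- lm(undercount ~ 1, data = train.df)
  full.lm <- lm(undercount ~ ., data = train.df)
  swr.lm <- step(null.lm, scope = formula(full.lm),
                 direction = "both", trace = 0)
  train.R2[j] <- summary(swr.lm)$r.squared
  test.R2[j] <- R2(swr.lm, test.df)  # optimism = train.R2 - test.R2
}
boxplot(list(training = train.R2, test = test.R2))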
Georgia voting data: Comparisons of the R 2 values

[Figure: side-by-side boxplots of the 50 training-set and 50 test-set R² values (vertical axis roughly 0.3 to 0.9); the training R² values are systematically higher than the test R² values.]
Georgia voting data: Comparisons of the R 2 values
[Figure: histogram of the 50 differences (training R² − test R²), ranging from about −0.2 to 0.6; most differences are positive.]
What if there is no test set?

I If the data are too few to split into training and test sets, and
we choose the model using all the data and compute the R², it
will most likely be too big.

I Perhaps we can estimate the optimism and subtract it from the
R² for the chosen model, thus correcting the R².

I We need to estimate the optimism averaged over all possible
data sets.

I But we have only one! How do we proceed?

Estimating the optimism

I The optimism is

R²(SWR, data) − True R².

I This depends on the true, unknown distribution of the data.

I We do not know this distribution, but it is approximated by the
empirical distribution, which puts probability 1/n at each data
point.

I NB: SWR = stepwise regression.


Resampling

I We can draw a sample from the empirical distribution by
choosing a sample of size n at random with replacement from
the original data (n = the number of observations in the
original data). A one-line sketch is given below.

I Such a sample is also called a bootstrap sample or a resample.
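A minimal sketch in R, assuming the data are in a data frame df:

> n <- nrow(df)
> resample.df <- df[sample(n, n, replace = TRUE), ]  # bootstrap sample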


Empirical Optimism

I The empirical optimism is

R²(SWR, resampled data) − R²(chosen model, original data),

i.e. we run SWR on a resample, record its R², then evaluate the
chosen model's R² on the original data and subtract.

I We can generate as many values of this estimate as we like by
repeatedly drawing samples from the empirical distribution, i.e.
by resampling.
Empirical Optimism

To correct the R²:

I Compute the empirical optimism.

I Repeat for, say, 200 resamples and average the 200 optimisms.

I This average is our estimated optimism.

I Correct the original R² for the chosen model by subtracting off
the estimated optimism.

I The result is the bootstrap-corrected R². (A sketch is given
below.)
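A minimal sketch of the whole correction, reusing ga.data and the R2 helper from the earlier sketch; swr.fit is an illustrative helper that runs stepwise regression on a data frame:

swr.fit <- function(df) {
  # stepwise regression (both directions) starting from the null model
  null.lm <- lm(undercount ~ 1, data = df)
  full.lm <- lm(undercount ~ ., data = df)
  step(null.lm, scope = formula(full.lm), direction = "both", trace = 0)
}
orig.R2 <- summary(swr.fit(ga.data))$r.squared
n <- nrow(ga.data)
opt <- numeric(200)
for (b in 1:200) {
  boot.df <- ga.data[sample(n, n, replace = TRUE), ]
  boot.lm <- swr.fit(boot.df)
  # empirical optimism: apparent R2 on the resample minus the R2 of
  # the chosen model evaluated on the original data
  opt[b] <- summary(boot.lm)$r.squared - R2(boot.lm, ga.data)
}
orig.R2 - mean(opt)  # bootstrap-corrected R2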


How well does it work?

[Figure: boxplots of the training, bootstrap-corrected, and test R² values (vertical axis roughly 0.3 to 1.0); the bootstrap-corrected values sit closer to the test values than the raw training values do.]


Bootstrap estimate of prediction error

I We can also use the bootstrap to estimate prediction error.

I Calculating the prediction error from the training set
underestimates the error.

I We estimate the optimism from a resample.

I Repeat and average, as before.


Estimating prediction errors in R

> library(R330)
> ga.lm <- lm(undercount~.,data=ga.data)
> cross.val(ga.lm)
Cross-validated estimate of root
mean square prediction error = 243.0148

> err.boot(ga.lm)
$err ### apparent error, estimated from the training data (too small)
[1] 43944.99

$Err ### bootstrap-corrected estimate of prediction error
[1] 54312.51

> sqrt(54312.51) ### on the RMSE scale; compare the CV estimate above
[1] 233.0504
Thank you!
