
STATS 330: Lecture 14

Variable Selection 2

20.08.2014
Aims of today's lecture

I To look at some examples of all possible regressions.

I To describe some further techniques for selecting
the explanatory variables for a regression.

I To compare the techniques and apply them to
several examples.
Example: The evaporation data

I This data set was discussed in Tutorial 2.

I The response variable was evap, the amount of moisture
evaporating from the soil in a 24-hour period.

I There are 10 explanatory variables: six measurements of
temperature, three measurements of humidity, and the
average wind speed. A scatterplot matrix of the data is
shown on the next slide; a sketch of the code is below.
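A plot like the one on the next slide can be drawn with base R (the course slides may use a helper function, but the following minimal sketch, assuming evap.df from the R330 package, shows the same information):

> library(R330)
> data(evap.df)
> pairs(evap.df)            # scatterplot matrix of all variables
> round(cor(evap.df), 2)    # the correlations printed in the panels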
Example: The evaporation data
[Figure: scatterplot matrix of the ten explanatory variables (avst, minst, maxst, avat, minat, maxat, avh, minh, maxh, wind) and the response evap, with pairwise correlations printed in the panels. The temperature variables are highly correlated with one another (mostly 0.78 to 0.95), and evap shows displayed correlations with individual predictors of up to 0.83.]
Example: The evaporation data

I There are strong relationships between the variables, so we
probably do not need them all. We can perform an
all-possible-regressions analysis using the following code.

> library(R330)
> data(evap.df)
> evap.lm <- lm(evap~.,data=evap.df)
> allpossregs(evap~.,data=evap.df)
Example: The evaporation data
Call:
lm(formula = evap ~ ., data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -54.074877 130.720826 -0.414 0.68164
avst 2.231782 1.003882 2.223 0.03276 *
minst 0.204854 1.104523 0.185 0.85393
maxst -0.742580 0.349609 -2.124 0.04081 *
avat 0.501055 0.568964 0.881 0.38452
minat 0.304126 0.788877 0.386 0.70219
maxat 0.092187 0.218054 0.423 0.67505
avh 1.109858 1.133126 0.979 0.33407
minh 0.751405 0.487749 1.541 0.13242
maxh -0.556292 0.161602 -3.442 0.00151 **
wind 0.008918 0.009167 0.973 0.33733
---
Residual standard error: 6.508 on 35 degrees of freedom
Multiple R-squared: 0.8463, Adjusted R-squared: 0.8023
F-statistic: 19.27 on 10 and 35 DF, p-value: 2.073e-11
Example: The evaporation data
rssp sigma2 adjRsq Cp AIC BIC CV
1 3071.255 69.801 0.674 30.519 76.519 80.177 301.047
2 2101.113 48.863 0.772 9.612 55.612 61.098 220.819
3 1879.949 44.761 0.791 6.390 52.390 59.705 206.121
4 1696.789 41.385 0.807 4.065 50.065 59.208 207.489
5 1599.138 39.978 0.813 3.759 49.759 60.731 209.986
6 1552.033 39.796 0.814 4.647 50.647 63.448 215.114
7 1521.227 40.032 0.813 5.920 51.920 66.549 231.108
8 1490.602 40.287 0.812 7.197 53.197 69.654 243.737
9 1483.733 41.215 0.808 9.034 55.034 73.321 257.513
10 1482.277 42.351 0.802 11.000 57.000 77.115 275.554
avst minst maxst avat minat maxat avh minh maxh wind
1 0 0 0 0 0 0 0 0 1 0
2 0 0 0 0 0 1 0 0 1 0
3 0 0 0 0 0 1 0 0 1 1
4 1 0 1 0 0 1 0 0 1 0
5 1 0 1 0 0 1 0 1 1 0
6 1 0 1 0 0 1 0 1 1 1
7 1 0 1 1 0 0 1 1 1 1
8 1 0 1 1 0 1 1 1 1 1
9 1 0 1 1 1 1 1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Example: The evaporation data
Cp Plot

[Figure: Cp plotted against the number of variables, with each candidate model labelled by the indices of its variables (1 = avst, ..., 10 = wind). The full model 1,2,...,10 sits at Cp = 11; the lowest points are the models 1,3,6,9 and 1,3,6,8,9, with Cp near 4.]
Example: The evaporation data
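This is the three-variable model (maxat, maxh, wind) that minimized the cross-validation error in the table above. A minimal sketch of the call that produces the summary below:

> best3.lm <- lm(evap ~ maxat + maxh + wind, data = evap.df)
> summary(best3.lm)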

Call:
lm(formula = evap ~ maxat + maxh + wind, data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 123.901800 24.624411 5.032 9.60e-06 ***
maxat 0.222768 0.059113 3.769 0.000506 ***
maxh -0.342915 0.042776 -8.016 5.31e-10 ***
wind 0.015998 0.007197 2.223 0.031664 *
---
Residual standard error: 6.69 on 42 degrees of freedom
Multiple R-squared: 0.805, Adjusted R-squared: 0.7911
F-statistic: 57.8 on 3 and 42 DF, p-value: 5.834e-15
Variable selection: Stepwise method

I In the previous lecture, we mentioned a second class of
methods for variable selection: stepwise methods.

I The idea here is to perform a sequence of steps to eliminate
variables from the regression, or add variables to the
regression (or perhaps both).

I Three variations:
I Backward Elimination (BE),
I Forward Selection (FS),
I Stepwise Regression (SR, a combination of BE and FS).

I Invented when computing power was limited (this does NOT
mean these methods are obsolete now).
Stepwise method: Backward elimination

1. Start with the full model with k variables.

2. Remove variables one at a time, recording the AIC each time.

3. Retain the best (k − 1)-variable model (the one with the
smallest AIC).

4. Repeat steps 2 and 3 until no removal improves the AIC or no
variables are left.
Backward elimination in R

I Use the R function step.

I We need to define an initial model and a scope.


> library(R330)
> data(fatty.df)
> fatty.lm <- lm(ffa~.,data=fatty.df)
> step(fatty.lm,scope=formula(fatty.lm),
direction="backward")
Backward elimination: Free Fatty Acids
Start: AIC=-56.6 ### full model AIC
ffa ~ age + weight + skinfold

Df Sum of Sq RSS AIC
- skinfold 1 0.00305 0.79417 -58.524
<none> 0.79113 -56.601
- age 1 0.11117 0.90230 -55.971
- weight 1 0.52985 1.32098 -48.347

Step: AIC=-58.52
ffa ~ age + weight

Df Sum of Sq RSS AIC
<none> 0.79417 -58.524
- age 1 0.11590 0.91007 -57.799
- weight 1 0.54993 1.34410 -50.000
Stepwise method: Forward selection

1. Start with the null model.

2. Fit all one-variable models in turn. Pick the model with the
smallest AIC.

3. Fit all two-variable models that contain the variable selected
in step 2. Pick the one for which the added variable gives the
best AIC.

4. Continue until adding further variables does not improve the
AIC.
Forward selection in R

I Use the R function step.

I We need to define an initial model and a scope.


> library(R330)
> data(fatty.df)
> fatty.lm <- lm(ffa~.,data=fatty.df)
> null.lm <- lm(ffa~1,data=fatty.df)
> step(null.lm,scope=formula(fatty.lm),
direction="forward")
Forward selection: Free Fatty Acids
Start: AIC=-49.16 ### null model AIC
ffa ~ 1

Df Sum of Sq RSS AIC
+ weight 1 0.63906 0.91007 -57.799
+ age 1 0.20503 1.34410 -50.000
<none> 1.54913 -49.161
+ skinfold 1 0.00145 1.54768 -47.179

Step: AIC=-57.8 ### added weight
ffa ~ weight

Df Sum of Sq RSS AIC
+ age 1 0.115900 0.79417 -58.524
<none> 0.91007 -57.799
+ skinfold 1 0.007778 0.90230 -55.971
Forward selection: Free Fatty Acids

Step: AIC=-58.52 ### added age
ffa ~ weight + age

Df Sum of Sq RSS AIC
<none> 0.79417 -58.524
+ skinfold 1 0.003046 0.79113 -56.601

Call: ### Final model
lm(formula = ffa ~ weight + age, data = fatty.df)

Coefficients:
(Intercept) weight age
3.78333 -0.02027 -0.01783
Stepwise method: Stepwise Regression

I A combination of BE and FS.

I Start with the null model.

I Repeat:
I one step of FS,
I one step of BE.

I Stop when no improvement in AIC is possible.
Stepwise Regression in R

> library(R330)
> data(fatty.df)
> fatty.lm <- lm(ffa~.,data=fatty.df)
> null.lm <- lm(ffa~1,data=fatty.df)
> step(null.lm,scope=formula(fatty.lm),
direction="both")
Stepwise Regression for Evaporation data

> library(R330)
> data(evap.df)
> evap.lm <- lm(evap~.,data=evap.df)
> null.lm <- lm(evap~1,data=evap.df)
> step(null.lm,scope=formula(evap.lm),
direction="both")

### Final output


Call:
lm(formula = evap ~ maxh + maxat + wind + maxst + avst,
data = evap.df)

Coefficients:
(Intercept) maxh maxat wind
70.53895 -0.32310 0.36375 0.01089
maxst avst
-0.48809 1.19629
Conclusions

I APR suggested models with the following variables:
I maxat, maxh, wind for cross-validation;
I avst, maxst, maxat, maxh for BIC;
I avst, maxst, maxat, minh, maxh for AIC.

I Stepwise regression proposed a model with the following variables:
I maxat, avst, maxst, maxh, and wind.


Caveats

I There is no guarantee that the stepwise algorithm will find the
best prediction model.

I The selected model usually has an inflated R², and standard
errors and p-values that are too low.

I Collinearity can make the model selected quite arbitrary;
collinearity is a data property, not a model property.

I For both methods of variable selection, do not trust p-values
from the final fitted model; resist the temptation to delete
variables that are not significant.
Schematic for selection methods

The lattice of all subsets of four variables, arranged by size:

4 1234
3 123 124 134 234
2 12 13 14 23 24 34
1 1 2 3 4
0 {}

APR visits every model in the lattice; BE starts at the top and
moves down one level at a time; FS starts at the bottom and moves
up; stepwise regression can move in either direction.
A cautionary example: Body fat

[Image credit: http://www.builtlean.com/]
A cautionary example: Body fat

I Body fat data: see Assignment 3, 2010.

I Response: percent body fat (PercentB).

I 14 explanatory variables:
Abdomen, Age, Ankle, Biceps, BMI, Chest, Forearm,
Height, Hip, Knee, Neck, Thigh, Weight, Wrist.
Body fat: Assume true model
Call:
lm(formula = PBF ~ Age + BMI + Neck + Chest + Abdomen
+ Hip + Thigh + Forearm + Wrist, data = body.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.58236 6.92503 0.517 0.6054
Age 0.06639 0.02783 2.385 0.0179 *
BMI 0.45691 0.23431 1.950 0.0524 .
Neck -0.38219 0.20718 -1.845 0.0663 .
Chest -0.16948 0.08977 -1.888 0.0603 .
Abdomen 0.79495 0.08233 9.656 < 2e-16 ***
Hip -0.26434 0.11641 -2.271 0.0241 *
Thigh 0.18358 0.12345 1.487 0.1383
Forearm 0.26670 0.18125 1.471 0.1425
Wrist -1.82934 0.45469 -4.023 7.73e-05 ***
---
Residual standard error: 3.92 on 236 degrees of freedom
Multiple R-squared: 0.7481, Adjusted R-squared: 0.7385
F-statistic: 77.87 on 9 and 236 DF, p-value: < 2.2e-16
Body fat: Simulation study

I Using R, generate 200 sets of data from this model, using the
same Xs but new random errors each time.

I For each set, choose a model by BE and record the coefficients.
If a variable is not in the chosen model, record its coefficient
as 0. (A sketch of this simulation is given below.)

I Results are summarized on the next slide.
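A minimal sketch of this simulation, assuming the "true" model and the body.df data frame from the previous slide; the object names (true.lm, coef.mat, sim.df) are illustrative:

# Fit the assumed true model; keep its mean vector and error sd
true.lm <- lm(PBF ~ Age + BMI + Neck + Chest + Abdomen + Hip +
              Thigh + Forearm + Wrist, data = body.df)
mu <- fitted(true.lm)              # true means (same Xs every time)
sigma <- summary(true.lm)$sigma    # true error sd
full.formula <- PBF ~ Age + Weight + Height + BMI + Neck + Chest +
  Abdomen + Hip + Knee + Ankle + Biceps + Forearm + Wrist + Thigh

vars <- all.vars(full.formula)[-1] # the 14 explanatory variables
coef.mat <- matrix(0, 200, length(vars), dimnames = list(NULL, vars))
sim.df <- body.df
for (j in 1:200) {
  sim.df$PBF <- mu + rnorm(length(mu), 0, sigma)  # new random errors
  full.lm <- lm(full.formula, data = sim.df)
  be.lm <- step(full.lm, direction = "backward", trace = 0)
  b <- coef(be.lm)[-1]             # drop the intercept
  coef.mat[j, names(b)] <- b       # unselected variables stay 0
}
colSums(coef.mat != 0)             # selection counts, as in the table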


Body fat: Variable frequency

Variable Coefficient Count
Age 0.06639 154
Weight 0.00000 57
Height 0.00000 83
BMI 0.45691 101
Neck -0.38219 122
Chest -0.16948 130
Abdomen 0.79495 200
Hip -0.26434 149
Thigh 0.18358 96
Knee 0.00000 30
Ankle 0.00000 38
Biceps 0.00000 41
Forearm 0.26670 98
Wrist -1.82934 194

True model selected only 6 times out of 200!


Body fat: Distribution of estimates

[Figure: histograms of the 200 BE coefficient estimates for each term (Intercept, Age, Weight, Height, BMI, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, Wrist). The spike at 0 in many panels corresponds to simulations in which the variable was not selected.]

Body fat: Bias in coefficient of abdomen

I Suppose we want to estimate the coefficient of abdomen.

I Various strategies:

1. Pick a model using BE; use the coefficient of abdomen in the chosen model.

2. Pick a model using BIC; use the coefficient of abdomen in the chosen model.

3. Pick a model using AIC; use the coefficient of abdomen in the chosen model.

4. Pick a model using adjusted R²; use the coefficient of abdomen in the chosen model.

5. Use the coefficient of abdomen in the full model.

I Which is best? Generate 200 data sets and compare (a sketch follows below).
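A minimal sketch comparing two of these strategies (BE and the full model), reusing mu, sigma, sim.df, full.formula, and true.lm from the simulation sketch above; the BIC, AIC, and adjusted-R² strategies would slot in the same way:

true.b <- coef(true.lm)["Abdomen"]
est <- matrix(NA, 200, 2, dimnames = list(NULL, c("BE", "Full")))
for (j in 1:200) {
  sim.df$PBF <- mu + rnorm(length(mu), 0, sigma)
  full.lm <- lm(full.formula, data = sim.df)
  be.lm <- step(full.lm, direction = "backward", trace = 0)
  b.be <- coef(be.lm)["Abdomen"]   # NA if abdomen was dropped by BE
  est[j, "BE"] <- ifelse(is.na(b.be), 0, b.be)
  est[j, "Full"] <- coef(full.lm)["Abdomen"]
}
1000 * colMeans((est - true.b)^2)  # 1000 x MSE, as on the next slide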


Body fat: Bias results

I The table gives 1000 × MSE, i.e. the average of the squared
differences between the estimate and the true value over the
200 simulations:

\mathrm{MSE} = \frac{1}{200} \sum_{j=1}^{200} (\text{estimate}_j - \text{true value})^2

Full BE BIC AIC AdjR²
6.43 7.27 4.80 2.91 8.03

I Thus, AIC is best!


Estimating the optimism in R²

I We noted (caveats slide) that the R² for the selected model is
usually higher than the R² for the same model fitted to new data.

I How can we adjust for this?

I If we have plenty of data, we can split the data into a training
set and a test set, select the model using the training set,
then calculate the R² on the test set.
Georgia voting data: Background

I In the 2000 US presidential election, some voters had their
ballots declared invalid for various reasons.

I In this data set, the response is the undercount (the
difference between the votes cast and the votes declared
valid).

I Each observation is a Georgia county, of which there were
159. We removed 4 outliers, leaving 155 counties.

I We will consider a model with 5 explanatory variables:

undercount ~ perAA + rural + gore + bush + other

I The dataset gavote is in the faraway package.

Georgia voting data: Calculating the optimism

I We split the data into two parts at random: a training set of
80 counties and a test set of 75.

I Using the training set, we selected a model using stepwise
regression and calculated its R².

I We then took the chosen model and recalculated the R² on the
test set. The difference is the optimism.

I We repeated this for 50 random 80/75 splits, giving 50
training-set R² values and 50 test-set R² values. (A sketch of
this experiment is given below.)
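A minimal sketch of this experiment, assuming the 155-county data are in a data frame ga.data containing undercount and the five explanatory variables; the helper R2 (an illustrative name) computes the R² of a fitted model on arbitrary data:

R2 <- function(fit, newdata) {
  # R-squared of a fitted model evaluated on (possibly new) data
  y <- newdata$undercount
  1 - sum((y - predict(fit, newdata))^2) / sum((y - mean(y))^2)
}
train.R2 <- test.R2 <- numeric(50)
for (j in 1:50) {
  train.idx <- sample(nrow(ga.data), 80)
  train.df <- ga.data[train.idx, ]
  test.df <- ga.data[-train.idx, ]
  null.lm <- lm(undercount ~ 1, data = train.df)
  full.lm <- lm(undercount ~ ., data = train.df)
  swr.lm <- step(null.lm, scope = formula(full.lm),
                 direction = "both", trace = 0)
  train.R2[j] <- summary(swr.lm)$r.squared
  test.R2[j] <- R2(swr.lm, test.df)  # optimism = train.R2 - test.R2
}
boxplot(list(training = train.R2, test = test.R2))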
Georgia voting data: Comparisons of the R 2 values

[Figure: side-by-side boxplots of the 50 training-set and 50 test-set R² values (vertical axis roughly 0.3 to 0.9); the training R² values are systematically higher than the test R² values.]
Georgia voting data: Comparisons of the R 2 values
[Figure: histogram of the 50 differences (training R² − test R²), ranging from about −0.2 to 0.6; most differences are positive.]
What if there is no test set?

I If the data are too few to split into training and test sets, and
we choose the model using all the data and compute the R², it
will most likely be too big.

I Perhaps we can estimate the optimism and subtract it from the
R² for the chosen model, thus correcting the R².

I We need to estimate the optimism averaged over all possible
data sets.

I But we have only one! How do we proceed?

Estimating the optimism

I The optimism is

R²(SWR, data) − True R².

I This depends on the true, unknown distribution of the data.

I We do not know this distribution, but it is approximated by the
empirical distribution, which puts probability 1/n at each data
point.

I NB: SWR = stepwise regression.


Resampling

I We can draw a sample from the empirical distribution by
choosing a sample of size n at random with replacement from
the original data (n = the number of observations in the
original data). A one-line sketch is given below.

I Such a sample is also called a bootstrap sample or a resample.
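A minimal sketch in R, assuming the data are in a data frame df:

> n <- nrow(df)
> resample.df <- df[sample(n, n, replace = TRUE), ]  # bootstrap sample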


Empirical Optimism

I The empirical optimism is

R²(SWR, resampled data) − R²(chosen model, original data),

i.e. we run SWR on a resample, record its R², then evaluate the
chosen model's R² on the original data and subtract.

I We can generate as many values of this estimate as we like by
repeatedly drawing samples from the empirical distribution, i.e.
by resampling.
Empirical Optimism

To correct the R²:

I Compute the empirical optimism.

I Repeat for, say, 200 resamples and average the 200 optimisms.

I This average is our estimated optimism.

I Correct the original R² for the chosen model by subtracting off
the estimated optimism.

I The result is the bootstrap-corrected R². (A sketch is given
below.)
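A minimal sketch of the whole correction, reusing ga.data and the R2 helper from the earlier sketch; swr.fit is an illustrative helper that runs stepwise regression on a data frame:

swr.fit <- function(df) {
  # stepwise regression (both directions) starting from the null model
  null.lm <- lm(undercount ~ 1, data = df)
  full.lm <- lm(undercount ~ ., data = df)
  step(null.lm, scope = formula(full.lm), direction = "both", trace = 0)
}
orig.R2 <- summary(swr.fit(ga.data))$r.squared
n <- nrow(ga.data)
opt <- numeric(200)
for (b in 1:200) {
  boot.df <- ga.data[sample(n, n, replace = TRUE), ]
  boot.lm <- swr.fit(boot.df)
  # empirical optimism: apparent R2 on the resample minus the R2 of
  # the chosen model evaluated on the original data
  opt[b] <- summary(boot.lm)$r.squared - R2(boot.lm, ga.data)
}
orig.R2 - mean(opt)  # bootstrap-corrected R2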


How well does it work?

[Figure: boxplots of the training, bootstrap-corrected, and test R² values (vertical axis roughly 0.3 to 1.0); the bootstrap-corrected values sit closer to the test values than the raw training values do.]


Bootstrap estimate of prediction error

I We can also use the bootstrap to estimate prediction error.

I Calculating the prediction error from the training set
underestimates the error.

I We estimate the optimism from a resample.

I Repeat and average, as before.


Estimating prediction errors in R

> library(R330)
> ga.lm <- lm(undercount~.,data=ga.data)
> cross.val(ga.lm)
Cross-validated estimate of root
mean square prediction error = 243.0148

> err.boot(ga.lm)
$err ### apparent error, estimated from the training data (too small)
[1] 43944.99

$Err ### bootstrap-corrected estimate of prediction error
[1] 54312.51

> sqrt(54312.51) ### on the RMSE scale; compare the CV estimate above
[1] 233.0504
Thank you!
