Variable Selection 2
20.08.2014
Aims of today's lecture
[Table residue: apparently the pairwise correlations among the evaporation-data variables (avst, minst, maxst, avat, minat, maxat, avh, minh, maxh, wind, evap) together with their observed ranges; the layout did not survive extraction.]
Example: The evaporation data
> library(R330)
> data(evap.df)
> evap.lm <- lm(evap~.,data=evap.df)
> allpossregs(evap~.,data=evap.df)
Call:
lm(formula = evap ~ ., data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -54.074877 130.720826 -0.414 0.68164
avst 2.231782 1.003882 2.223 0.03276 *
minst 0.204854 1.104523 0.185 0.85393
maxst -0.742580 0.349609 -2.124 0.04081 *
avat 0.501055 0.568964 0.881 0.38452
minat 0.304126 0.788877 0.386 0.70219
maxat 0.092187 0.218054 0.423 0.67505
avh 1.109858 1.133126 0.979 0.33407
minh 0.751405 0.487749 1.541 0.13242
maxh -0.556292 0.161602 -3.442 0.00151 **
wind 0.008918 0.009167 0.973 0.33733
---
Residual standard error: 6.508 on 35 degrees of freedom
Multiple R-squared: 0.8463, Adjusted R-squared: 0.8023
F-statistic: 19.27 on 10 and 35 DF, p-value: 2.073e-11
rssp sigma2 adjRsq Cp AIC BIC CV
1 3071.255 69.801 0.674 30.519 76.519 80.177 301.047
2 2101.113 48.863 0.772 9.612 55.612 61.098 220.819
3 1879.949 44.761 0.791 6.390 52.390 59.705 206.121
4 1696.789 41.385 0.807 4.065 50.065 59.208 207.489
5 1599.138 39.978 0.813 3.759 49.759 60.731 209.986
6 1552.033 39.796 0.814 4.647 50.647 63.448 215.114
7 1521.227 40.032 0.813 5.920 51.920 66.549 231.108
8 1490.602 40.287 0.812 7.197 53.197 69.654 243.737
9 1483.733 41.215 0.808 9.034 55.034 73.321 257.513
10 1482.277 42.351 0.802 11.000 57.000 77.115 275.554
avst minst maxst avat minat maxat avh minh maxh wind
1 0 0 0 0 0 0 0 0 1 0
2 0 0 0 0 0 1 0 0 1 0
3 0 0 0 0 0 1 0 0 1 1
4 1 0 1 0 0 1 0 0 1 0
5 1 0 1 0 0 1 0 1 1 0
6 1 0 1 0 0 1 0 1 1 1
7 1 0 1 1 0 0 1 1 1 1
8 1 0 1 1 0 1 1 1 1 1
9 1 0 1 1 1 1 1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
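The criteria columns can be cross-checked numerically. A minimal sketch in Python (the formulas below are inferred from the table's values, not taken from the R330 source): s² is the full model's residual variance, n = 46 observations, and p counts the explanatory variables.

```python
import math

# Inferred formulas, checked against the p = 1 row (maxh only):
#   Cp  = RSS_p / s^2 - n + 2(p + 1)
#   AIC = Cp + n
#   BIC = AIC + (log n - 2)(p + 1)
n = 46          # observations in evap.df
s2 = 42.351     # sigma^2 of the full 10-variable model (row 10)
rss = 3071.255  # RSS of the best one-variable model (row 1)
p = 1

cp = rss / s2 - n + 2 * (p + 1)
aic = cp + n
bic = aic + (math.log(n) - 2) * (p + 1)
print(round(cp, 3), round(aic, 3), round(bic, 3))
```

These reproduce 30.519, 76.519 and 80.177 (to rounding) from the first row of the table.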
[Cp plot: Cp versus number of variables, each point labelled by the indices of the variables in the model (e.g. 6,9 = maxat, maxh); the labels 1,3,6,9 and 1,3,6,8,9 appear near the bottom of the plot.]
Call:
lm(formula = evap ~ maxat + maxh + wind, data = evap.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 123.901800 24.624411 5.032 9.60e-06 ***
maxat 0.222768 0.059113 3.769 0.000506 ***
maxh -0.342915 0.042776 -8.016 5.31e-10 ***
wind 0.015998 0.007197 2.223 0.031664 *
---
Residual standard error: 6.69 on 42 degrees of freedom
Multiple R-squared: 0.805, Adjusted R-squared: 0.7911
F-statistic: 57.8 on 3 and 42 DF, p-value: 5.834e-15
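As a side check, the adjusted R² and the F statistic in this summary follow from R², n and p alone; a quick verification (hypothetical Python, not part of the lecture):

```python
# n observations, p explanatory variables, R^2 from the summary above
n, p, r2 = 46, 3, 0.805
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))
print(round(adj_r2, 4), round(f_stat, 1))  # 0.7911 and 57.8, as reported
```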
Variable selection: Stepwise method
- Three variations:
  - Backward Elimination (BE)
  - Forward Selection (FS)
  - Stepwise Regression, a combination of BE and FS
2. Fit all one-variable models in turn. Pick the model with the smallest AIC.

Step: AIC=-58.52
ffa ~ age + weight

Coefficients:
(Intercept) weight age
3.78333 -0.02027 -0.01783
Stepwise method: Stepwise Regression
- Combination of BE and FS.
- Repeat:
  - one step of FS,
  - one step of BE.
- Stop when no improvement in AIC is possible.
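The mechanics of an AIC-driven search can be sketched as follows. This is a hypothetical Python/numpy stand-in for R's step(), showing the forward-selection half only; aic() here is n log(RSS/n) + 2k, which matches step() up to an additive constant.

```python
import numpy as np

def aic(X, y):
    """AIC, up to an additive constant, of the OLS fit on design X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    return n * np.log(resid @ resid / n) + 2 * X.shape[1]

def forward_select(X, y):
    """Add the variable that lowers AIC most; stop when none does."""
    n, p = X.shape
    chosen, design = [], np.ones((n, 1))   # start from the null model
    best = aic(design, y)
    while True:
        trials = [(aic(np.column_stack([design, X[:, j]]), y), j)
                  for j in range(p) if j not in chosen]
        if not trials or min(trials)[0] >= best:
            return chosen
        best, j = min(trials)
        chosen.append(j)
        design = np.column_stack([design, X[:, j]])

# toy data in which only columns 0 and 2 carry signal
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=60)
sel = forward_select(X, y)
print(sorted(sel))  # contains 0 and 2
```

A full stepwise search (direction="both") would interleave a backward-elimination pass after each forward step, dropping any chosen variable whose removal lowers AIC.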
Stepwise Regression in R
> library(R330)
> data(fatty.df)
> fatty.lm <- lm(ffa~.,data=fatty.df)
> null.lm <- lm(ffa~1,data=fatty.df)
> step(null.lm,scope=formula(fatty.lm),
direction="both")
Stepwise Regression for Evaporation data
> library(R330)
> data(evap.df)
> evap.lm <- lm(evap~.,data=evap.df)
> null.lm <- lm(evap~1,data=evap.df)
> step(null.lm,scope=formula(evap.lm),
direction="both")
Coefficients:
(Intercept) maxh maxat wind
70.53895 -0.32310 0.36375 0.01089
maxst avst
-0.48809 1.19629
Conclusions
Schematic for selection methods

[Diagram: the lattice of all subsets of four variables, from the null model {} at level 0, through the one-variable models 1, 2, 3, 4 and the two-variable models 12, 13, 14, 23, 24, 34, up to the full model 1234 at level 4.]
A cautionary example: Body fat
http://www.builtlean.com/
- 14 explanatory variables: Abdomen, Age, Ankle, Biceps, BMI, Chest, Forearm, Height, Hip, Knee, Neck, Thigh, Weight, Wrist
Body fat: Assume true model
Call:
lm(formula = PBF ~ Age + BMI + Neck + Chest + Abdomen
+ Hip + Thigh + Forearm + Wrist, data = body.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.58236 6.92503 0.517 0.6054
Age 0.06639 0.02783 2.385 0.0179 *
BMI 0.45691 0.23431 1.950 0.0524 .
Neck -0.38219 0.20718 -1.845 0.0663 .
Chest -0.16948 0.08977 -1.888 0.0603 .
Abdomen 0.79495 0.08233 9.656 < 2e-16 ***
Hip -0.26434 0.11641 -2.271 0.0241 *
Thigh 0.18358 0.12345 1.487 0.1383
Forearm 0.26670 0.18125 1.471 0.1425
Wrist -1.82934 0.45469 -4.023 7.73e-05 ***
---
Residual standard error: 3.92 on 236 degrees of freedom
Multiple R-squared: 0.7481, Adjusted R-squared: 0.7385
F-statistic: 77.87 on 9 and 236 DF, p-value: < 2.2e-16
Body fat: Simulation study
- Using R, generate 200 sets of data from this model, using the same Xs but new random errors each time.
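In outline (a hypothetical Python/numpy sketch, not the lecture's R code; the X matrix here is a random stand-in for body.df, while the coefficients and error SD are taken from the fitted model above):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 246, 9                    # 236 residual df + 10 estimated coefficients
X = rng.normal(size=(n, p))      # stand-in for the real explanatory variables
beta = np.array([0.066, 0.457, -0.382, -0.169, 0.795,
                 -0.264, 0.184, 0.267, -1.829])   # Age ... Wrist
sigma = 3.92                     # residual standard error of the true model
A = np.column_stack([np.ones(n), X])
coefs = np.array([3.582, *beta])

est = np.empty((200, p))
for s in range(200):             # same X, fresh errors each time
    y = A @ coefs + rng.normal(0.0, sigma, size=n)
    fit, *_ = np.linalg.lstsq(A, y, rcond=None)
    est[s] = fit[1:]
# est[:, j] holds 200 simulated estimates of the j-th coefficient
```

Histograms of the columns of est show how variable the estimates are from one simulated data set to the next; the cautionary point comes from then running the selection methods on each simulated set.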
[Figure: 15 histograms (frequency versus estimated coefficient, apparently one panel per regression coefficient) summarising the estimates over the 200 simulated data sets; only the axis labels survived extraction.]
- Various strategies:

[Figure: training and test R² values under the various strategies; vertical axis runs from about 0.3 to 0.9.]
Georgia voting data: Comparisons of the R² values

[Histogram: differences in R², with "Differences" on the x-axis and "Frequency" on the y-axis.]
What if there is no test set?
- If the data are too few to split into training and test sets, and we choose the model using all the data and compute the R², it will most likely be too big.
- The optimism is
  R²(SWR, data) − true R².
To correct the R²:
- Compute the empirical optimism.
> library(R330)
> ga.lm <- lm(undercount~.,data=ga.data)
> cross.val(ga.lm)
Cross-validated estimate of root
mean square prediction error = 243.0148
> err.boot(ga.lm)
$err
[1] 43944.99
$Err
[1] 54312.51
> sqrt(54312.51)
[1] 233.0504
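Here $err is the apparent (training-set) error and $Err is the optimism-corrected estimate of prediction error; taking the square root puts it on the same scale as the cross-validation figure (√54312.51 ≈ 233 versus 243). The correction idea, apparent error plus estimated optimism, can be sketched as follows (hypothetical Python/numpy, not the R330 algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 8
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)   # toy data; an overfit-prone model

def fit_predict(Xtr, ytr, Xte):
    """OLS with intercept: fit on (Xtr, ytr), predict at Xte."""
    A = np.column_stack([np.ones(len(Xtr)), Xtr])
    beta, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    return np.column_stack([np.ones(len(Xte)), Xte]) @ beta

def mse(y, yhat):
    return float(np.mean((y - yhat) ** 2))

err = mse(y, fit_predict(X, y, X))   # apparent error: fit and test on all data

# optimism: how much better a model does on the data it was fit to
opt = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)                  # bootstrap resample
    e_orig = mse(y, fit_predict(X[idx], y[idx], X))   # error on original data
    e_boot = mse(y[idx], fit_predict(X[idx], y[idx], X[idx]))  # on resample
    opt.append(e_orig - e_boot)
Err = err + np.mean(opt)             # corrected estimate, larger than err
```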
Thank you!