
Data Mining with Regression

Bob Stine, Department of Statistics, Wharton School, University of Pennsylvania


Topics for Today

Review from last time
  Missing data
  Stepwise regression
  Any questions or comments?
Deciding which variables improve a model
Growing regression models


Why use regression?

Regression is familiar
Familiarity allows improvements
Claim
  Regression is capable of matching the predictive performance of black-box models
  Just a question of having the right Xs
  Recognize and then fix problems
  Shares problems with black-box models
  Opportunity to appreciate what happens in models with less apparent structure
  See Foster and Stine (2004)



Modeling Question

How do we expand a regression model beyond the obvious variables to include those that are important but subtle?
Automate what is usually done by hand
  Iterative improvement: try a variable, diagnose, try another, diagnose
  Open the modeling process to allow a surprise
  Example: include interactions
Computing allows a more expansive search
  Transformations, combinations (e.g. ratios), bundles (e.g. principal components)
  Magnified scope also magnifies problems


Review ANES Example

Start with a simple regression, then expand to multiple regression
  Post-FT Obama on Pre-FT Obama
  Add Happy/Sad and Care Who Wins
  Include an interaction effect
Calibration
  Being right on average: avg(Y | Ŷ) = Ŷ
Interpreting the model
  What terms are significant?
  What does the interaction mean? Show what the interaction does
  Visual exploration (e.g. profiling) as a supplement to numerical summaries
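As a rough R sketch of the model this slide describes (the data frame anes and the column names postFT, preFT, and happy are illustrative assumptions, not the actual ANES variable names):

  # Minimal sketch: main effects plus an interaction (assumed names)
  fit <- lm(postFT ~ preFT * happy, data = anes)
  summary(fit)                                   # which terms are significant?

  # Calibration check: a calibrated model is "right on average", so regressing
  # Y on its fitted values should give intercept near 0 and slope near 1
  calib <- lm(postFT ~ fitted(fit), data = model.frame(fit))
  coef(calib)

A profiler-style picture of the interaction (predictions across preFT at each level of happy) can then be drawn by calling predict() on a small grid of values.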

Medical Example

Numerical response: diagnosing the severity of osteoporosis
  Brittle bones due to loss of calcium
  Leads to fractures and subsequent complications
  Personal interest
Response
  X-ray measurement of bone density
  Standardized so that normal is N(0,1)
  Possible to avoid the expense of the x-ray (triage)?
Explanatory variables
  Data set designed by committee: doctors, biochemists, epidemiologists

Osteo Data

Sample of postmenopausal women
  Nursing homes: dependence?
  Missing data
  Measurement error
  1,232 women with 126 columns: ideal data?
Marginal distributions
  X-ray scores (zHip), weight, age, ...

Initial Osteo Model

Simple regression
  zHip on which variable? How would you decide?
    Pick the largest correlation (see the R sketch below)
    Consult the science
  Impact of weight
  Interpretation?
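One way the "pick the largest correlation" step might look in R; the data frame osteo and the column weight are assumptions for illustration, while zHip follows the slides:

  # Rank the candidate predictors by their correlation with zHip
  num <- sapply(osteo, is.numeric)
  r   <- cor(osteo[num], use = "pairwise.complete.obs")["zHip", ]
  head(sort(abs(r), decreasing = TRUE))   # zHip itself tops the list with r = 1

  # Simple regression on the chosen variable (weight, per the slides)
  fit1 <- lm(zHip ~ weight, data = osteo)
  summary(fit1)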

Expanding Model

What to add next?
  Residual analysis
Add them all?
  Add the others and see what sticks
  Singularities mean redundant combinations
Summary of fit
  Big R², but what happened to the rest of the data?
  Omits many possibly important effects

Missing Data

Fit changes when you add variables
  Collinearity among the explanatory variables
  Different subsets of cases
What to do about the missing cases?
  Exclude: listwise deletion, pairwise deletion
  Impute: fill them in, perhaps several times
Imputation relies on the missing data resembling those that are included
  Real data are seldom (if ever) missing at random

Handle Missing Data

Simple, reduced-assumption approach
  Add another variable: an indicator column for missing values
  Fill the missing value with the average of those seen (R sketch below)
Part of the modeling process
  Expands the domain of the feature search
  Allows missing cases to behave differently
Conservative evaluation of a variable
  Distinguishes the missing subset only if it is predictive
Categorical variables: not a problem
  Missing forms another category
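A minimal R sketch of this fill-in for one numeric variable (the function name is mine, not from the course materials):

  # Mean-fill a numeric column and add a 0/1 missingness indicator
  fill_with_indicator <- function(x) {
    miss <- is.na(x)
    x[miss] <- mean(x, na.rm = TRUE)          # average of the observed cases
    data.frame(filled = x, missing = as.numeric(miss))
  }

  fill_with_indicator(c(2, NA, 5, 7, NA))     # tiny example

Both new columns then enter the regression, so the missing cases are allowed to sit at a different level than the filled-in mean.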

Example of Procedure

Simple regression illustration
  Conservative: unbiased estimate, inflated SE
  n = 100, β₀ = 0, β₁ = 3
  30% missing at random (the missing cases follow the same line, β₁ = 3)
  [Scatterplot of the simulated data with the fitted line]

               b0      b1
  Complete    Est  -0.25    3.05
              SE    1.0     0.17
  Filled In   Est  -1.5     3.01
              SE    1.4     0.27
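A quick simulation in the spirit of this slide (my own sketch, not the code behind the numbers above; the noise SD is an assumption chosen to give SEs of roughly this size):

  set.seed(1)
  n <- 100
  x <- runif(n, -10, 10)
  y <- 0 + 3 * x + rnorm(n, sd = 10)           # beta0 = 0, beta1 = 3
  xm <- x
  xm[sample(n, 0.3 * n)] <- NA                 # 30% missing at random

  complete <- lm(y ~ x)                        # benchmark fit, nothing missing
  fi       <- fill_with_indicator(xm)          # mean-filled x plus its indicator
  filled   <- lm(y ~ filled + missing, data = data.frame(y = y, fi))
  summary(complete)$coefficients
  summary(filled)$coefficients                 # slope still near 3, SE inflated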

Example of Procedure

Simple regression illustration
  Conservative: unbiased estimate, inflated SE
  n = 100, β₀ = 0, β₁ = 3
  30% missing that follow a steeper line
  Requires a robust variance estimate
  [Scatterplot of the simulated data with the fitted line]

               b0      b1
  Filled In   Est   2.8     2.89
              SE    2.1     0.41
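One common choice of robust variance estimate is the White/sandwich estimator. A sketch using the sandwich and lmtest packages, applied to the hypothetical filled fit from the previous sketch:

  # Heteroskedasticity-robust (sandwich) standard errors
  library(sandwich)
  library(lmtest)
  coeftest(filled, vcov = vcovHC(filled, type = "HC1"))

The robust standard errors protect against the kind of misspecification created when the filled-in cases follow a different line than the observed ones.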

Notes on Procedure

Been around for a long time
Well suited to data mining, when one needs to search for predictive features
Reference
  Paul Allison's Sage monograph on Missing Data (Sage #136, 2002)
  For a critical view, see Jones, M. P. (1996), J. Amer. Statist. Assoc., 91, 222-230.
  He is not too fond of the approach, but his critique assumes the missing data are missing at random.

Expanded Osteo Data

Fill in the missing data
  Grows from 126 to 217 columns
  Do in R (see the sketch below)
Saturated model results
  Uses the full sample, but so few significant effects
  Still missing the interactions
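A sketch of this expansion in R, in the same spirit as the fill_with_indicator helper above (the helper name and the data frame osteo are my own; the exact column count after expansion depends on how many variables actually contain missing values):

  # Mean-fill every numeric column with NAs and add a matching indicator
  expand_missing <- function(df) {
    for (nm in names(df)) {
      if (is.numeric(df[[nm]]) && anyNA(df[[nm]])) {
        df[[paste0(nm, ".miss")]] <- as.numeric(is.na(df[[nm]]))
        df[[nm]][is.na(df[[nm]])] <- mean(df[[nm]], na.rm = TRUE)
      }
    }
    df
  }

  osteo2 <- expand_missing(osteo)        # 126 columns grow to roughly 217
  sat    <- lm(zHip ~ ., data = osteo2)  # saturated model on the filled-in data
  summary(sat)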

Stepwise Regression

Need a better approach
  Cannot always fit the saturated model
  The saturated model still excludes variables that might be useful
Mimic the manual procedure: greedy search
  Find the variable that improves the current model the most
  Add it as another explanatory variable if the improvement is significant
Common in data mining with many possible Xs
  One step ahead, not all possible models
  Requires caution to use effectively
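In R, a greedy forward search can be sketched with step(); note that base R's step() ranks candidates by AIC rather than a p-value threshold, so this only approximates the procedure described on the slide. The object osteo2 continues the hypothetical data from the earlier sketch:

  # Forward search from the null model toward the saturated scope
  null  <- lm(zHip ~ 1, data = osteo2)
  upper <- reformulate(setdiff(names(osteo2), "zHip"), response = "zHip")
  fwd   <- step(null, scope = upper, direction = "forward", trace = 0)
  summary(fwd)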

Stepwise Example

Predict the stock market
  Response: daily returns on the S&P 500 in late 2011
  Goal: predict November using August-October
Explanatory variables
  Technical trading rules based on past properties of the market
  Cup-and-handle example

Forward Stepwise

Allow all possible interactions: 90 possible terms
  Start with the 12 rules
  Add 12 squares
  Add 12*11/2 = 66 interactions
  Principle of marginality
  Response surface in JMP (the same scope is sketched in R below)
Forward search
  Greedy search says to add the most predictive term
  Problem: when do we stop?
Use statistical significance
  What threshold for the p-value?
  Suppose we follow convention and set α = 0.05?
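A sketch of how that 90-term scope could be written as an R formula; the data frame mkt (holding the daily return ret and the 12 rule columns) is a placeholder for the actual data:

  # 12 linear rules, their 66 pairwise interactions, and 12 squared terms
  xnames <- setdiff(names(mkt), "ret")
  main   <- paste(xnames, collapse = " + ")
  sq     <- paste0("I(", xnames, "^2)", collapse = " + ")
  scope  <- as.formula(paste("ret ~ (", main, ")^2 +", sq))   # 12 + 66 + 12 = 90 terms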

Stepwise Model

P-value threshold
  Set to 0.05, as is common in modeling
  Let the search run until no candidate meets this threshold
Model chosen
  15 variables, some of them just components of interactions

Judging the Model

Compare claimed to actual performance
  Claimed fit: R² = 62% with RMSE = 0.0165
  How well does it predict November?
  The SD of the prediction errors is twice as large as the model claimed
What went wrong?
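The honest comparison is out of sample. A sketch, assuming the chosen stepwise model is sw_fit and the held-out November data sit in mktNov (both names are placeholders):

  # In-sample claim vs. out-of-sample reality
  claimed <- summary(sw_fit)$sigma                      # about 0.0165 per the slide
  errNov  <- mktNov$ret - predict(sw_fit, newdata = mktNov)
  c(claimed = claimed, actual = sd(errNov))             # here, actual ran about twice the claim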

Explanation

Why did the stepwise search get this so wrong?
  Examine the definition of the technical trading rules used in the model
Problem is a classic example of over-fitting (Tukey)
  Optimization capitalizes on chance
Problem is not with stepwise
  Rather, it lies with our use of classical statistics
  α = 0.05 is intended for one test, not 90

How to get it right?

Three approaches
  Avoid stepwise altogether
  Reserve a validation sample (cross-validation)
  Use a higher threshold for selection
Bonferroni rule (sketched in R below)
  Set the p-value threshold based on the scope of the search
  Searching 90 variables? Then set the threshold to 0.05/90 ≈ 0.00056
Result of the stepwise search?
  Bonferroni gets it right: nothing is added to the model!
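A sketch of forward selection with a Bonferroni p-to-enter, using add1() so the decision is driven by p-values rather than AIC; mkt and scope continue the hypothetical objects from the earlier sketches:

  # Forward selection with p-to-enter = 0.05 / (number of candidate terms)
  p_enter <- 0.05 / 90                                        # about 0.00056
  current <- lm(ret ~ 1, data = mkt)                          # start from the null model
  repeat {
    cand <- add1(current, scope = scope, test = "F")[-1, ]    # F-test for each candidate
    best <- which.min(cand[["Pr(>F)"]])
    if (cand[["Pr(>F)"]][best] > p_enter) break               # nothing clears the bar
    current <- update(current, as.formula(paste(". ~ . +", rownames(cand)[best])))
  }
  current   # on the slides' data, nothing cleared the bar: the null model survives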

Take-Aways

Missing data
  Fill in, with an added indicator for missingness
Over-fitting
  The model includes things that appear to predict the response but in fact do not
Stepwise regression
  Greedy forward search for features that mimics what we do manually when modeling
  Expansive scope that includes interactions
  Bonferroni: set p-to-enter = 0.05 / (# of possible terms)

Assignment

Missing data
  What do you do with them now?
Try doing stepwise regression with your own software
  Does your software offer robust variance estimates (aka White or sandwich estimates)?
Take a look at the ANES data

Next Time

Review of over-fitting
  What it is and why it matters
  Role of Bonferroni
Other approaches to avoiding over-fitting
  Model selection criteria: AIC, BIC, cross-validation
  Shrinkage
