
Data Mining with Regression

Bob Stine, Department of Statistics, Wharton School, University of Pennsylvania


Topics for Today

Review from last time
  Missing data
  Stepwise regression
  Any questions or comments?
Deciding which variables improve a model
Growing regression models


Why use regression?

Regression is familiar
Familiarity allows improvements
Claim
  Regression is capable of matching the predictive performance of black-box models
  Just a question of having the right Xs
  Recognize and then fix problems
  Shares problems with black-box models
  Opportunity to appreciate what happens in models with less apparent structure
  See Foster and Stine (2004)



Modeling Question

How do we expand a regression model beyond the obvious variables to include those that are important but subtle?
Automate what is usually done by hand
  Iterative improvement: try a variable, diagnose, try another, diagnose
  Open the modeling process to allow a surprise
  Example: include interactions
Computing allows a more expansive search
  Transformations, combinations (e.g. ratios), bundles (e.g. principal components)
  Magnified scope also magnifies problems


Review ANES Example

Start with a simple regression, then expand to multiple regression
  Post-FT Obama on Pre-FT Obama
  Add Happy/Sad and Care Who Wins
  Include an interaction effect
Calibration
  Being right on average: avg(Y | Ŷ) = Ŷ
Interpreting the model
  What terms are significant?
  What does the interaction mean? Show what the interaction does
  Visual exploration (e.g. profiling) as a supplement to numerical summaries
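As a rough R sketch of the model this slide describes (the data frame anes and the column names postFT, preFT, and happy are illustrative assumptions, not the actual ANES variable names):

  # Minimal sketch: main effects plus an interaction (assumed names)
  fit <- lm(postFT ~ preFT * happy, data = anes)
  summary(fit)                                   # which terms are significant?

  # Calibration check: a calibrated model is "right on average", so regressing
  # Y on its fitted values should give intercept near 0 and slope near 1
  calib <- lm(postFT ~ fitted(fit), data = model.frame(fit))
  coef(calib)

A profiler-style picture of the interaction (predictions across preFT at each level of happy) can then be drawn by calling predict() on a small grid of values.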

Medical Example

Numerical response: diagnosing the severity of osteoporosis
  Brittle bones due to loss of calcium
  Leads to fractures and subsequent complications
  Personal interest
Response
  X-ray measurement of bone density
  Standardized so that normal is N(0,1)
  Possible to avoid the expense of the x-ray (triage)?
Explanatory variables
  Data set designed by committee: doctors, biochemists, epidemiologists

Osteo Data

Sample of postmenopausal women
  Nursing homes: dependence?
  Missing data
  Measurement error
  1,232 women with 126 columns: ideal data?
Marginal distributions
  X-ray scores (zHip), weight, age, ...

Initial Osteo Model

Simple regression
  zHip on which variable? How would you decide?
    Pick the largest correlation (see the R sketch below)
    Consult the science
  Impact of weight
  Interpretation?
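One way the "pick the largest correlation" step might look in R; the data frame osteo and the column weight are assumptions for illustration, while zHip follows the slides:

  # Rank the candidate predictors by their correlation with zHip
  num <- sapply(osteo, is.numeric)
  r   <- cor(osteo[num], use = "pairwise.complete.obs")["zHip", ]
  head(sort(abs(r), decreasing = TRUE))   # zHip itself tops the list with r = 1

  # Simple regression on the chosen variable (weight, per the slides)
  fit1 <- lm(zHip ~ weight, data = osteo)
  summary(fit1)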

Expanding Model

What to add next?
  Residual analysis
Add them all?
  Add the others and see what sticks
  Singularities mean redundant combinations
Summary of fit
  Big R², but what happened to the rest of the data?
  Omits many possibly important effects

Missing Data

Fit changes when you add variables
  Collinearity among the explanatory variables
  Different subsets of cases
What to do about the missing cases?
  Exclude: listwise deletion, pairwise deletion
  Impute: fill them in, perhaps several times
Imputation relies on the missing data resembling those that are included
  Real data are seldom (if ever) missing at random

Handle Missing Data

Simple, reduced-assumption approach
  Add another variable: an indicator column for missing values
  Fill the missing value with the average of those seen (R sketch below)
Part of the modeling process
  Expands the domain of the feature search
  Allows missing cases to behave differently
Conservative evaluation of a variable
  Distinguishes the missing subset only if it is predictive
Categorical variables: not a problem
  Missing forms another category
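A minimal R sketch of this fill-in for one numeric variable (the function name is mine, not from the course materials):

  # Mean-fill a numeric column and add a 0/1 missingness indicator
  fill_with_indicator <- function(x) {
    miss <- is.na(x)
    x[miss] <- mean(x, na.rm = TRUE)          # average of the observed cases
    data.frame(filled = x, missing = as.numeric(miss))
  }

  fill_with_indicator(c(2, NA, 5, 7, NA))     # tiny example

Both new columns then enter the regression, so the missing cases are allowed to sit at a different level than the filled-in mean.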

Example of Procedure

Simple regression illustration
  Conservative: unbiased estimate, inflated SE
  n = 100, β₀ = 0, β₁ = 3
  30% missing at random (the missing cases follow the same line, β₁ = 3)
  [Scatterplot of the simulated data with the fitted line]

               b0      b1
  Complete    Est  -0.25    3.05
              SE    1.0     0.17
  Filled In   Est  -1.5     3.01
              SE    1.4     0.27
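A quick simulation in the spirit of this slide (my own sketch, not the code behind the numbers above; the noise SD is an assumption chosen to give SEs of roughly this size):

  set.seed(1)
  n <- 100
  x <- runif(n, -10, 10)
  y <- 0 + 3 * x + rnorm(n, sd = 10)           # beta0 = 0, beta1 = 3
  xm <- x
  xm[sample(n, 0.3 * n)] <- NA                 # 30% missing at random

  complete <- lm(y ~ x)                        # benchmark fit, nothing missing
  fi       <- fill_with_indicator(xm)          # mean-filled x plus its indicator
  filled   <- lm(y ~ filled + missing, data = data.frame(y = y, fi))
  summary(complete)$coefficients
  summary(filled)$coefficients                 # slope still near 3, SE inflated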

Example of Procedure

Simple regression illustration
  Conservative: unbiased estimate, inflated SE
  n = 100, β₀ = 0, β₁ = 3
  30% missing that follow a steeper line
  Requires a robust variance estimate
  [Scatterplot of the simulated data with the fitted line]

               b0      b1
  Filled In   Est   2.8     2.89
              SE    2.1     0.41
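One common choice of robust variance estimate is the White/sandwich estimator. A sketch using the sandwich and lmtest packages, applied to the hypothetical filled fit from the previous sketch:

  # Heteroskedasticity-robust (sandwich) standard errors
  library(sandwich)
  library(lmtest)
  coeftest(filled, vcov = vcovHC(filled, type = "HC1"))

The robust standard errors protect against the kind of misspecification created when the filled-in cases follow a different line than the observed ones.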

Notes on Procedure

Been around for a long time
Well suited to data mining, when one needs to search for predictive features
Reference
  Paul Allison's Sage monograph on Missing Data (Sage #136, 2002)
  For a critical view, see Jones, M. P. (1996), J. Amer. Statist. Assoc., 91, 222-230.
  He is not too fond of the approach, but his critique assumes the missing data are missing at random.

Expanded Osteo Data

Fill in the missing data
  Grows from 126 to 217 columns
  Do in R (see the sketch below)
Saturated model results
  Uses the full sample, but so few significant effects
  Still missing the interactions
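A sketch of this expansion in R, in the same spirit as the fill_with_indicator helper above (the helper name and the data frame osteo are my own; the exact column count after expansion depends on how many variables actually contain missing values):

  # Mean-fill every numeric column with NAs and add a matching indicator
  expand_missing <- function(df) {
    for (nm in names(df)) {
      if (is.numeric(df[[nm]]) && anyNA(df[[nm]])) {
        df[[paste0(nm, ".miss")]] <- as.numeric(is.na(df[[nm]]))
        df[[nm]][is.na(df[[nm]])] <- mean(df[[nm]], na.rm = TRUE)
      }
    }
    df
  }

  osteo2 <- expand_missing(osteo)        # 126 columns grow to roughly 217
  sat    <- lm(zHip ~ ., data = osteo2)  # saturated model on the filled-in data
  summary(sat)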

Stepwise Regression

Need a better approach
  Cannot always fit the saturated model
  The saturated model still excludes variables that might be useful
Mimic the manual procedure: greedy search
  Find the variable that improves the current model the most
  Add it as another explanatory variable if the improvement is significant
Common in data mining with many possible Xs
  One step ahead, not all possible models
  Requires caution to use effectively
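In R, a greedy forward search can be sketched with step(); note that base R's step() ranks candidates by AIC rather than a p-value threshold, so this only approximates the procedure described on the slide. The object osteo2 continues the hypothetical data from the earlier sketch:

  # Forward search from the null model toward the saturated scope
  null  <- lm(zHip ~ 1, data = osteo2)
  upper <- reformulate(setdiff(names(osteo2), "zHip"), response = "zHip")
  fwd   <- step(null, scope = upper, direction = "forward", trace = 0)
  summary(fwd)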

Stepwise Example

Predict the stock market
  Response: daily returns on the S&P 500 in late 2011
  Goal: predict November using August-October
Explanatory variables
  Technical trading rules based on past properties of the market
  Cup-and-handle example

Forward Stepwise

Allow all possible interactions: 90 possible terms
  Start with the 12 rules
  Add 12 squares
  Add 12*11/2 = 66 interactions
  Principle of marginality
  Response surface in JMP (the same scope is sketched in R below)
Forward search
  Greedy search says to add the most predictive term
  Problem: when do we stop?
Use statistical significance
  What threshold for the p-value?
  Suppose we follow convention and set α = 0.05?
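A sketch of how that 90-term scope could be written as an R formula; the data frame mkt (holding the daily return ret and the 12 rule columns) is a placeholder for the actual data:

  # 12 linear rules, their 66 pairwise interactions, and 12 squared terms
  xnames <- setdiff(names(mkt), "ret")
  main   <- paste(xnames, collapse = " + ")
  sq     <- paste0("I(", xnames, "^2)", collapse = " + ")
  scope  <- as.formula(paste("ret ~ (", main, ")^2 +", sq))   # 12 + 66 + 12 = 90 terms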

Stepwise Model

P-value threshold
  Set to 0.05, as is common in modeling
  Let the search run until no candidate meets this threshold
Model chosen
  15 variables, some of them just components of interactions

Judging the Model

Compare claimed to actual performance
  Claimed fit: R² = 62% with RMSE = 0.0165
  How well does it predict November?
  The SD of the prediction errors is twice as large as the model claimed
What went wrong?
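The honest comparison is out of sample. A sketch, assuming the chosen stepwise model is sw_fit and the held-out November data sit in mktNov (both names are placeholders):

  # In-sample claim vs. out-of-sample reality
  claimed <- summary(sw_fit)$sigma                      # about 0.0165 per the slide
  errNov  <- mktNov$ret - predict(sw_fit, newdata = mktNov)
  c(claimed = claimed, actual = sd(errNov))             # here, actual ran about twice the claim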

Explanation

Why did the stepwise search get this so wrong?
  Examine the definition of the technical trading rules used in the model
Problem is a classic example of over-fitting (Tukey)
  Optimization capitalizes on chance
Problem is not with stepwise
  Rather, it lies with our use of classical statistics
  α = 0.05 is intended for one test, not 90

How to get it right?

Three approaches
  Avoid stepwise altogether
  Reserve a validation sample (cross-validation)
  Use a higher threshold for selection
Bonferroni rule (sketched in R below)
  Set the p-value threshold based on the scope of the search
  Searching 90 variables? Then set the threshold to 0.05/90 ≈ 0.00056
Result of the stepwise search?
  Bonferroni gets it right: nothing is added to the model!
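A sketch of forward selection with a Bonferroni p-to-enter, using add1() so the decision is driven by p-values rather than AIC; mkt and scope continue the hypothetical objects from the earlier sketches:

  # Forward selection with p-to-enter = 0.05 / (number of candidate terms)
  p_enter <- 0.05 / 90                                        # about 0.00056
  current <- lm(ret ~ 1, data = mkt)                          # start from the null model
  repeat {
    cand <- add1(current, scope = scope, test = "F")[-1, ]    # F-test for each candidate
    best <- which.min(cand[["Pr(>F)"]])
    if (cand[["Pr(>F)"]][best] > p_enter) break               # nothing clears the bar
    current <- update(current, as.formula(paste(". ~ . +", rownames(cand)[best])))
  }
  current   # on the slides' data, nothing cleared the bar: the null model survives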

Take-Aways

Missing data
  Fill in, with an added indicator for missingness
Over-fitting
  The model includes things that appear to predict the response but in fact do not
Stepwise regression
  Greedy forward search for features that mimics what we do manually when modeling
  Expansive scope that includes interactions
  Bonferroni: set p-to-enter = 0.05 / (# of possible terms)

Assignment

Missing data
  What do you do with them now?
Try doing stepwise regression with your own software
  Does your software offer robust variance estimates (aka White or sandwich estimates)?
Take a look at the ANES data

Next Time

Review of over-fitting
  What it is and why it matters
  Role of Bonferroni
Other approaches to avoiding over-fitting
  Model selection criteria: AIC, BIC, cross-validation
  Shrinkage
