Wharton
Department of Statistics
Regression is familiar
Modeling Question
beyond obvious variables to include those that are important but subtle?
Iterative improvement
Try variable, diagnose, try another, diagnose
Open modeling process to allow a surprise
Example: Include interactions
Magnified scope also magnifies problems
avg(Y|)=
Interpreting model
profiling
Medical Example
Brittle bones due to loss of calcium
Leads to fractures and subsequent complications
Personal interest
X-ray measurement of bone density
Standardized to N(0,1) for normal
Possible to avoid expense of x-ray, triage?
Data set designed by committee
doctors, biochemists, epidemiologists
Explanatory variables
Osteo Data
Marginal distributions
Simple regression
Impact of weight
Interpretation?
Expanding Model
Residual analysis
Add others and see what sticks
Singularities mean redundant combinations
Summary of fit
Big R² but what happened to rest of data?
Omits many possible important effects
Missing Data
Simple, reduced assumption approach
Part of the modeling process
Categorical: not a problem
Example of Procedure
Conservative: unbiased estimate, inflated SE
n = 100, β0 = 0, β1 = 3
30% missing at random, β1 = 3
                 Est     SE
Complete   b0   -0.25   1.0
           b1    3.05   0.17
Filled In  b0   -1.5    1.4
           b1    3.01   0.27

[Scatterplot: vertical axis from -40 to 40, horizontal axis from -10 to 10]
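The fill-in procedure can be sketched in a few lines. This is a minimal illustration, not the code behind the slide: it assumes the missing values are replaced by the observed mean and that a 0/1 indicator for missingness is added as an extra column. The helper names (`fill_in_with_indicator`, `ols`), the random seed, and the spread of x and of the noise are assumptions chosen only for illustration.

```python
import numpy as np

def fill_in_with_indicator(x):
    """Replace NaN by the observed mean and return a 0/1 missingness indicator."""
    missing = np.isnan(x)
    filled = np.where(missing, np.nanmean(x), x)
    return filled, missing.astype(float)

def ols(X, y):
    """Least squares coefficients and standard errors (X includes the constant)."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s2 = resid @ resid / (n - k)              # residual variance
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, se

rng = np.random.default_rng(0)                # seed is arbitrary
n = 100
x = rng.normal(0, 4, n)                       # spread of x is an assumption
y = 0 + 3 * x + rng.normal(0, 10, n)          # beta0 = 0, beta1 = 3; noise sd assumed

x_obs = x.copy()
x_obs[rng.random(n) < 0.3] = np.nan           # roughly 30% missing at random

x_fill, miss = fill_in_with_indicator(x_obs)
b_full, se_full = ols(np.column_stack([np.ones(n), x]), y)
b_fill, se_fill = ols(np.column_stack([np.ones(n), x_fill, miss]), y)

print("complete  est:", b_full, "se:", se_full)
print("filled in est:", b_fill[:2], "se:", se_fill[:2])
```

With a single explanatory variable, the intercept and slope from the filled-in fit coincide with those from fitting only the observed cases, because the indicator absorbs the mean of the filled-in rows. The standard errors differ because the residual variance is pooled over all rows, including those whose x was filled in.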
Example of Procedure
Conservative: unbiased estimate, inflated SE
n = 100, β0 = 0, β1 = 3
30% missing that follow steeper line
                 SE
Filled In  b0    2.1
           b1    0.41

[Scatterplot: vertical axis from -40 to 40, horizontal axis from -10 to 10]
Notes on Procedure
Well suited to data mining, when you need to search for predictive features
Paul Allison's Sage monograph on Missing Data (Sage #136, 2002)
He's not too fond, but he thinks missing data are missing at random
J. Amer. Statist. Assoc., 91, 222-230
Stepwise Regression
Cannot always fit the saturated model
Saturated excludes vars that might be useful
Find variable that improves the current model the most
Add it as another explanatory variable if the improvement is significant
Common in data mining with many possible Xs
One step ahead, not all possible models
Requires caution to use effectively
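The greedy search described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the software used for the lecture: the function names and the simulated data are hypothetical, and the p-value uses a normal approximation to the t distribution to keep the example free of extra dependencies.

```python
import math
import numpy as np

def t_stat_for_new_column(X, y, col):
    """t-statistic of `col` when it is added to the current design matrix X."""
    Z = np.column_stack([X, col])
    n, k = Z.shape
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    s2 = resid @ resid / (n - k)
    se = math.sqrt(s2 * np.linalg.inv(Z.T @ Z)[-1, -1])
    return coef[-1] / se

def two_sided_p(t):
    """Two-sided p-value; normal approximation to the t distribution (assumption)."""
    return math.erfc(abs(t) / math.sqrt(2))

def forward_stepwise(candidates, y, p_to_enter):
    """Add the most predictive candidate one step at a time until none qualifies."""
    n = len(y)
    X = np.ones((n, 1))                        # start from the constant-only model
    chosen, remaining = [], list(range(candidates.shape[1]))
    while remaining and X.shape[1] < n - 2:
        pvals = {j: two_sided_p(t_stat_for_new_column(X, y, candidates[:, j]))
                 for j in remaining}
        best = min(pvals, key=pvals.get)       # smallest p-value = largest improvement
        if pvals[best] > p_to_enter:
            break                              # nothing left meets the threshold
        X = np.column_stack([X, candidates[:, best]])
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Simulated data (hypothetical): 90 candidates, only column 0 matters.
rng = np.random.default_rng(1)
n, m = 200, 90
candidates = rng.normal(size=(n, m))
y = 2.0 * candidates[:, 0] + rng.normal(size=n)

chosen_naive = forward_stepwise(candidates, y, p_to_enter=0.05)
chosen_bonf = forward_stepwise(candidates, y, p_to_enter=0.05 / m)
print("p-to-enter 0.05:     ", chosen_naive)
print("p-to-enter 0.05 / 90:", chosen_bonf)
```

Both runs follow the same greedy path, so the stricter threshold can only stop earlier. Any extra columns picked up at 0.05 in this simulation are noise by construction.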
Stepwise Example
Daily returns on the S&P 500 in late 2011
Predict November using August-October
Technical trading rules based on past properties of the market
Cup-and-handle example
Forward Stepwise
Forward search
Greedy search says to add most predictive
Problem is when to stop?
What threshold for the p-value?
Suppose we follow convention and set α = 0.05?
Stepwise Model
Set α to 0.05 as in common modeling
Let run until none meet this threshold
15 variables, some just interaction components
Explanation
Rather it lies with our use of classical statistics
α = 0.05 intended for one test, not 90
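A short calculation shows why a threshold meant for one test misleads when 90 candidates are screened. The sketch assumes, for illustration only, that all 90 candidates are pure noise and that the tests are independent. Under those assumptions about 4.5 candidates would pass at 0.05 by chance alone. The variable names are hypothetical.

```python
alpha = 0.05   # conventional threshold for a single test
m = 90         # number of candidate features screened

# Expected number of noise features that pass the threshold by chance.
expected_false_entries = alpha * m

# Chance that at least one noise feature passes (assumes independent tests).
prob_at_least_one = 1 - (1 - alpha) ** m

# Bonferroni: divide the threshold by the number of possible features.
p_to_enter = alpha / m

print("expected false entries:", expected_false_entries)
print("P(at least one false entry):", round(prob_at_least_one, 3))
print("Bonferroni p-to-enter:", p_to_enter)
```

Dividing the threshold by the number of candidates keeps the chance of any false entry at or below 0.05 whether or not the tests are independent.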
Take-Aways
Missing data
Fill in with an added indicator for missingness
Over-fitting
Model includes things that appear to predict the response but in fact do not
Stepwise regression
Greedy forward search for features that mimics what we do manually when modeling
Expansive scope that includes interactions
Bonferroni: Set p-to-enter = 0.05/(# possible)
Assignment
Next Time
Review of over-fitting
What it is and why it matters
Role of Bonferroni
Model selection criteria: AIC, BIC, Cross-validation
Shrinkage