Você está na página 1de 42

?

Lecture 8. Build the MLR: Diagnostics & Model Selection
IENG 314
Feng Yang
West Virginia University
8-1 Diagnostics: Residual Analysis
As in Lecture 5, many of the regression diagnostics are based on a
graphical examination of the residuals. Thus, it is useful to derive
some properties of the vectors of residuals and fitted values.

The vector of fitted values is $\hat{y} = Hy$, with $H = X(X'X)^{-1}X'$.

The hat matrix H is an $n \times n$ matrix with elements $h_{ij}$; we have

$\hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j, \quad i = 1, 2, \ldots, n$

Using linear algebra, it can be shown that $\sum_{i=1}^{n} h_{ii} = p + 1$.
8-1 Diagnostics: Residual Analysis
The residual vector is given by $e = y - \hat{y} = y - Hy = (I - H)y$,
where I is the identity matrix of order n.

It can be shown that the e_i are individually normally distributed with zero means and

$\mathrm{Var}[e_i] = \sigma^2 (1 - h_{ii})$

Therefore, $SE[e_i] = s\sqrt{1 - h_{ii}}$, where s is the estimate of $\sigma$.

Standardized residuals are given by:

$e_i^* = \frac{e_i}{SE[e_i]} = \frac{e_i}{s\sqrt{1 - h_{ii}}}$
If $|e_i^*| > 2$, the corresponding observation may be regarded as an outlier.
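These quantities can be computed directly with basic linear algebra. Below is a minimal sketch in Python/NumPy (the course uses JMP, so this is purely illustrative; the data values are made up):

```python
import numpy as np

# Made-up data: n = 6 observations, two predictor variables
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.0])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])
n, k = X.shape                       # k = p + 1 parameters

# Hat matrix H = X (X'X)^{-1} X'; its diagonal gives the leverages h_ii
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
print(h.sum())                       # trace of H equals p + 1 = 3

# Residuals e = (I - H) y and standardized residuals e* = e / (s * sqrt(1 - h_ii))
e = y - H @ y
s = np.sqrt(e @ e / (n - k))         # estimate of sigma
e_star = e / (s * np.sqrt(1 - h))
print(np.where(np.abs(e_star) > 2))  # flag potential outliers
```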
8-1 Diagnostics: Residual Analysis
Similar residual plots as those discussed in Lecture 5 can be made
to evaluate the goodness of fit of the model and to check if the
underlying assumptions are met.

Checking model assumptions
(1) Checking for the adequacy of the functional form of the model
(2) Checking for constant variance
(3) Checking for normality
(4) Checking for independence
8-1 Diagnostics: Residual Analysis
Residual plots for regression diagnostics
(1) Plots of the residuals against individual predictor variables:
These are used to check linearity w.r.t. the xj's.
Each plot should be random.
A systematic pattern indicates the necessity of adding
nonlinear terms in the corresponding xj to the model.
(2) A plot of the residuals against the fitted values: This is used to
check the assumption of constant variance.
Variability in the residuals ei should be roughly constant
over the range of the fitted response values.
A systematic pattern (e.g., a funnel shape) indicates non-constant
variance and suggests a variance-stabilizing transformation of y.
8-1 Diagnostics: Residual Analysis
Residual plots for regression diagnostics
(3) A normal plot of the residuals: This is used to check the
normality assumption.
(4) A run chart of the residuals: If the data are obtained
sequentially over time, this plot should be used to check
whether the random errors are autocorrelated and/or if there
are any time trends.
(5) Plots of the residuals against any omitted predictor
variables
This is used to check if any of the omitted predictor
variables should be included in the model. If a plot
shows a systematic pattern, then inclusion of that
omitted variable in the model is indicated.
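For illustration, the following Python sketch (using matplotlib; the course produces these plots in JMP) generates made-up data, fits a model by least squares, and draws several of the plots listed above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)

# Made-up data for illustration only
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2 + 1.5 * x1 - 0.8 * x2 + rng.normal(0, 1, n)

# Least squares fit of y on x1 and x2
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

fig, ax = plt.subplots(2, 2, figsize=(9, 7))
ax[0, 0].scatter(x1, resid)              # residuals vs a predictor: check linearity
ax[0, 0].set_title("residuals vs x1")
ax[0, 1].scatter(fitted, resid)          # residuals vs fitted: check constant variance
ax[0, 1].set_title("residuals vs fitted")
stats.probplot(resid, plot=ax[1, 0])     # normal plot: check normality
ax[1, 1].plot(resid, marker="o")         # run chart: check independence over time
ax[1, 1].set_title("run chart of residuals")
plt.tight_layout()
plt.show()
```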
8-1 Diagnostics: Data Transformation
As discussed in Lecture 5, transformations of the variables (both
y and the x's) are often necessary to satisfy the assumptions of
linearity, normality, and constant error variance.
Many seemingly nonlinear models can be written in the multiple
linear regression model form after making a suitable
transformation. For example, the model
$y = \beta_0 x_1^{\beta_1} x_2^{\beta_2}$
where the $\beta$'s are unknown parameters, can be written in
the multiple linear regression model form if we take logarithms
of both sides. This results in the equation
$\log y = \log \beta_0 + \beta_1 \log x_1 + \beta_2 \log x_2$
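As a small illustration (not part of the lecture), the fit after this transformation is an ordinary linear regression on the logged variables. A Python sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data generated from y = b0 * x1^b1 * x2^b2 (with multiplicative error)
x1 = rng.uniform(1, 10, 40)
x2 = rng.uniform(1, 10, 40)
y = 2.0 * x1**1.3 * x2**(-0.5) * np.exp(rng.normal(0, 0.05, 40))

# Take logs of both sides and fit a multiple linear regression
X = np.column_stack([np.ones(40), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

b0, b1, b2 = np.exp(coef[0]), coef[1], coef[2]
print(b0, b1, b2)   # estimates should be close to 2.0, 1.3, and -0.5
```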

8-1 Diagnostics: Multicollinearity
Multicollinearity is a statistical phenomenon in which two or
more columns of the X matrix are highly correlated.
This suggests redundant columns (or variables).
The extra columns (or variables) need to be deleted from X (or
the model) to eliminate multicollinearity.

For example, consider the following predictor variables:
income, expenditure, and saving
Linear dependence: saving = income - expenditure
Thus, only two of the three variables should be included in the
regression as predictor variables.
percentages of different ingredients used in a product
Linear dependence: all the ingredient percentages add up to 100%
Thus, one of the ingredient percentages should not be included in the
regression.
8-1 Diagnostics: Multicollinearity
Multicollinearity can cause serious numerical and statistical
difficulties in multiple linear regression
The least squares estimate is $\hat{\beta} = (X'X)^{-1}X'y$.
$X'X$ must be invertible (i.e., nonsingular) in order for $\hat{\beta}$ to be
unique and computable.
If the columns of X are approximately linearly dependent, then
$X'X$ is nearly singular, which makes $\hat{\beta}$ numerically unstable.
Most of the estimated coefficients have very large standard
errors, and as a result are statistically non-significant even if
the overall F-statistic is significant.
This is because the matrix $V = (X'X)^{-1}$ has very large
elements, and hence the variances
$\mathrm{Var}[\hat{\beta}_j] = \sigma^2 V_{jj}$
are large.
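A small numerical illustration of this point (made-up data, Python): as one column of X becomes nearly a copy of another, the diagonal elements of V = (X'X)^{-1} blow up, and with them the variances of the coefficient estimates.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x1 = rng.normal(0, 1, n)

for noise in (1.0, 0.1, 0.01):            # x2 nearly duplicates x1 as the noise shrinks
    x2 = x1 + rng.normal(0, noise, n)
    X = np.column_stack([np.ones(n), x1, x2])
    V = np.linalg.inv(X.T @ X)            # Var[beta_hat_j] = sigma^2 * V_jj
    print(noise, np.round(np.diag(V), 3)) # diagonal entries grow as noise -> 0
```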
8-1 Diagnostics: Multicollinearity
Example: Cement

This table gives data on the heat evolved in calories during hardening of
cement on a per gram basis (y) along with the percentages of four
ingredients: tricalcium aluminate (x1), tricalcium silicate (x2),
tetracalcium alumino ferrite (x3), and dicalcium silicate (x4).

 i   x1   x2   x3   x4      y
 1    7   26    6   60   78.5
 2    1   29   15   52   74.3
 3   11   56    8   20  104.3
 4   11   31    8   47   87.6
 5    7   52    6   33   95.9
 6   11   55    9   22  109.2
 7    3   71   17    6  102.7
 8    1   31   22   44   72.5
 9    2   54   18   22   93.1
10   21   47    4   26  115.9
11    1   40   23   34   83.8
12   11   66    9   12  113.3
13   10   68    8   12  109.4

8-1 Diagnostics: Multicollinearity
Example (contd)
First, note that x1, x2, x3, and x4 add up to approximately 100% for all
observations. This approximate linear relationship among the x's results
in multicollinearity.

Test the multicollinearity in JMP:

8-1 Diagnostics: Multicollinearity
Example (contd)
The correlation matrix among all four predictor variables is as follows.

8-1 Diagnostics: Multicollinearity
Variance Inflation Factors (VIF):
A direct measure of multicollinearity.
$VIF_j = \frac{1}{1 - r_j^2}, \quad j = 1, 2, \ldots, p$

$r_j^2$: the coefficient of multiple determination (r squared) when
regressing xj on the remaining p-1 predictor variables.

If xj is approximately linearly dependent on the other predictor
variables, then $r_j^2$ will be close to 1, and VIFj will be large.

Generally, VIFj > 10 (corresponding to $r_j^2 > 0.9$)
is regarded as unacceptable.
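A minimal Python sketch of this calculation, using the cement data from the table above (the lecture does this in JMP; the helper function below is just for illustration):

```python
import numpy as np

# Cement (Hald) data from the table above
x1 = np.array([ 7,  1, 11, 11,  7, 11,  3,  1,  2, 21,  1, 11, 10], float)
x2 = np.array([26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68], float)
x3 = np.array([ 6, 15,  8,  8,  6,  9, 17, 22, 18,  4, 23,  9,  8], float)
x4 = np.array([60, 52, 20, 47, 33, 22,  6, 44, 22, 26, 34, 12, 12], float)
predictors = {"x1": x1, "x2": x2, "x3": x3, "x4": x4}
n = len(x1)

def r_squared(y, X):
    """r^2 from regressing y on X (X already contains an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

for name, xj in predictors.items():
    others = [v for k, v in predictors.items() if k != name]
    X = np.column_stack([np.ones(n)] + others)
    rj2 = r_squared(xj, X)           # regress x_j on the remaining predictors
    vif = 1.0 / (1.0 - rj2)          # VIF_j = 1 / (1 - r_j^2)
    print(name, round(vif, 1))       # all four VIFs exceed 10 for these data
```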
8-1 Diagnostics: Multicollinearity
Example: JMP Regression
JMP: Analyze > Fit Model:
8-1 Diagnostics: Multicollinearity
Example: JMP Regression

Multicollinearity exists
(all VIFs are greater than 10).
All the estimated coefficients
are nonsignificant (all the
coefficients have p-values
> 0.05), although the overall
F = 111.48 is highly
significant.
We will see later that the
best model includes only x1
and x2.
8-2 Variable Selection
How to decide which variables should be included in the regression?
In practice, a common problem is that there is a large set of
candidate variables
We want to select a small subset of variables to perform
regression.

8-2 Variable Selection
Thus, we need methods to systematically select the best subset of
variables
Stepwise regression
Best subsets regression

How to decide which variables should be included in the regression?


The goal of variable selection is to choose a small subset of
predictors from the larger set of candidate predictors so that the
resulting regression model is simple yet useful.
That is, as always, our resulting regression model should:
provide a good summary of the trend in the response, and/or
provide good predictions of the response, and/or
provide good estimates of the slope coefficients.

8-2 Variable Selection: Stepwise Regression
The general idea behind stepwise regression is that we build
our model from a set of candidate predictor variables by
entering and removing predictors in a stepwise manner
into our model until there is no justifiable reason to enter or
remove any more.
This is done by taking into account the marginal contribution
of each variable to the model given the contribution of the
other variables already present in the model.
Review Slides 29-47, Lecture 7.
We need methods to systematically select the best subset of
variables

8-2 Variable Selection: Stepwise Regression
Overview of the stepwise regression procedure
First, we start with no predictors in our "stepwise model."
Then, at each step along the way we either enter or remove a
predictor based on the partial F-tests that is, the t-tests for
the slope parameters that are obtained.
We stop when no more predictors can be justifiably entered
or removed from our stepwise model, thereby leading us to a
"final model."

8-2 Variable Selection: Stepwise Regression
Stepwise regression procedure
Starting the procedure.
Set a significance level for deciding when to enter a predictor into the
stepwise model: Alpha-to-Enter, denoted αE.
JMP default: αE = 0.25
Also set a significance level for deciding when to remove a predictor from
the stepwise model: Alpha-to-Remove, denoted αR.
JMP default: αR = 0.25
Step #1.
Fit each of the one-predictor models: regress y on x1, regress y on x2, ...,
and regress y on xp.
Of those predictors whose t-test P-value is less than αE, the 1st predictor
put in the stepwise model is the predictor that has the smallest t-test
P-value.
If no predictor has a t-test P-value less than αE, stop.
8-2 Variable Selection: Stepwise Regression
Stepwise regression procedure
Step #2.
Suppose in Step #1, x1 had the smallest t-test P-value below αE and
therefore was deemed the "best" first predictor arising from Step #1.
1) Now, fit each of the two-predictor models that include x1 as a
predictor: regress y on x1 and x2, regress y on x1 and x3, ..., and regress
y on x1 and xp.
2) Of those predictors whose t-test P-value is less than αE, the 2nd
predictor put in the stepwise model is the predictor that has the
smallest t-test P-value.
3) If no predictor has a t-test P-value less than αE, stop. The model
with the one predictor obtained from the first step is your final
model. Otherwise, continue. Suppose that x2 was deemed the "best"
2nd predictor and it is therefore entered into the stepwise model.
8-2 Variable Selection: Stepwise Regression
Stepwise regression procedure
Step #2 (contd).
4) Since x1 was the 1st predictor in the model, step back and see if entering
x2 into the stepwise model somehow affected the significance of the x1
predictor. That is, check the t-test P-value for testing β1 = 0. If the t-test
P-value for β1 = 0 has become not significant (that is, the P-value is greater
than αR), remove x1 from the stepwise model.
Step #3.
Suppose both x1 and x2 made it into the two-predictor stepwise model.
1) Now, fit each of the three-predictor models that include x1 and x2 as
predictors: that is, regress y on x1, x2, and x3, regress y on x1, x2, and x4, ...,
and regress y on x1, x2, and xp.
2) Of those predictors whose t-test P-value is less than αE, the 3rd
predictor put in the stepwise model is the predictor that has the smallest
t-test P-value.
8-2 Variable Selection: Stepwise Regression
Stepwise regression procedure
Step #3 (contd).
3) If no predictor has a t-test P-value less than αE, stop. The model
containing the two predictors obtained from the second step is your final
model. Otherwise, continue and suppose that x3 was deemed the "best" 3rd
predictor and it is therefore entered into the stepwise model.
4) Now, since x1 and x2 were the first predictors in the model, step back
and see if entering x3 into the stepwise model somehow affected the
significance of the x1 and x2 predictors. That is, check the t-test P-values
for testing β1 = 0 and β2 = 0. If the t-test P-value for either β1 = 0 or β2 = 0
has become not significant (that is, the P-value is greater than αR),
remove that predictor from the stepwise model.
Stopping the procedure. Continue the steps as described above
until adding an additional predictor does not yield a t-test P-value
below αE.
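The following Python sketch mirrors this procedure (the course uses JMP's stepwise platform; this is only to make the logic concrete). It assumes statsmodels is available and that X is a pandas DataFrame whose columns are the candidate predictors; the function name stepwise is my own.

```python
import statsmodels.api as sm

def stepwise(X, y, alpha_enter=0.15, alpha_remove=0.15):
    """Stepwise selection driven by t-test p-values, as described above.

    X: pandas DataFrame of candidate predictors; y: response vector.
    """
    selected = []
    while True:
        # Try to enter the candidate with the smallest p-value below alpha_enter.
        candidates = [c for c in X.columns if c not in selected]
        best, best_p = None, 1.0
        for c in candidates:
            fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            if fit.pvalues[c] < best_p:
                best, best_p = c, fit.pvalues[c]
        if best is None or best_p >= alpha_enter:
            break                        # nothing left to enter: stop
        selected.append(best)

        # Step back: remove any earlier predictor whose p-value rose above alpha_remove.
        fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
        for c in selected[:-1]:
            if fit.pvalues[c] > alpha_remove:
                selected.remove(c)
    return selected

# Example call on the cement data (X = DataFrame with columns x1..x4):
#   stepwise(X, y)   ->   expected to end with ['x1', 'x2'], as in the slides
```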
8-2 Variable Selection: Stepwise Regression
Cement Example: the scatter plot matrix of the data [plot omitted]

From the plots, we can get a hunch of which predictors are
good candidates for being the first to enter the stepwise model.
8-2 Variable Selection: Stepwise Regression
Cement Example
To start our stepwise regression procedure, let's set our Alpha-to-Enter at
αE = 0.15, and let's set our Alpha-to-Remove at αR = 0.15.
Step #1: regressing y on x1, regressing y on x2, regressing y on x3, and
regressing y on x4, we obtain:

[JMP output omitted; the predictor with the smallest p-value (largest |T|) is chosen.]
8-2 Variable Selection: Stepwise Regression
Cement Example
Step #1 (contd): Thus, we enter x4 into our stepwise model.
Step #2: We fit each of the two-predictor models that include x4 as a
predictor: we regress y on x4 and x1, regress y on x4 and x2, and regress y
on x4 and x3, obtaining:

[JMP output omitted; the predictor with the smallest p-value (largest |T|) is chosen.]
8-2 Variable Selection: Stepwise Regression
Cement Example
Step #2 (contd):
As a result of Step #2, we enter x1 into our stepwise model.
Now, since x4 was the first predictor in the model, we must step
back and see if entering x1 into the stepwise model affected the
significance of the x4 predictor. It did not: the t-test P-value for
testing β4 = 0 is less than 0.0001, smaller than αR = 0.15. Therefore,
we proceed to the third step with both x1 and x4 as predictors in our
stepwise model.
Step #3:
We fit each of the three-predictor models that include x1 and
x4 as predictors; that is, we regress y on x4, x1, and x2; and
we regress y on x4, x1, and x3, obtaining:

8-2 Variable Selection: Stepwise Regression
Cement Example

[JMP output omitted; the predictor with the smallest p-value (largest |T|) is chosen.]

Step #3 (contd):
As a result of the third step, we enter x2 into our stepwise model.
Since x1 and x4 were the first predictors in the model, we need to see if
entering x2 into the stepwise model affected the significance of x1 and x4.
Indeed, it did: the t-test P-value for testing β4 = 0 is 0.2054, greater than
αR = 0.15. Thus, we remove x4 from the stepwise model, leaving us with
the predictors x1 and x2 in our stepwise model:
8-2 Variable Selection: Stepwise Regression
Cement Example

Repeat Step #3: Now, we proceed to fit each of the 3-predictor models
that include x1 and x2 as predictors; that is, we regress y on x1, x2, and
x3; and we regress y on x1, x2, and x4, obtaining:

8-2 Variable Selection: Stepwise Regression
Cement Example
Neither of the remaining predictors x3 and x4 is eligible for entry
into our stepwise model, because each t-test P-value (0.209 and 0.205,
respectively) is greater than αE = 0.15.
Thus, we stop our stepwise regression procedure. Our final regression
model, based on the stepwise procedure, contains only the predictors x1
and x2:

8-2 Variable Selection: Stepwise Regression
Stepwise regression in JMP: Cement Example
JMP stepwise regression output: [output omitted]
8-2 Variable Selection: Stepwise Regression
Stepwise regression in JMP: Cement Example
JMP output explanation:
Note that JMP counts as a step any addition or removal of a
predictor from the stepwise model.
JMP Step 1: The predictor x4 is entered into the stepwise model.
JMP Step 2: The predictor x1 is entered into the stepwise model
already containing the predictor x4.
JMP Step 3: The predictor x2 is entered into the stepwise model
already containing the predictors x1 and x4.
JMP Step 4: The predictor x4 is removed from the stepwise
model containing the predictors x1, x2, and x4, leaving us with
the final model containing only the predictors x1 and x2.
8-2 Variable Selection: Stepwise Regression
The final model is not guaranteed to be optimal in any specified
sense.

The procedure yields a single final model, although there are often
several equally good models.

Stepwise regression does not take into account a researcher's


knowledge about the predictors. It may be necessary to force the
procedure to include important predictors.

One should not jump to the conclusion that all the important
predictor variables for predicting y have been identified, or that all
the unimportant predictor variables have been eliminated.
It's for all of these reasons that one should be careful not to overuse or
overstate the results of any stepwise regression procedure.
8-2 Variable Selection: Best Subsets
The general idea behind best subsets regression is that we select the
subset of predictors that do the best at meeting some well-defined
objective criterion, such as having the largest R2 value or the smallest
MSE.
The procedure of best subsets regression
Step #1. Identify all of the possible regression models derived from
all of the possible combinations of the candidate predictors.
e.g., suppose we have three candidate predictors: x1, x2, and x3;
then there are 8 possible regression models we can consider:
the one model with no predictors
the three models with only one predictor each
the three models with two predictors each
the one model with all three predictors
In general, if there are p candidate predictors, then there are 2^p
possible regression models.
8-2 Variable Selection: Best Subsets
Step #2. From the possible models identified in Step #1, find
the one-predictor models that do the "best" at meeting some well-
defined criteria,
the two-predictor models that do the "best,"
the three-predictor models that do the "best," and so on.
It cuts down considerably the number of models to consider!
But what do we mean by "best"?
We might not be able to agree on what's best!
In thinking about what "best" means, you might have thought of
any of the following:
the model with the largest r-squared
the model with the smallest MSE
the model with the smallest AICc and
BIC (not required in this course)
Another criterion: Mallows' Cp
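To make Steps #1-#2 concrete, here is a rough Python sketch that enumerates all 2^p subsets, fits each model, and reports the best model of each size by r^2; it is illustrative only, using three made-up predictors as in the example above (the course does this with JMP's all-possible-models report).

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

# Made-up data with three candidate predictors (2^3 = 8 possible models)
n = 20
data = {"x1": rng.normal(size=n), "x2": rng.normal(size=n), "x3": rng.normal(size=n)}
y = 5 + 2 * data["x1"] - 1 * data["x2"] + rng.normal(0, 0.5, n)

results = []
for k in range(len(data) + 1):
    for subset in itertools.combinations(data, k):
        X = np.column_stack([np.ones(n)] + [data[v] for v in subset])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = np.sum((y - X @ beta) ** 2)
        sst = np.sum((y - y.mean()) ** 2)
        r2 = 1 - sse / sst                    # r^2 = 1 - SSE/SST
        mse = sse / (n - X.shape[1])          # MSE of this model
        results.append((subset, round(r2, 3), round(mse, 3)))

# Best (largest r^2) model for each number of predictors
for k in range(len(data) + 1):
    best = max((r for r in results if len(r[0]) == k), key=lambda r: r[1])
    print(k, best)
```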
8-2 Variable Selection: Best Subsets
Step #3. Further evaluate and refine the handful of models identified
in the last step. This might entail
performing residual analyses
transforming the predictors and/or response
including additional predictor variables as candidates
considering other factors such as simplicity of the models (# of
variables included), ease of observing/controlling the variables
included in the models
Do this until you are satisfied that you have found a model that
does a good job of summarizing the trend in the data
and most importantly, allows you to address your practical concerns

8-2 Variable Selection: Best Subsets

$r^2 = 1 - \frac{SSE}{SST}$

r2 can only increase as more variables are added. Thus, it makes no
sense to define the "best" model as the model with the largest r2-value.
JMP highlights the best model for each number of predictors based on
the size of the r2-value; that is, JMP highlights the one-predictor
model with the largest r2-value, followed by the two-predictor
model with the largest r2-value, and so on.
8-2 Variable Selection: Best Subsets
The r2-value increases if MSE decreases.
That is, the r2-value and MSE criteria always yield the same
"best" models.
In JMP outputs: RMSE = √MSE

The model with the largest r2-value (or equivalently, the smallest
RMSE) is the model with the three predictors x1, x2, and x4.

The model with the smallest AICc (or equivalently, the smallest
BIC) is the model with the two predictors x1 and x2.
8-2 Variable Selection: Best Subsets
Mallows' Cp:

$C_p = \frac{SSE(\mathrm{Reduced})}{MSE(\mathrm{Full})} - (n - 2p) = (n - p)\left[\frac{MSE(\mathrm{Reduced})}{MSE(\mathrm{Full})} - 1\right] + p$
Full: the full model including all the candidate predictor variables
Reduced: the model currently considered, which involves
p unknown parameters
p: the number of unknown parameters (including the constant β0)
E[Cp] = p
A low Cp is desirable, and models with the lowest Cp values are
the best.
To be potentially good, a model should have Cp ≈ p (or smaller).
The model with the lowest Cp is the model with the two
predictors x1 and x2.
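A one-line helper for the Cp formula above (the SSE/MSE numbers in the example call are hypothetical, not the cement values):

```python
def mallows_cp(sse_reduced, mse_full, n, p):
    """Mallows' Cp for a reduced model with p parameters (intercept included)."""
    return sse_reduced / mse_full - (n - 2 * p)

# Hypothetical example: n = 13 observations, reduced model with p = 3 parameters
print(mallows_cp(sse_reduced=58.0, mse_full=6.0, n=13, p=3))   # about 2.67, close to p
```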
8-2 Variable Selection Cement Example
The stepwise regression procedure yields the final stepwise
model with x1 and x2 as predictors.
The best subsets regression procedure yields:
Based on the r2-value and MSE criteria, the "best" model has the
three predictors x1, x2, and x4.
Based on the AICc, BIC, and Cp criteria, the best model has
the two predictors x1 and x2.
Which model should we go with?
That's where the final step, the refining step, comes into play.
In the refining step, we evaluate each of the models identified by
the best subsets and stepwise procedures to see if there is a
reason to select one of the models over the other.
When selecting a final model, don't forget why you are
performing the research to begin with; the reason may make
the choice of the model obvious.
8-2 Variable Selection Cement Example
With VIFs of 18.8 and 18.9 (for x2 and x4), the three-predictor
model exhibits substantial multicollinearity. (Recall that the
predictors x2 and x4 are strongly negatively correlated.)

Unless there is a good scientific reason to go with the larger
model, it makes more sense to go with the smaller, simpler model
containing the two predictors x1 and x2, for which
the VIFs are quite satisfactory (both about 1.1).
Moreover (next page):
8-2 Variable Selection Cement Example
Moreover, for the simpler model containing the two predictors
x1 and x2:
The r2-value (97.9%) is large.
The residual analysis yields no concerns.
