Você está na página 1de 42

?

Lecture 8. Build the MLR: Diagnostics & Model Selection
IENG 314
Feng Yang
West Virginia University
8-1 Diagnostics: Residual Analysis
As in Lecture 5, many of the regression diagnostics are based on a
graphical examination of the residuals. Thus, it is useful to derive
some properties of the vectors of residuals and fitted values.

The vector of fitted values is $\hat{y} = Hy$, with $H = X(X'X)^{-1}X'$.

The hat matrix H is an $n \times n$ matrix with elements $h_{ij}$; we have

$\hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j, \quad i = 1, 2, \ldots, n$

Using linear algebra, it can be shown that $\sum_{i=1}^{n} h_{ii} = p + 1$.
8-1 Diagnostics: Residual Analysis
The residual vector is given by $e = y - \hat{y} = y - Hy = (I - H)y$,
where I is the identity matrix of order n.

It can be shown that the e_i are individually normally distributed with zero means and

$\mathrm{Var}[e_i] = \sigma^2 (1 - h_{ii})$

Therefore, $SE[e_i] = s\sqrt{1 - h_{ii}}$, where s is the estimate of $\sigma$.

Standardized residuals are given by:

$e_i^* = \frac{e_i}{SE[e_i]} = \frac{e_i}{s\sqrt{1 - h_{ii}}}$
If $|e_i^*| > 2$, the corresponding observation may be regarded as an outlier.
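These quantities can be computed directly with basic linear algebra. Below is a minimal sketch in Python/NumPy (the course uses JMP, so this is purely illustrative; the data values are made up):

```python
import numpy as np

# Made-up data: n = 6 observations, two predictor variables
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.0])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])
n, k = X.shape                       # k = p + 1 parameters

# Hat matrix H = X (X'X)^{-1} X'; its diagonal gives the leverages h_ii
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
print(h.sum())                       # trace of H equals p + 1 = 3

# Residuals e = (I - H) y and standardized residuals e* = e / (s * sqrt(1 - h_ii))
e = y - H @ y
s = np.sqrt(e @ e / (n - k))         # estimate of sigma
e_star = e / (s * np.sqrt(1 - h))
print(np.where(np.abs(e_star) > 2))  # flag potential outliers
```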
8-1 Diagnostics: Residual Analysis
Similar residual plots as those discussed in Lecture 5 can be made
to evaluate the goodness of fit of the model and to check if the
underlying assumptions are met.

Checking model assumptions
(1) Checking for the adequacy of the functional form of the model
(2) Checking for constant variance
(3) Checking for normality
(4) Checking for independence
8-1 Diagnostics: Residual Analysis
Residual plots for regression diagnostics
(1) Plots of the residuals against individual predictor variables:
These are used to check linearity w.r.t. the xj's.
Each plot should be random.
A systematic pattern indicates the necessity of adding
nonlinear terms in the corresponding xj to the model.
(2) A plot of the residuals against the fitted values: This is used to
check the assumption of constant variance.
Variability in the residuals ei should be roughly constant
over the range of the fitted response values.
A systematic pattern (e.g., a funnel shape) indicates non-constant
variance and suggests a variance-stabilizing transformation of y.
8-1 Diagnostics: Residual Analysis
Residual plots for regression diagnostics
(3) A normal plot of the residuals: This is used to check the
normality assumption.
(4) A run chart of the residuals: If the data are obtained
sequentially over time, this plot should be used to check
whether the random errors are autocorrelated and/or if there
are any time trends.
(5) Plots of the residuals against any omitted predictor
variables
This is used to check if any of the omitted predictor
variables should be included in the model. If a plot
shows a systematic pattern, then inclusion of that
omitted variable in the model is indicated.
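For illustration, the following Python sketch (using matplotlib; the course produces these plots in JMP) generates made-up data, fits a model by least squares, and draws several of the plots listed above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)

# Made-up data for illustration only
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2 + 1.5 * x1 - 0.8 * x2 + rng.normal(0, 1, n)

# Least squares fit of y on x1 and x2
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

fig, ax = plt.subplots(2, 2, figsize=(9, 7))
ax[0, 0].scatter(x1, resid)              # residuals vs a predictor: check linearity
ax[0, 0].set_title("residuals vs x1")
ax[0, 1].scatter(fitted, resid)          # residuals vs fitted: check constant variance
ax[0, 1].set_title("residuals vs fitted")
stats.probplot(resid, plot=ax[1, 0])     # normal plot: check normality
ax[1, 1].plot(resid, marker="o")         # run chart: check independence over time
ax[1, 1].set_title("run chart of residuals")
plt.tight_layout()
plt.show()
```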
8-1 Diagnostics: Data Transformation
As discussed in Lecture 5, transformations of the variables (both
y and the x's) are often necessary to satisfy the assumptions of
linearity, normality, and constant error variance.
Many seemingly nonlinear models can be written in the multiple
linear regression model form after making a suitable
transformation. For example, the model
$y = \beta_0 x_1^{\beta_1} x_2^{\beta_2}$
where the $\beta$'s are unknown parameters, can be written in
the multiple linear regression model form if we take logarithms
of both sides. This results in the equation
$\log y = \log \beta_0 + \beta_1 \log x_1 + \beta_2 \log x_2$
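As a small illustration (not part of the lecture), the fit after this transformation is an ordinary linear regression on the logged variables. A Python sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data generated from y = b0 * x1^b1 * x2^b2 (with multiplicative error)
x1 = rng.uniform(1, 10, 40)
x2 = rng.uniform(1, 10, 40)
y = 2.0 * x1**1.3 * x2**(-0.5) * np.exp(rng.normal(0, 0.05, 40))

# Take logs of both sides and fit a multiple linear regression
X = np.column_stack([np.ones(40), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

b0, b1, b2 = np.exp(coef[0]), coef[1], coef[2]
print(b0, b1, b2)   # estimates should be close to 2.0, 1.3, and -0.5
```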

8-1 Diagnostics: Multicollinearity
Multicollinearity is a statistical phenomenon in which two or
more columns of the X matrix are highly correlated.
This suggests redundant columns (or variables).
The extra columns (or variables) need to be deleted from X (or
the model) to eliminate multicollinearity.

For example, consider the following predictor variables:
income, expenditure, and saving
Linear dependence: saving = income - expenditure
Thus, only two of the three variables should be included in the
regression as predictor variables.
percentages of different ingredients used in a product
Linear dependence: all the ingredient percentages add up to 100%
Thus, one of the ingredient percentages should not be included in the
regression.
8-1 Diagnostics: Multicollinearity
Multicollinearity can cause serious numerical and statistical
difficulties in multiple linear regression
The least squares estimate is $\hat{\beta} = (X'X)^{-1}X'y$.
$X'X$ must be invertible (i.e., nonsingular) in order for $\hat{\beta}$ to be
unique and computable.
If the columns of X are approximately linearly dependent, then
$X'X$ is nearly singular, which makes $\hat{\beta}$ numerically unstable.
Most of the estimated coefficients have very large standard
errors, and as a result are statistically non-significant even if
the overall F-statistic is significant.
This is because the matrix $V = (X'X)^{-1}$ has very large
elements, and hence the variances
$\mathrm{Var}[\hat{\beta}_j] = \sigma^2 V_{jj}$
are large.
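A small numerical illustration of this point (made-up data, Python): as one column of X becomes nearly a copy of another, the diagonal elements of V = (X'X)^{-1} blow up, and with them the variances of the coefficient estimates.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x1 = rng.normal(0, 1, n)

for noise in (1.0, 0.1, 0.01):            # x2 nearly duplicates x1 as the noise shrinks
    x2 = x1 + rng.normal(0, noise, n)
    X = np.column_stack([np.ones(n), x1, x2])
    V = np.linalg.inv(X.T @ X)            # Var[beta_hat_j] = sigma^2 * V_jj
    print(noise, np.round(np.diag(V), 3)) # diagonal entries grow as noise -> 0
```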
8-1 Diagnostics: Multicollinearity
Example: Cement

This table gives data on the heat evolved in calories during hardening of
cement on a per gram basis (y) along with the percentages of four
ingredients: tricalcium aluminate (x1), tricalcium silicate (x2),
tetracalcium alumino ferrite (x3), and dicalcium silicate (x4).

 i   x1   x2   x3   x4      y
 1    7   26    6   60   78.5
 2    1   29   15   52   74.3
 3   11   56    8   20  104.3
 4   11   31    8   47   87.6
 5    7   52    6   33   95.9
 6   11   55    9   22  109.2
 7    3   71   17    6  102.7
 8    1   31   22   44   72.5
 9    2   54   18   22   93.1
10   21   47    4   26  115.9
11    1   40   23   34   83.8
12   11   66    9   12  113.3
13   10   68    8   12  109.4

8-1 Diagnostics: Multicollinearity
Example (contd)
First, note that x1, x2, x3, and x4 add up to approximately 100% for all
observations. This approximate linear relationship among the x's results
in multicollinearity.

Test the multicollinearity in JMP:

8-1 Diagnostics: Multicollinearity
Example (contd)
The correlation matrix among all four predictor variables is as follows.

8-1 Diagnostics: Multicollinearity
Variance Inflation Factors (VIF):
A direct measure of multicollinearity.
$VIF_j = \frac{1}{1 - r_j^2}, \quad j = 1, 2, \ldots, p$

$r_j^2$: the coefficient of multiple determination (r squared) when
regressing xj on the remaining p-1 predictor variables.

If xj is approximately linearly dependent on the other predictor
variables, then $r_j^2$ will be close to 1, and VIFj will be large.

Generally, VIFj > 10 (corresponding to $r_j^2 > 0.9$)
is regarded as unacceptable.
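A minimal Python sketch of this calculation, using the cement data from the table above (the lecture does this in JMP; the helper function below is just for illustration):

```python
import numpy as np

# Cement (Hald) data from the table above
x1 = np.array([ 7,  1, 11, 11,  7, 11,  3,  1,  2, 21,  1, 11, 10], float)
x2 = np.array([26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68], float)
x3 = np.array([ 6, 15,  8,  8,  6,  9, 17, 22, 18,  4, 23,  9,  8], float)
x4 = np.array([60, 52, 20, 47, 33, 22,  6, 44, 22, 26, 34, 12, 12], float)
predictors = {"x1": x1, "x2": x2, "x3": x3, "x4": x4}
n = len(x1)

def r_squared(y, X):
    """r^2 from regressing y on X (X already contains an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

for name, xj in predictors.items():
    others = [v for k, v in predictors.items() if k != name]
    X = np.column_stack([np.ones(n)] + others)
    rj2 = r_squared(xj, X)           # regress x_j on the remaining predictors
    vif = 1.0 / (1.0 - rj2)          # VIF_j = 1 / (1 - r_j^2)
    print(name, round(vif, 1))       # all four VIFs exceed 10 for these data
```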
8-1 Diagnostics: Multicollinearity
Example: JMP Regression
JMP: Analyze > Fit Model:
8-1 Diagnostics: Multicollinearity
Example: JMP Regression

Multicollinearity exists
(all VIFs are greater than 10).
All the estimated coefficients
are nonsignificant (all the
coefficients have p-values
> 0.05), although the overall
F = 111.48 is highly
significant.
We will see later that the
best model includes only x1
and x2.
8-2 Variable Selection
How to decide which variables should be included in the regression?
In practice, a common problem is that there is a large set of
candidate variables
We want to select a small subset of variables to perform
regression.

8-2 Variable Selection
Thus, we need methods to systematically select the best subset of
variables
Stepwise regression
Best subsets regression

How to decide which variables should be included in the regression?


The goal of variable selection is to choose a small subset of
predictors from the larger set of candidate predictors so that the
resulting regression model is simple yet useful.
That is, as always, our resulting regression model should:
provide a good summary of the trend in the response, and/or
provide good predictions of the response, and/or
provide good estimates of the slope coefficients.

8-2 Variable Selection: Stepwise Regression
The general idea behind stepwise regression is that we build
our model from a set of candidate predictor variables by
entering and removing predictors in a stepwise manner
into our model until there is no justifiable reason to enter or
remove any more.
This is done by taking into account the marginal contribution
of each variable to the model given the contribution of the
other variables already present in the model.
Review Slides 29-47, Lecture 7.
We need methods to systematically select the best subset of
variables

8-2 Variable Selection: Stepwise Regression
Overview of the stepwise regression procedure
First, we start with no predictors in our "stepwise model."
Then, at each step along the way we either enter or remove a
predictor based on the partial F-tests that is, the t-tests for
the slope parameters that are obtained.
We stop when no more predictors can be justifiably entered
or removed from our stepwise model, thereby leading us to a
"final model."

8-2 Variable Selection: Stepwise Regression
Stepwise regression procedure
Starting the procedure.
Set a significance level for deciding when to enter a predictor into the
stepwise model: Alpha-to-Enter, denoted αE.
JMP default: αE = 0.25
Also set a significance level for deciding when to remove a predictor from
the stepwise model: Alpha-to-Remove, denoted αR.
JMP default: αR = 0.25
Step #1.
Fit each of the one-predictor models: regress y on x1, regress y on x2, ...,
and regress y on xp.
Of those predictors whose t-test P-value is less than αE, the 1st predictor
put in the stepwise model is the predictor that has the smallest t-test
P-value.
If no predictor has a t-test P-value less than αE, stop.
8-2 Variable Selection: Stepwise Regression
Stepwise regression procedure
Step #2.
Suppose in Step #1, x1 had the smallest t-test P-value below αE and
therefore was deemed the "best" first predictor arising from Step #1.
1) Now, fit each of the two-predictor models that include x1 as a
predictor: regress y on x1 and x2, regress y on x1 and x3, ..., and regress
y on x1 and xp.
2) Of those predictors whose t-test P-value is less than αE, the 2nd
predictor put in the stepwise model is the predictor that has the
smallest t-test P-value.
3) If no predictor has a t-test P-value less than αE, stop. The model
with the one predictor obtained from the first step is your final
model. Otherwise, continue. Suppose that x2 was deemed the "best"
2nd predictor and it is therefore entered into the stepwise model.
8-2 Variable Selection: Stepwise Regression
Stepwise regression procedure
Step #2 (contd).
4) Since x1 was the 1st predictor in the model, step back and see if entering
x2 into the stepwise model somehow affected the significance of the x1
predictor. That is, check the t-test P-value for testing β1 = 0. If the t-test
P-value for β1 = 0 has become not significant (that is, the P-value is greater
than αR), remove x1 from the stepwise model.
Step #3.
Suppose both x1 and x2 made it into the two-predictor stepwise model.
1) Now, fit each of the three-predictor models that include x1 and x2 as
predictors: that is, regress y on x1, x2, and x3, regress y on x1, x2, and x4, ...,
and regress y on x1, x2, and xp.
2) Of those predictors whose t-test P-value is less than αE, the 3rd
predictor put in the stepwise model is the predictor that has the smallest
t-test P-value.
8-2 Variable Selection: Stepwise Regression
Stepwise regression procedure
Step #3 (contd).
3) If no predictor has a t-test P-value less than αE, stop. The model
containing the two predictors obtained from the second step is your final
model. Otherwise, continue and suppose that x3 was deemed the "best" 3rd
predictor and it is therefore entered into the stepwise model.
4) Now, since x1 and x2 were the first predictors in the model, step back
and see if entering x3 into the stepwise model somehow affected the
significance of the x1 and x2 predictors. That is, check the t-test P-values
for testing β1 = 0 and β2 = 0. If the t-test P-value for either β1 = 0 or β2 = 0
has become not significant (that is, the P-value is greater than αR),
remove that predictor from the stepwise model.
Stopping the procedure. Continue the steps as described above
until adding an additional predictor does not yield a t-test P-value
below αE.
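The following Python sketch mirrors this procedure (the course uses JMP's stepwise platform; this is only to make the logic concrete). It assumes statsmodels is available and that X is a pandas DataFrame whose columns are the candidate predictors; the function name stepwise is my own.

```python
import statsmodels.api as sm

def stepwise(X, y, alpha_enter=0.15, alpha_remove=0.15):
    """Stepwise selection driven by t-test p-values, as described above.

    X: pandas DataFrame of candidate predictors; y: response vector.
    """
    selected = []
    while True:
        # Try to enter the candidate with the smallest p-value below alpha_enter.
        candidates = [c for c in X.columns if c not in selected]
        best, best_p = None, 1.0
        for c in candidates:
            fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            if fit.pvalues[c] < best_p:
                best, best_p = c, fit.pvalues[c]
        if best is None or best_p >= alpha_enter:
            break                        # nothing left to enter: stop
        selected.append(best)

        # Step back: remove any earlier predictor whose p-value rose above alpha_remove.
        fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
        for c in selected[:-1]:
            if fit.pvalues[c] > alpha_remove:
                selected.remove(c)
    return selected

# Example call on the cement data (X = DataFrame with columns x1..x4):
#   stepwise(X, y)   ->   expected to end with ['x1', 'x2'], as in the slides
```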
8-2 Variable Selection: Stepwise Regression
Cement Example: the scatter plot matrix of the data [plot omitted]

From the plots, we can get a hunch of which predictors are
good candidates for being the first to enter the stepwise model.
8-2 Variable Selection: Stepwise Regression
Cement Example
To start our stepwise regression procedure, let's set our Alpha-to-Enter at
αE = 0.15, and let's set our Alpha-to-Remove at αR = 0.15.
Step #1: regressing y on x1, regressing y on x2, regressing y on x3, and
regressing y on x4, we obtain:

[JMP output omitted; the predictor with the smallest p-value (largest |T|) is chosen.]
8-2 Variable Selection: Stepwise Regression
Cement Example
Step #1 (contd): Thus, we enter x4 into our stepwise model.
Step #2: We fit each of the two-predictor models that include x4 as a
predictor: we regress y on x4 and x1, regress y on x4 and x2, and regress y
on x4 and x3, obtaining:

[JMP output omitted; the predictor with the smallest p-value (largest |T|) is chosen.]
8-2 Variable Selection: Stepwise Regression
Cement Example
Step #2 (contd):
As a result of Step #2, we enter x1 into our stepwise model.
Now, since x4 was the first predictor in the model, we must step
back and see if entering x1 into the stepwise model affected the
significance of the x4 predictor. It did not: the t-test P-value for
testing β4 = 0 is less than 0.0001, smaller than αR = 0.15. Therefore,
we proceed to the third step with both x1 and x4 as predictors in our
stepwise model.
Step #3:
We fit each of the three-predictor models that include x1 and
x4 as predictors; that is, we regress y on x4, x1, and x2; and
we regress y on x4, x1, and x3, obtaining:

8-2 Variable Selection: Stepwise Regression
Cement Example

[JMP output omitted; the predictor with the smallest p-value (largest |T|) is chosen.]

Step #3 (contd):
As a result of the third step, we enter x2 into our stepwise model.
Since x1 and x4 were the first predictors in the model, we need to see if
entering x2 into the stepwise model affected the significance of x1 and x4.
Indeed, it did: the t-test P-value for testing β4 = 0 is 0.2054, greater than
αR = 0.15. Thus, we remove x4 from the stepwise model, leaving us with
the predictors x1 and x2 in our stepwise model:
8-2 Variable Selection: Stepwise Regression
Cement Example

Repeat Step #3: Now, we proceed to fit each of the 3-predictor models
that include x1 and x2 as predictors; that is, we regress y on x1, x2, and
x3; and we regress y on x1, x2, and x4, obtaining:

8-2 Variable Selection: Stepwise Regression
Cement Example
Neither of the remaining predictors x3 and x4 is eligible for entry
into our stepwise model, because each t-test P-value (0.209 and 0.205,
respectively) is greater than αE = 0.15.
Thus, we stop our stepwise regression procedure. Our final regression
model, based on the stepwise procedure, contains only the predictors x1
and x2:

8-2 Variable Selection: Stepwise Regression
Stepwise regression in JMP: Cement Example
JMP stepwise regression output: [output omitted]
8-2 Variable Selection: Stepwise Regression
Stepwise regression in JMP: Cement Example
JMP output explanation:
Note that JMP counts as a step any addition or removal of a
predictor from the stepwise model.
JMP Step 1: The predictor x4 is entered into the stepwise model.
JMP Step 2: The predictor x1 is entered into the stepwise model
already containing the predictor x4.
JMP Step 3: The predictor x2 is entered into the stepwise model
already containing the predictors x1 and x4.
JMP Step 4: The predictor x4 is removed from the stepwise
model containing the predictors x1, x2, and x4, leaving us with
the final model containing only the predictors x1 and x2.
8-2 Variable Selection: Stepwise Regression
The final model is not guaranteed to be optimal in any specified
sense.

The procedure yields a single final model, although there are often
several equally good models.

Stepwise regression does not take into account a researcher's


knowledge about the predictors. It may be necessary to force the
procedure to include important predictors.

One should not jump to the conclusion that all the important
predictor variables for predicting y have been identified, or that all
the unimportant predictor variables have been eliminated.
It's for all of these reasons that one should be careful not to overuse or
overstate the results of any stepwise regression procedure.
8-2 Variable Selection: Best Subsets
The general idea behind best subsets regression is that we select the
subset of predictors that do the best at meeting some well-defined
objective criterion, such as having the largest R2 value or the smallest
MSE.
The procedure of best subsets regression
Step #1. Identify all of the possible regression models derived from
all of the possible combinations of the candidate predictors.
e.g., suppose we have three candidate predictors: x1, x2, and x3;
then there are 8 possible regression models we can consider:
the one model with no predictors
the three models with only one predictor each
the three models with two predictors each
the one model with all three predictors
In general, if there are p candidate predictors, then there are 2^p
possible regression models.
8-2 Variable Selection: Best Subsets
Step #2. From the possible models identified in Step #1, find
the one-predictor models that do the "best" at meeting some well-
defined criteria,
the two-predictor models that do the "best,"
the three-predictor models that do the "best," and so on.
It cuts down considerably the number of models to consider!
But what do we mean by "best"?
We might not be able to agree on what's best!
In thinking about what "best" means, you might have thought of
any of the following:
the model with the largest r-squared
the model with the smallest MSE
the model with the smallest AICc and
BIC (not required in this course)
Another criterion: Mallows' Cp
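To make Steps #1-#2 concrete, here is a rough Python sketch that enumerates all 2^p subsets, fits each model, and reports the best model of each size by r^2; it is illustrative only, using three made-up predictors as in the example above (the course does this with JMP's all-possible-models report).

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

# Made-up data with three candidate predictors (2^3 = 8 possible models)
n = 20
data = {"x1": rng.normal(size=n), "x2": rng.normal(size=n), "x3": rng.normal(size=n)}
y = 5 + 2 * data["x1"] - 1 * data["x2"] + rng.normal(0, 0.5, n)

results = []
for k in range(len(data) + 1):
    for subset in itertools.combinations(data, k):
        X = np.column_stack([np.ones(n)] + [data[v] for v in subset])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = np.sum((y - X @ beta) ** 2)
        sst = np.sum((y - y.mean()) ** 2)
        r2 = 1 - sse / sst                    # r^2 = 1 - SSE/SST
        mse = sse / (n - X.shape[1])          # MSE of this model
        results.append((subset, round(r2, 3), round(mse, 3)))

# Best (largest r^2) model for each number of predictors
for k in range(len(data) + 1):
    best = max((r for r in results if len(r[0]) == k), key=lambda r: r[1])
    print(k, best)
```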
8-2 Variable Selection: Best Subsets
Step #3. Further evaluate and refine the handful of models identified
in the last step. This might entail
performing residual analyses
transforming the predictors and/or response
including additional predictor variables as candidates
considering other factors such as simplicity of the models (# of
variables included), ease of observing/controlling the variables
included in the models
Do this until you are satisfied that you have found a model that
does a good job of summarizing the trend in the data
and most importantly, allows you to address your practical concerns

8-2 Variable Selection: Best Subsets

$r^2 = 1 - \frac{SSE}{SST}$

r2 can only increase as more variables are added. Thus, it makes no
sense to define the "best" model as the model with the largest r2-value.
JMP highlights the best model for each number of predictors based on
the size of the r2-value; that is, JMP highlights the one-predictor
model with the largest r2-value, followed by the two-predictor
model with the largest r2-value, and so on.
8-2 Variable Selection: Best Subsets
The r2-value increases if MSE decreases.
That is, the r2-value and MSE criteria always yield the same
"best" models.
In JMP outputs: RMSE = √MSE

The model with the largest r2-value (or equivalently, the smallest
RMSE) is the model with the three predictors x1, x2, and x4.

The model with the smallest AICc (or equivalently, the smallest
BIC) is the model with the two predictors x1 and x2.
8-2 Variable Selection: Best Subsets
Mallows' Cp:

$C_p = \frac{SSE(\mathrm{Reduced})}{MSE(\mathrm{Full})} - (n - 2p) = (n - p)\left[\frac{MSE(\mathrm{Reduced})}{MSE(\mathrm{Full})} - 1\right] + p$
Full: the full model including all the candidate predictor variables
Reduced: the model currently considered, which involves
p unknown parameters
p: the number of unknown parameters (including the constant β0)
E[Cp] = p
A low Cp is desirable, and models with the lowest Cp values are
the best.
To be potentially good, a model should have Cp ≈ p (or smaller).
The model with the lowest Cp is the model with the two
predictors x1 and x2.
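A one-line helper for the Cp formula above (the SSE/MSE numbers in the example call are hypothetical, not the cement values):

```python
def mallows_cp(sse_reduced, mse_full, n, p):
    """Mallows' Cp for a reduced model with p parameters (intercept included)."""
    return sse_reduced / mse_full - (n - 2 * p)

# Hypothetical example: n = 13 observations, reduced model with p = 3 parameters
print(mallows_cp(sse_reduced=58.0, mse_full=6.0, n=13, p=3))   # about 2.67, close to p
```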
8-2 Variable Selection Cement Example
The stepwise regression procedure yields the final stepwise
model with x1 and x2 as predictors.
The best subsets regression procedure yields:
Based on the r2-value and MSE criteria, the "best" model has the
three predictors x1, x2, and x4.
Based on the AICc, BIC, and Cp criteria, the best model has
the two predictors x1 and x2.
Which model should we go with?
That's where the final step, the refining step, comes into play.
In the refining step, we evaluate each of the models identified by
the best subsets and stepwise procedures to see if there is a
reason to select one of the models over the other.
When selecting a final model, don't forget why you are
performing the research to begin with; the reason may make
the choice of the model obvious.
8-2 Variable Selection Cement Example
With VIFs of 18.8 and 18.9 (for x2 and x4), the three-predictor
model exhibits substantial multicollinearity. (Recall that the
predictors x2 and x4 are strongly negatively correlated.)

Unless there is a good scientific reason to go with the larger
model, it makes more sense to go with the smaller, simpler model
containing the two predictors x1 and x2, for which
the VIFs are quite satisfactory (both about 1.1).
Moreover (next page):
8-2 Variable Selection Cement Example
Moreover, for the simpler model containing the two predictors
x1 and x2:
The r2-value (97.9%) is large.
The residual analysis yields no concerns.
