STATS 330: Lecture 9

Diagnostics

7.08.2014
Office hours

- Lecturers

  Name            Office    Email (@auckland.ac.nz)  Day  Time
  Steffen Klaere  303.219   s.klaere                 Thu  10:00-12:00
  Alan Lee        303S.265  aj.lee                   Tue  10:30-12:00
                                                     Thu  10:30-12:00

- Tutors (Room 303.326)

  Name           Username (@aucklanduni.ac.nz)  Day  Time
  Savannah Post  spos008                        Mon  10:00-12:00
                                                Thu  14:30-15:30
  Leshun Xu      lxu472                         Tue  13:00-14:00
                                                Wed  13:00-14:00
                                                Thu  13:00-14:00
  Hongbin Guo    hguo033                        Tue  11:00-12:00
                                                Wed  14:00-16:00
                                                Thu  10:00-11:00
                                                Fri  11:00-12:00
R-hint of the day
Plotting a fitted landscape
> # Simulate two (nearly) uncorrelated predictors and a planar response
> x1 <- rnorm(1000)
> x2 <- rnorm(1000)
> y <- 1+2*x1-x2+rnorm(1000)
> fit.lm <- lm(y~x1+x2)
> # Evaluate the fitted plane on a 30 x 30 grid
> x1seq <- seq(-5,5,length.out=30)
> x2seq <- seq(-5,5,length.out=30)
> z <- outer(X=x1seq,Y=x2seq,FUN=function(a,b)
+   {predict(fit.lm,newdata=data.frame(x1=a,x2=b))})
> # Colour each facet by the sum of its four corner heights
> jet.colors <- colorRampPalette(c("orange","blue"))
> color <- jet.colors(30)
> zfacet <- z[-1,-1]+z[-1,-30]+z[-30,-1]+z[-30,-30]
> facetcol <- cut(zfacet,30)
> res <- persp(x1seq,x2seq,z,theta=-50,col=color[facetcol],
+   expand=.7,xlab="x1",ylab="x2",zlab="z",
+   main=paste("Dataset A, correlation",round(cor(x1,x2),3)))
> # Project the observations into the perspective view and add them
> no.points <- trans3d(x1,x2,y,pmat=res)
> points(no.points,pch=21,bg="tomato")
R-hint of the day

[Figure: perspective plot of the fitted plane over x1 and x2 with the observations overlaid, titled "Dataset A, correlation 0.01".]
Aims of the next four lectures

- To give you an overview of the modelling cycle.
- To discuss diagnostic procedures in detail.
The modelling cycle

- We have seen that the regression model describes rather specialised forms of data:
  - Data are planar,
  - Scatter is uniform over the plane.
- We have looked at some plots that help us decide if the data are suitable for regression modelling:
  - pairs
  - reg3d
  - coplot
Residual analysis

- Another approach is to fit the model and examine the residuals.
- If the model is appropriate, the residuals show no pattern.
- A pattern in the residuals usually indicates that the model is not appropriate.
- If this is the case we have two options:
  1. Select another form of model, e.g. non-linear regression (see other courses);
  2. Transform the data so that the regression model fits the transformed data (see next slide).
The Modelling Cycle

PLOTS and THEORY -> Choose Model -> Fit Model -> Examine Residuals
  - Bad fit:  Transform, then fit and examine again
  - Good fit: USE MODEL
What constitutes a bad fit?

- Non-planar data: seriously affects meaning and accuracy of estimated coefficients
- Outliers in the data: seriously affect meaning and accuracy of estimated coefficients
- Non-constant scatter: affects standard error of estimate
- Errors not independent: affects standard error of estimate
- Errors not normally distributed: affects standard error of estimate
Diagnostic steps

We test for <assumption> using <diagnostic>:

Planarity: ...
Constant Variance: ...
Outliers: ...
Independence: ...
Normality of Errors: ...


Detecting non-planar data

- We can diagnose non-planar data (non-linearity) by fitting the model, and
  - plotting residuals versus fitted values;
  - plotting residuals against explanatory variables;
  - fitting additive models.
- In each case, a curved plot indicates non-planar data.
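
The residual plots listed above can be sketched as follows. This is a minimal sketch rather than the course's own code: it assumes R's built-in `trees` data (columns `Girth`, `Height`, `Volume`) as a stand-in for the cherry tree data, and the object names `fit` and `res` are illustrative.

```r
# Assumption: the built-in `trees` data stands in for the cherry tree data.
fit <- lm(Volume ~ Girth + Height, data = trees)
res <- residuals(fit)

op <- par(mfrow = c(1, 3))            # three panels side by side
# Residuals against fitted values, with a lowess curve to show any trend
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(fitted(fit), res), col = "red")
# Residuals against each explanatory variable
for (v in c("Girth", "Height")) {
  plot(trees[[v]], res, xlab = v, ylab = "Residuals")
  abline(h = 0, lty = 2)
  lines(lowess(trees[[v]], res), col = "red")
}
par(op)                               # restore the previous layout
```

A lowess curve that bends away from the zero line in any panel is the curved pattern to look for.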


Plotting residuals vs. fitted values

> data(cherry.df)
> cherry.lm <- lm(volume~diameter+height,data=cherry.df)
> plot(cherry.lm,which=1)

which=1: selects the plot of residuals vs. fitted values


Plotting residuals vs. fitted values

[Figure: "Residuals vs Fitted" plot for lm(volume ~ diameter + height); x-axis: Fitted values, y-axis: Residuals; observations 2, 18 and 31 are labelled.]
Additive models

- These are models of the form

  $Y = g_1(x_1) + g_2(x_2) + \cdots + g_k(x_k) + \varepsilon$

  where $g_1, \ldots, g_k$ are transformations.
- Fitted using the gam function in R.
- The transformations are estimated by the software.
- Use the function to suggest good transformations.


Example: Cherry trees

> library(mgcv)
> cherry.gam <- gam(volume~s(diameter)+s(height),
+                   data=cherry.df)
> plot(cherry.gam,residuals=TRUE,pages=1)
Example: Cherry trees

[Figure: estimated smooths from cherry.gam with partial residuals; left panel: s(diameter,2.69) against diameter; right panel: s(height,1) against height.]
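
The second number in a panel label such as s(diameter,2.69) is the estimated degrees of freedom (edf) of that smooth; a value close to 1 corresponds to a straight line. A hedged sketch of reading these values directly, assuming the built-in `trees` data (columns `Girth`, `Height`, `Volume`) as a stand-in for cherry.df and using the illustrative name `fit.gam`:

```r
library(mgcv)

# Assumption: built-in `trees` stands in for cherry.df.
fit.gam <- gam(Volume ~ s(Girth) + s(Height), data = trees)

# Estimated degrees of freedom of each smooth term
edf <- setNames(as.numeric(summary(fit.gam)$edf),
                c("s(Girth)", "s(Height)"))
print(round(edf, 2))
```

A term whose edf is clearly above 1 is a candidate for a transformation or a polynomial or spline term.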
Fitting polynomials

- To fit a model $y = \beta_0 + \beta_1 x + \beta_2 x^2$, use

  y~poly(x,2)

- To fit a model $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$, use

  y~poly(x,3)

  etc.
Orthogonal polynomials

- The model fitted by y~poly(x,2) is of the form

  $Y = \beta_0 + \beta_1 p_1(x) + \beta_2 p_2(x)$

  where

  $p_1$: polynomial of degree 1, i.e. of the form $a_0 + a_1 x$

  $p_2$: polynomial of degree 2, i.e. of the form $b_0 + b_1 x + b_2 x^2$.

- $p_1$, $p_2$ are chosen to be uncorrelated (best possible estimation).
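
The effect of the orthogonal parameterisation can be checked numerically. The following is a sketch under an assumption: the built-in `trees` data (columns `Girth`, `Height`, `Volume`) stands in for cherry.df, and the names `P`, `fit.orth` and `fit.raw` are illustrative.

```r
# Assumption: built-in `trees` stands in for cherry.df.
x <- trees$Girth

# Raw powers are very strongly correlated ...
cor(x, x^2)
# ... whereas the columns produced by poly() are orthonormal
P <- poly(x, 2)
round(crossprod(P), 10)     # 2 x 2 identity matrix

# Both parameterisations describe the same fitted surface
fit.orth <- lm(Volume ~ poly(Girth, 2) + Height, data = trees)
fit.raw  <- lm(Volume ~ Girth + I(Girth^2) + Height, data = trees)
all.equal(fitted(fit.orth), fitted(fit.raw))
```

Only the coefficients and their standard errors differ between the two fits; the fitted values, residual standard error and R-squared agree.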


Adding a quadratic term: Cherry trees

Call:
lm(formula = volume ~ poly(diameter, 2) + height,
data = cherry.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.56553 6.72218 0.233 0.817603
poly(diameter, 2)1 80.25223 3.07346 26.111 < 2e-16 ***
poly(diameter, 2)2 15.39923 2.63157 5.852 3.13e-06 ***
height 0.37639 0.08823 4.266 0.000218 ***
---
Residual standard error: 2.625 on 27 degrees of freedom
Multiple R-squared: 0.9771, Adjusted R-squared: 0.9745
F-statistic: 383.2 on 3 and 27 DF, p-value: < 2.2e-16
Quadratic equation

Call:
lm(formula = volume ~ diameter + I(diameter^2) + height,
data = cherry.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.92041 10.07911 -0.984 0.333729
diameter -2.88508 1.30985 -2.203 0.036343 *
I(diameter^2) 0.26862 0.04590 5.852 3.13e-06 ***
height 0.37639 0.08823 4.266 0.000218 ***
---
Residual standard error: 2.625 on 27 degrees of freedom
Multiple R-squared: 0.9771, Adjusted R-squared: 0.9745
F-statistic: 383.2 on 3 and 27 DF, p-value: < 2.2e-16
Quadratic equation

$\text{volume} \approx -9.92 - 2.89\,\text{diameter} + 0.27\,\text{diameter}^2 + 0.38\,\text{height}$

[Figure: fitted quadratic surface of volume over diameter and height.]
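
As a worked example of using the fitted quadratic, the sketch below plugs a hypothetical tree into the equation. The coefficients are copied from the lm(volume ~ diameter + I(diameter^2) + height) summary; the diameter and height values are made up for illustration.

```r
# Coefficients copied from the quadratic model summary
b <- c(intercept = -9.92041, diameter = -2.88508,
       diameter2 = 0.26862, height = 0.37639)

# Hypothetical tree: diameter 12, height 75 (illustrative values only)
d <- 12
h <- 75
vol.hat <- b[["intercept"]] + b[["diameter"]] * d +
           b[["diameter2"]] * d^2 + b[["height"]] * h
vol.hat   # about 22.37
```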
Splines

- Splines are an alternative to polynomials: they are piecewise cubics which join smoothly at knots.
- They give a more flexible fit to the data.
- Values at one point are not affected by values at distant points, unlike polynomials.
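
The locality property comes from the B-spline basis, in which each basis function is non-zero only over a few neighbouring knot intervals. A minimal sketch, assuming an illustrative grid on [0, 1] with four equally spaced interior knots (these values are not taken from the lecture):

```r
library(splines)

# Illustrative setup: cubic B-splines on [0, 1] with 4 interior knots
x <- seq(0, 1, length.out = 101)
B <- bs(x, knots = c(0.2, 0.4, 0.6, 0.8), intercept = TRUE)

ncol(B)                      # 4 knots + degree 3 + intercept = 8 basis functions
max(rowSums(B > 1e-12))      # largest number of basis functions active at any x
matplot(x, B, type = "l", lty = 1, ylab = "y")
```

Because only a handful of basis functions are active at any given x, changing the data in one region leaves the fit far away from it essentially unchanged.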
Example with 4 knots

[Figure: example spline with 4 knots, y against x on the interval 0 to 1.]
Cherry splines

Call:
lm(formula = volume ~ bs(diameter, knots = knot.points) + height,
data = cherry.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.3679 7.4856 -2.187 0.03921 *
bs(diameter, knots = knot.points)1 0.1941 7.9374 0.024 0.98070
bs(diameter, knots = knot.points)2 5.5744 3.1704 1.758 0.09201 .
bs(diameter, knots = knot.points)3 10.7976 3.9798 2.713 0.01240 *
bs(diameter, knots = knot.points)4 31.4053 5.5545 5.654 9.35e-06 ***
bs(diameter, knots = knot.points)5 42.2665 6.1297 6.895 4.97e-07 ***
bs(diameter, knots = knot.points)6 58.6454 4.2781 13.708 1.49e-12 ***
height 0.3970 0.1050 3.780 0.00097 ***
---
Residual standard error: 2.8 on 23 degrees of freedom
Multiple R-squared: 0.9778, Adjusted R-squared: 0.971
F-statistic: 144.4 on 7 and 23 DF, p-value: < 2.2e-16
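
The output does not show how knot.points was defined. A hedged sketch of a comparable fit, assuming the built-in `trees` data (columns `Girth`, `Height`, `Volume`) as a stand-in for cherry.df and, purely as an assumption, three knots at the quartiles of diameter:

```r
library(splines)

# Assumptions: `trees` stands in for cherry.df, and the knots are placed at
# the quartiles of diameter (the source does not show how knot.points was set).
knot.points <- quantile(trees$Girth, probs = c(0.25, 0.50, 0.75))

fit.bs <- lm(Volume ~ bs(Girth, knots = knot.points) + Height, data = trees)

length(coef(fit.bs))   # intercept + 6 spline terms + Height = 8
df.residual(fit.bs)    # 31 observations - 8 coefficients = 23
```

Three interior knots with cubic pieces give six spline coefficients, which matches the six bs(...) rows in the table; the estimates themselves depend on where the knots are placed.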
Cherry splines

[Figure: basis for quadratic splines, y against x on the interval 0 to 1.]

Cherry splines

[Figure: basis for cubic splines, y against x on the interval 0 to 1.]
Cherry splines

[Figure: Volume against Diameter with the fitted polynomial and spline curves overlaid (legend: polynomial, splines).]
Example: Tyre abrasion data

- Data collected in an experiment to study the abrasion resistance of tyres
- Variables are
  - Hardness: Hardness of rubber
  - Tensile: Tensile strength of rubber
  - Abrasion Loss: Amount of rubber worn away in a standard test (response)
Tyre abrasion data

Call:
lm(formula = abloss ~ hardness + tensile, data = rubber.df)
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 885.1611 61.7516 14.334 3.84e-14 ***
hardness -6.5708 0.5832 -11.267 1.03e-11 ***
tensile -1.3743 0.1943 -7.073 1.32e-07 ***
---
Residual standard error: 36.49 on 27 degrees of freedom
Multiple R-squared: 0.8402, Adjusted R-squared: 0.8284
F-statistic: 71 on 2 and 27 DF, p-value: 1.767e-11
Tyre abrasion data

- We will use this example to illustrate all the methods we have discussed so far for checking whether the data are planar, i.e. scattered about a flat regression plane:
  - Pairs plot
  - Spinning plot
  - Coplot
  - Residual vs. fitted value plot
  - Fitting GAMs
Pairs plot: not very informative

[Figure: pairs plot of hardness, tensile and abloss; correlation values shown in the panels: 0.30, 0.74, 0.30.]
Spinning plot: hint of a kink

Coplot: suggestion of non-linearity

[Figure: coplot of abloss against tensile, given hardness.]
Residuals vs. fitted values: weak suggestion of non-planarity

[Figure: "Residuals vs Fitted" plot for the tyre abrasion model; x-axis: Fitted values, y-axis: Residuals.]
GAMs: quite strong indication of non-planarity

hardness looks okay, but tensile needs transformation.

[Figure: estimated smooths from the GAM; left panel: s(tensile,5.42) against tensile; right panel: s(hardness,1) against hardness.]
Fitting a fourth-degree polynomial

> rubber.poly <- lm(abloss~hardness+tensile+I(tensile^2)
+                   +I(tensile^3)+I(tensile^4),data=rubber.df)
> summary(rubber.poly)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.862e+04 4.177e+03 -4.458 0.000165 ***
hardness -6.261e+00 4.124e-01 -15.182 8.35e-14 ***
tensile 4.414e+02 9.836e+01 4.487 0.000153 ***
I(tensile^2) -3.693e+00 8.546e-01 -4.321 0.000233 ***
I(tensile^3) 1.342e-02 3.246e-03 4.133 0.000377 ***
I(tensile^4) -1.794e-05 4.553e-06 -3.940 0.000613 ***
---
Residual standard error: 23.25 on 24 degrees of freedom
Multiple R-squared: 0.9423, Adjusted R-squared: 0.9303
F-statistic: 78.46 on 5 and 24 DF, p-value: 4.504e-14
Fitting splines

> library(splines)   # provides bs()
> rubber.bs <- lm(abloss~hardness+bs(tensile,df=4),
+                 data=rubber.df)
> summary(rubber.bs)
--
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 612.1556 43.0348 14.225 3.43e-13 ***
hardness -6.1914 0.4139 -14.959 1.15e-13 ***
bs(tensile, df = 4)1 195.5549 40.6339 4.813 6.69e-05 ***
bs(tensile, df = 4)2 -148.3497 38.6717 -3.836 0.000796 ***
bs(tensile, df = 4)3 -24.2971 37.7010 -0.644 0.525385
bs(tensile, df = 4)4 -61.0593 25.4829 -2.396 0.024720 *
---
Residual standard error: 23.36 on 24 degrees of freedom
Multiple R-squared: 0.9418, Adjusted R-squared: 0.9297
F-statistic: 77.7 on 5 and 24 DF, p-value: 5.021e-14
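
The planar, quartic and spline fits can be put side by side by comparing their residual standard errors. This is a sketch under assumptions: it uses the `Rubber` data from the MASS package (columns `loss`, `hard`, `tens`) as a stand-in for rubber.df, and the model names are illustrative.

```r
library(MASS)      # provides the Rubber data (loss, hard, tens)
library(splines)

# Assumption: MASS::Rubber stands in for rubber.df.
fit.plane <- lm(loss ~ hard + tens, data = Rubber)
# poly(tens, 4) spans the same quartic as the raw powers, but is numerically stabler
fit.poly  <- lm(loss ~ hard + poly(tens, 4), data = Rubber)
fit.bs    <- lm(loss ~ hard + bs(tens, df = 4), data = Rubber)

# Residual standard error and residual degrees of freedom of each fit
sapply(list(plane = fit.plane, quartic = fit.poly, spline = fit.bs),
       function(m) c(sigma = sigma(m), df = df.residual(m)))
```

Both curved fits spend three more degrees of freedom on tensile than the planar model, so a clearly smaller residual standard error is what justifies the extra terms.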
Diagnostic steps

We test for <assumption> using <diagnostic>:

Planarity: Residuals vs. fitted values, residuals vs. covariates, added variable plots, GAM plots
Constant Variance: ...
Outliers: ...
Independence: ...
Normality of Errors: ...


http://xkcd.com/1252/