Which analysis to use depends on the types of the independent and dependent variables:

                                    Independent Variable          Dependent Variable
ANOVA                               Categorical                   Continuous
Regression                          Continuous or Categorical     Continuous
Categorical Analysis
  (Contingency Table Analysis)      Categorical                   Categorical
ANOVA                         REGRESSION
Dependent variable            Dependent variable, or Response variable, or Outcome variable
Independent variables         Independent variables, or Predictor variables
The equation of a line: Y = b + aX

Where
  b = the y-intercept
  a = the slope
[Figure: two blank coordinate grids (x-axis and y-axis) for plotting lines]
o Let's review the method covered in high school algebra for determining
the line that falls through 2 points: (-1, -1) & (5, 1)

First, we compute the slope of the line:

slope = \frac{y_2 - y_1}{x_2 - x_1} = \frac{1 - (-1)}{5 - (-1)} = \frac{2}{6} = .333

Points on this line:

X:   -1     0      1      2     3     4     5
Y:   -1   -.667  -.333    0   .333  .667    1

Next, we use the slope and the point (5, 1) to solve for the y-intercept, y_0:

slope = \frac{1 - y_0}{5 - 0}

.333(5) = 1 - y_0
y_0 = 1 - 1.667
y_0 = -.667

Finally, we use the slope and the intercept to write the equation of the
line through the 2 points:

Y = b + aX
Y = -.667 + .333X
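As a quick cross-check, here is a minimal Python sketch of the same computation (the points and numbers are from the example above):

    # Find the line through (-1, -1) and (5, 1), as in the worked example.
    x1, y1 = -1, -1
    x2, y2 = 5, 1

    slope = (y2 - y1) / (x2 - x1)   # (1 - (-1)) / (5 - (-1)) = 2/6 ~ .333
    intercept = y1 - slope * x1     # solve y = b + a*x for b using one point

    print(f"Y = {intercept:.3f} + {slope:.3f} X")   # Y = -0.667 + 0.333 X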
e_i = Y_i - \hat{Y}_i
[Figure: scatterplot of the five observed data points (1,1), (2,1), (3,2), (4,2), (5,4)]
We notice that a straight line can be drawn that goes directly through
three of the five observed data points. Let's use this line as our best
guess line:

\tilde{Y} = -1 + X

X    Y    \tilde{Y}    e
1    1        0         1
2    1        1         0
3    2        2         0
4    2        3        -1
5    4        4         0
o The least squares criteria:

\sum (y_i - \hat{y}_i) = 0
\sum (y_i - \hat{y}_i)^2 = \text{minimum}

and

\frac{\partial\, SSE}{\partial\, b_1} = 0

o Solving gives the least squares estimates:

b_1 = \frac{SS_{XY}}{SS_{XX}} \qquad b_0 = \bar{Y} - b_1 \bar{X}

Where

SS_{XX} = \sum_{i=1}^{n} (X_i - \bar{X})^2

SS_{XY} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
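A minimal Python sketch of these formulas, applied to the five data points from the example:

    # Least squares estimates from the definitional formulas.
    X = [1, 2, 3, 4, 5]
    Y = [1, 1, 2, 2, 4]

    n = len(X)
    x_bar = sum(X) / n
    y_bar = sum(Y) / n

    ss_xx = sum((x - x_bar) ** 2 for x in X)                      # = 10
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))  # = 7

    b1 = ss_xy / ss_xx       # = .7
    b0 = y_bar - b1 * x_bar  # = 2 - .7(3) = -.1

    print(f"Y-hat = {b0:.1f} + {b1:.1f} X")   # Y-hat = -0.1 + 0.7 X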
Now, let's return to our example and examine the least squares regression
line.

[Figure: the five data points with both the eyeball line and the LS line drawn]

Let's compare the least squares regression line to our eyeball regression line:

\tilde{Y} = -1 + X \qquad \hat{Y} = -.1 + .7X
     Data        Eyeball (\tilde{Y} = -1 + X)       Least Squares (\hat{Y} = -.1 + .7X)
X    Y       \tilde{Y}   \tilde{e}   \tilde{e}^2      \hat{Y}     e       e^2
1    1           0           1           1              0.6       .4      .16
2    1           1           0           0              1.3      -.3      .09
3    2           2           0           0              2.0       0       0
4    2           3          -1           1              2.7      -.7      .49
5    4           4           0           0              3.4       .6      .36
Sum                          0           2                         0      1.1
o For both models, we satisfy the condition that the residuals sum to zero
o But the least squares regression line produces the model with the smallest
sum of squared residuals (verified in the sketch below)

Note that other regression lines are possible:
o We could minimize the sum of the absolute values of the residuals
o We could minimize the shortest (perpendicular) distances to the regression line
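A short Python sketch confirming the comparison above: both candidate lines have residuals that sum to zero, but the least squares line has the smaller sum of squared residuals.

    # Compare the eyeball line and the least squares line on the example data.
    X = [1, 2, 3, 4, 5]
    Y = [1, 1, 2, 2, 4]

    for label, b0, b1 in [("eyeball", -1.0, 1.0), ("least squares", -0.1, 0.7)]:
        e = [y - (b0 + b1 * x) for x, y in zip(X, Y)]
        sse = sum(r ** 2 for r in e)
        print(f"{label}: sum(e) = {sum(e):.1f}, sum(e^2) = {sse:.2f}")
    # eyeball:        sum(e) = 0.0, sum(e^2) = 2.00
    # least squares:  sum(e) = 0.0 (to rounding), sum(e^2) = 1.10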
b_0 = \bar{Y} - b_1 \bar{X}

o Estimating the error variance:

Var(\varepsilon) = \frac{\sum (\varepsilon_i - \bar{\varepsilon})^2}{N-2}

But we know \bar{\varepsilon} = 0, so

Var(\varepsilon_i) = \frac{\sum (\varepsilon_i)^2}{N-2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{N-2} = \frac{SS_{Residual}}{N-2}
Var(\varepsilon) = \frac{\sum (\varepsilon_i)^2}{N-2} = \frac{SS_{resid}}{N - \#\text{ of parameters}} = MS_{resid}

And we are justified in using MS_{resid} as the error term for tests
involving the regression model.

o Interpreting MS_{resid}:

Residuals measure deviation from the regression line (the predicted
values)

The variance of the residuals captures the average squared deviation
from the regression line

So we can interpret MS_{resid} as a measure of average (squared) deviation
from the regression line. SPSS labels \sqrt{MS_{resid}} the standard error
of the estimate
Now that we have an estimate of the error variance, we can proceed with
statistical tests of the model parameters
We can perform a t-test using our familiar t-test formula
t \sim \frac{\text{estimate}}{\text{standard error of the estimate}}
o We know how to calculate the estimates of the slope and the intercept.
All we need are standard errors of the estimates
o The test and confidence interval for b_1:

\text{standard error}(b_1) = \sqrt{\frac{MS_{resid}}{\sum (X_i - \bar{X})^2}}

b_1 \pm t_{\alpha/2,\, df} \cdot \sqrt{\frac{MS_{resid}}{\sum (X_i - \bar{X})^2}}

(A worked numeric check follows the conclusions below.)
o Conclusions
If the test is significant, then we conclude that there is a significant
linear relationship between X and Y
For every one-unit change in X, there is a b1 unit change in Y
If the test is not significant, then there is no significant linear
relationship between X and Y
Utilizing the linear relationship between X and Y does not
significantly improve our ability to predict Y, compared to using
the grand mean.
There may still exist a significant non-linear relationship between
X and Y
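Here is a minimal Python sketch of the slope test on the five-point example, using the quantities computed earlier (SS_XX = 10, b1 = .7, SS_resid = 1.1):

    from math import sqrt
    from scipy import stats

    n, b1 = 5, 0.7
    ss_xx, ss_resid = 10.0, 1.1

    ms_resid = ss_resid / (n - 2)       # error variance estimate
    se_b1 = sqrt(ms_resid / ss_xx)      # standard error of the slope, ~ .191
    t = b1 / se_b1                      # ~ 3.66
    p = 2 * stats.t.sf(abs(t), n - 2)   # two-tailed p; here p < .05
    print(t, p)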
o Similarly, for the intercept b_0:

\text{standard error}(b_0) = \sqrt{MS_{resid} \cdot \frac{\sum X_i^2}{N \sum (X_i - \bar{X})^2}}
o The SS Total measures the total deviation of the observations from the
grand mean:

SS_T = \sum_{i=1}^{n} (Y_i - \bar{Y})^2

[Figure: the data points with a horizontal line at Mean(Y), showing deviations from the grand mean]
The deviation of the regression best guess (the predicted value) from
the grand mean is the SS Regression:

SS_{Reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2

[Figure: the LS line and the Mean(Y) line, showing deviations of the predicted values from the grand mean]

o The SS Residual is the amount of the total SS that we cannot predict from
the regression model:

SS_{Resid} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2

[Figure: the data points and the LS line, showing deviations of the observed values from the regression line]
Source        SS                                           df                       MS               F
Regression    SS_{Reg} = \sum (\hat{Y}_i - \bar{Y})^2      (# of parameters) - 1    SS_reg / df      MS_reg / MS_resid
Residual      SS_{Resid} = \sum (Y_i - \hat{Y}_i)^2        N - (# of parameters)    SS_resid / df
Total         SS_T = \sum (Y_i - \bar{Y})^2                N - 1

Note: SS_resid = SS_error
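A Python sketch of this partition on the five-point example, using \hat{Y} = -.1 + .7X; note that F here equals the square of the slope t-test above (3.66^2 ~ 13.4):

    # Partition the total SS into regression and residual pieces.
    X = [1, 2, 3, 4, 5]
    Y = [1, 1, 2, 2, 4]
    n = len(X)
    y_bar = sum(Y) / n
    y_hat = [-0.1 + 0.7 * x for x in X]

    ss_total = sum((y - y_bar) ** 2 for y in Y)               # = 6.0
    ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)           # = 4.9
    ss_resid = sum((y - yh) ** 2 for y, yh in zip(Y, y_hat))  # = 1.1

    F = (ss_reg / 1) / (ss_resid / (n - 2))                   # ~ 13.4
    print(ss_total, ss_reg + ss_resid, F)                     # SST = SSReg + SSResid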
o The Regression test examines all of the slope parameters in the model
simultaneously. Do these parameters significantly improve our ability to
predict Y, compared to using the grand mean to predict Y?
H_0: b_1 = b_2 = \ldots = b_k = 0
H_1: \text{Not all } b_j\text{'s} = 0
o For simple linear regression, we only have one slope parameter, so this
test becomes a test of the slope b_1:

H_0: b_1 = 0
H_1: b_1 \neq 0
In other words, for simple linear regression, the Regression F-test will
be identical to the t-test of the b1 parameter
This relationship will not hold for multiple regression, when more
than one predictor is entered into the model
o Recall that the SS Total is directly related to the sample variance of Y:

SS_T = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \qquad Var(Y) = \frac{SS_T}{N-1}
o The SS Regression is the part of the total variability that we can explain
using our regression line
o As a result, we can consider the following ratio, R^2, to be a measure of
the proportion of the sample variance in Y that is explained by X:

R^2 = \frac{SS_{Reg}}{SS_{Total}}

R^2 is analogous to \eta^2 in ANOVA
6. An example
Predicting the amount of damage caused by a fire from the distance of the
fire from the nearest fire station
Distance from Station (Miles)    Fire Damage (Thousands of Dollars)
 3.40                             26.20
 1.80                             17.80
 4.60                             31.30
 2.30                             23.10
 3.10                             27.50
 5.50                             36.00
 0.70                             14.10
 3.00                             22.30
 2.60                             19.60
 4.30                             31.30
 2.10                             24.00
 1.10                             17.30
 6.10                             43.20
 4.80                             36.40
 3.80                             26.10
[Figure: scatterplot of fire damage (dollars) vs. distance from station (miles)]
Variables Entered/Removed(b)

Model    Variables Entered    Variables Removed    Method
1        MILES(a)             .                    Enter

Model Summary

Model    R          R Square    Adjusted R Square    Std. Error of the Estimate
1        .961(a)    .923        .918                 2.31635

(Note: the "Std. Error of the Estimate" is \sqrt{MSE} = \sqrt{5.365} = 2.31635.)

Here is our old friend, the ANOVA table:
ANOVA(b)

Model             Sum of Squares    df    Mean Square    F          Sig.
1   Regression    841.766            1    841.766        156.886    .000(a)
    Residual       69.751           13    5.365
    Total         911.517           14

Coefficients(a)

                  Unstandardized Coefficients     Standardized Coefficients
Model             B          Std. Error           Beta      t         Sig.
1   (Constant)    10.278     1.420                          7.237     .000
    MILES         4.919      .393                 .961      12.525    .000
o From this table, we read that b_0 = 10.278 and b_1 = 4.919. Using this
information, we can write the regression equation:

\hat{Y} = 10.278 + 4.919X
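As a cross-check of the SPSS output, here is a short Python sketch using scipy's linregress on the same data:

    from scipy import stats

    miles = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0,
             2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
    dollars = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3,
               19.6, 31.3, 24.0, 17.3, 43.2, 36.4, 26.1]

    fit = stats.linregress(miles, dollars)
    print(fit.intercept, fit.slope)   # ~ 10.278, 4.919
    print(fit.rvalue ** 2)            # R-squared ~ .923
    print(fit.stderr)                 # SE(b1) ~ .393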
o We can test the significance of the slope:

H_0: b_1 = 0
H_1: b_1 \neq 0
Coefficients(a)

                  Unstandardized Coefficients     Standardized Coefficients
Model             B          Std. Error           Beta      t         Sig.
1   (Constant)    10.278     1.420                          7.237     .000
    MILES         4.919      .393                 .961      12.525    .000

b_1 = 4.919

95% CI for b_1: 4.919 \pm 2.16(.393) \Rightarrow (4.07,\ 5.77)
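The same interval in a minimal Python sketch, using b_1 and its standard error from the table (df = N - 2 = 13):

    from scipy import stats

    b1, se_b1, df = 4.919, 0.393, 13
    t_crit = stats.t.ppf(0.975, df)                   # ~ 2.160
    print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # ~ (4.07, 5.77)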
o We can also use the regression equation to estimate and to predict Y at a
given value of X. For example, what is the average damage caused by (all)
fires that are 5.8 miles from a fire station?
o For a fire 5.8 miles from a station, we substitute X_p = 5.8 into the
regression equation:

\hat{Y} = 10.278 + 4.919(5.8) = 38.81
The difference in these two uses of the regression model lies in the accuracy
(variance) of our estimate of the prediction
Case I: Variance of the estimate of the mean value of Y, \hat{Y}, at X_p
o When we attempt to estimate a mean value, there is one source of
variability: the variability due to the regression line
We know the equation of the regression line:

\hat{Y} = b_0 + b_1 X

\sigma^2_{\hat{Y}} = Var(\hat{Y}) = MSE\left(\frac{1}{N} + \frac{(X_p - \bar{X})^2}{S_{XX}}\right)
o And thus, the equation for the confidence interval of the estimate of the
mean value of Y, \hat{Y}, is:

\hat{Y} \pm t_{\alpha/2,\, N-2}\ \sigma_{\hat{Y}}

\hat{Y} \pm t_{\alpha/2,\, N-2}\sqrt{MSE\left(\frac{1}{N} + \frac{(X_p - \bar{X})^2}{S_{XX}}\right)}

(This is the confidence interval for the mean value of Y.)
Case II: Prediction interval for a single value of Y at X_p

o When we predict a single value, there are two sources of variability: the
variability due to the regression line and the variability of the individual
observations around the regression line
o The variance for the prediction interval of a single value must include
these two forms of variability:
\sigma^2_{Y_i} = \sigma^2 + \sigma^2_{\hat{Y}} = MSE\left(1 + \frac{1}{N} + \frac{(X_p - \bar{X})^2}{S_{XX}}\right)
o And thus, the equation for the prediction interval for a single value of
Y is:

\hat{Y} \pm t_{\alpha/2,\, N-2}\ \sigma_{Y_i}

\hat{Y} \pm t_{\alpha/2,\, N-2}\sqrt{MSE\left(1 + \frac{1}{N} + \frac{(X_p - \bar{X})^2}{S_{XX}}\right)}
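A minimal Python sketch of both intervals; the function name and arguments are illustrative, and the inputs (mse, x_bar, ss_xx) would come from the fitted regression:

    from math import sqrt
    from scipy import stats

    def intervals(y_hat, x_p, x_bar, ss_xx, mse, n, alpha=0.05):
        t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
        se_mean = sqrt(mse * (1 / n + (x_p - x_bar) ** 2 / ss_xx))        # CI, mean of Y
        se_single = sqrt(mse * (1 + 1 / n + (x_p - x_bar) ** 2 / ss_xx))  # PI, single Y
        ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
        pi = (y_hat - t_crit * se_single, y_hat + t_crit * se_single)
        return ci, pi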
o Ask SPSS to save the predicted value and the standard error of the
predicted value when you run the regression
REGRESSION
/MISSING LISTWISE
/DEPENDENT dollars
/METHOD=ENTER miles
/SAVE PRED (pred) SEPRED (sepred) .
DOLLARS    MILES    PRED        SEPRED
26.20      3.40     27.00365     .59993
17.80      1.80     19.13272     .83401
31.30      4.60     32.90685     .79149
23.10      2.30     21.59239     .71122
27.50      3.10     25.52785     .60224
36.00      5.50     37.33425    1.05731
14.10       .70     13.72146    1.17663
22.30      3.00     25.03592     .60810
19.60      2.60     23.06819     .65500
31.30      4.30     31.43105     .71985
24.00      2.10     20.60852     .75662
17.30      1.10     15.68919    1.04439
43.20      6.10     40.28585    1.25871
36.40      4.80     33.89072     .84503
26.10      3.80     28.97139     .63199
.          5.80     38.81005    1.15640
For X_p = 5.8:

\hat{Y} = 38.81 \qquad \sigma_{\hat{Y}} = 1.156

\hat{Y} \pm t_{\alpha/2,\, N-2}\sqrt{\hat{\sigma}^2 + \sigma^2_{\hat{Y}}}
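Plugging in the saved values (MSE = 5.365 from the ANOVA table, t_{.025,\,13} = 2.160), a worked check of the 95% prediction interval at X_p = 5.8:

38.81 \pm 2.160\sqrt{5.365 + (1.156)^2} = 38.81 \pm 5.59 = (33.22,\ 44.40)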
The regression line can be used for prediction and estimation, but not for
extrapolation
In other words, the regression line is only valid for Xs within the range of the
observed Xs
SPSS can be used to graph confidence intervals and prediction intervals
[Figure: two panels showing confidence bands (left) and prediction bands (right) around the regression line for dollars vs. miles; the prediction bands are wider]
o Regress zY on zX
REGRESSION
/DEPENDENT zdollar
/METHOD=ENTER zmiles.
Coefficients(a)

                  Unstandardized Coefficients     Standardized Coefficients
Model             B            Std. Error         Beta      t         Sig.
1   (Constant)    4.131E-04    .074                         .006      .996
    ZMILES        .961         .077               .961      12.525    .000

b_1 = \beta_1 = .961
o Compare to the original (unstandardized) analysis, where SPSS reports the
same value as Beta:

Coefficients

                  Unstandardized Coefficients     Standardized Coefficients
Model             B          Std. Error           Beta      t         Sig.
1   (Constant)    10.278     1.420                          7.237     .000
    MILES         4.919      .393                 .961      12.525    .000

\beta_1 = .961
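A short Python sketch of the standardized analysis: converting both variables to z-scores and refitting reproduces Beta (the intercept vanishes and the slope equals .961):

    import statistics
    from scipy import stats

    miles = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0,
             2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
    dollars = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3,
               19.6, 31.3, 24.0, 17.3, 43.2, 36.4, 26.1]

    def z(values):
        # Convert to z-scores: (value - mean) / sd.
        m, s = statistics.mean(values), statistics.stdev(values)
        return [(v - m) / s for v in values]

    fit = stats.linregress(z(miles), z(dollars))
    print(fit.intercept, fit.slope)   # ~ 0, .961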