Escolar Documentos
Profissional Documentos
Cultura Documentos
Chapter Goals
After completing this chapter, you should be able to:
Calculate and interpret the simple correlation between two variables Determine whether the correlation is significant Calculate and interpret the simple linear regression equation for a set of data Understand the assumptions behind regression analysis
Chapter Goals
(continued)
x y y
x5
(continued)
Weak relationships
x y y
x6
(continued)
x y
x
7
Correlation Coefficient
(continued)
The population correlation coefficient (rho) measures the strength of the association between the variables The sample correlation coefficient r is an estimate of and is used to measure the strength of the linear relationship in the sample observations
Features of
and r
Unit free Range between -1 and 1 The closer to -1, the stronger the negative linear relationship The closer to 1, the stronger the positive linear relationship The closer to 0, the weaker the linear relationship
9
r = -1
y
r = -.6
y
r=0
r = +.3
r = +1
10
r!
( x x)( y y ) [ ( x x ) ][ ( y y ) ]
2 2
r!
where:
n xy x y [n( x 2 ) ( x )2 ][n( y 2 ) ( y )2 ]
r = Sample correlation coefficient n = Sample size x = Value of the independent variable y = Value of the dependent variable
11
Calculation Example
Tree Height y 35 49 27 33 60 21 45 51 7=321 Trunk Diameter x 8 9 7 6 13 7 11 12 7=73 xy 280 441 189 198 780 147 495 612 7=3142 y2 1225 2401 729 1089 3600 441 2025 2601 7=14111 x2 64 81 49 36 169 49 121 144 7=713
12
Calculation Example
Tree Height, y 70
60
(continued)
r!
n xy x y [n( x 2 ) ( x)2 ][n( y 2 ) ( y)2 ] 8(3142) (73)(321) [8(713) (73)2 ][8(14111) (321)2 ]
50
! ! 0.886
40
30
20
10
0 0 2 4 6 8 10 12 14
Trunk Diameter, x
13
Excel Output
Excel Correlation Output Tools / data analysis / correlation
Tree Height Trunk Diameter 1 0.886231 1
t!
r 1 r n2
2
15
t!
r 1 r n2
2
! 4.68
16
d.f. = 8-2 = 6
E/2=.025
E/2=.025
Reject H0
-t /2 -2.4469
Do not reject H0
t /2 2.4469
Reject H0
4.68
17
Dependent variable: the variable we wish to explain Independent variable: the variable used to explain the dependent variable
18
19
No Relationship
20
y!
x 1
Random Error component
21
Linear component
22
y!
x
Slope =
i
Predicted Value of y for xi Intercept =
xi
x
23
y i ! b0 b1x
! !
(y y)
(y (b
b1x))
25
( x x )( y y ) ! (x x)
2
algebraic equivalent:
x y xy
b1 ! n ( x ) 2 2 x n
and
b0 ! y b1 x
26
27
28
29
31
Excel Output
Regression Statistics Multiple R R Square Adjusted R Square Standard Error Observations 0.76211 0.58082 0.52842 41.33032 10
ANOVA
df Regression Residual Total 1 8 9 SS 18934.9348 13665.5652 32600.5000 MS 18934.9348 1708.1957 F 11.0848 Significance F 0.01039
32
Graphical Presentation
House price model: scatter plot and regression line
450 400 House Price ($1000s) 350 300 250 200 150 100 50 0 0 500 1000 1500 2000 2500 3000 Square Feet
Slope = 0.10977
Intercept = 98.248
34
35
The simple regression line always passes through the mean of the y variable and the mean of the x variable The least squares coefficients are unbiased estimates of and
1 0
36
SST !
Total sum of Squares
SSE
Sum of Squares Error
SSR
Sum of Squares Regression
SST ! ( y y )2
where:
SSE ! ( y y )2
SSR ! ( y y )2
(continued)
Measures the variation of the yi values around their mean y SSE = error sum of squares Variation attributable to factors other than the relationship between x and y SSR = regression sum of squares Explained variation attributable to the relationship between x and y
38
(continued)
_ y
_ y
Xi
x
39
Coefficient of Determination, R2
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable The coefficient of determination is also called Rsquared and is denoted as R2
SSR R ! SST
2
where
0 eR e1
40
Coefficient of Determination,(continued) R2
Coefficient of determination
SSR sum of squares explained by regression R ! ! SST total sum of squares
2
R !r
where:
R2 = 1 y
R2
= +1
x
42
x y
R2
=0
Excel Output
Regression Statistics Multiple R R Square Adjusted R Square Standard Error Observations 0.76211 0.58082 0.52842 41.33032 10
ANOVA
df Regression Residual Total 1 8 9 18934.9348 13665.5652 32600.5000
45
SSE sI ! n k 1
Where SSE = Sum of squares error n = Sample size k = number of independent variables in the model
46
sb1 !
!
2 2
s ( x) x n
2
(x x)
where:
Excel Output
Regression Statistics Multiple R R Square Adjusted R Square Standard Error Observations 0.76211 0.58082 0.52842 41.33032 10
s ! 41.33032
sb1 ! 0.03297
SS MS 18934.9348 1708.1957 F 11.0848 Significance F 0.01039 18934.9348 13665.5652 32600.5000
ANOVA
df Regression Residual Total 1 8 9
48
small sI
x y
small sb1
large s I
large sb1
49
=0 1{0
Test statistic
b1 t! sb1
where:
= Hypothesized slope
d.f. ! n 2
The slope of this model is 0.1098 Does square footage of the house affect its sales price?
b1
Standard Error 58.03348 0.03297
sb1
t Stat 1.69296 3.32938
t
P-value 0.12892 0.01039
d.f. = 10-2 = 8
E/2=.025
E/2=.025
Reject H0
-t /2 -2.3060
Do not reject H0
0 t /2 2.3060 3.329
Reject H
Decision: Reject H0 Conclusion: There is sufficient evidence that square footage affects 52 house price
b1 s t E/2 sb1
Excel Printout for House Prices:
Coefficients Intercept Square Feet 98.24833 0.10977 Standard Error 58.03348 0.03297 t Stat 1.69296 3.32938 P-value 0.12892 0.01039
d.f. = n - 2
At 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858)
53
Since the units of the house price variable is $1000s, we are 95% confident that the average impact on sales price is between $33.70 and $185.80 per square foot of house size
This 95% confidence interval does not include 0. Conclusion: There is a significant relationship between house price and square feet at the .05 level of significance
54
y s t E/2s
1 (x p x) 2 n (x x)
55
y s t E/2s
1 (x p x) 1 2 n (x x)
This extra term adds to the interval width to reflect the added uncertainty for an individual case
56
xp
x
57
(continued)
y s t /2s
1 ! 317.85 s 37.12 2 n (x x)
(x p x)2
The confidence interval endpoints are 280.66 -- 354.90, or from $280,660 -- $354,900
60
y s t /2s
1 ! 317.85 s 102.28 1 2 n (x x)
(xp x)2
The prediction interval endpoints are 215.50 -- 420.07, or from $215,500 -- $420,070
61
62
Confidence Interval Estimate for E(y)|xp Prediction Interval Estimate for y|xp
63
Residual Analysis
Purposes
Examine for linearity assumption Examine for constant variance for all levels of x Evaluate normal distribution assumption
x
residuals residuals
Not Linear
Linear
65
x
residuals
x
residuals
x Non-constant variance
Constant variance
66
Excel Output
RESIDUAL OUTPUT Predicted House Price 1 2 3 4 5 6 7 8 9 10 251.92316 273.87671 284.85348 304.06284 218.99284 268.38832 356.20251 367.17929 254.6674 284.85348
1000
2000
3000