
INFERENCE FOR REGRESSION PART 2

Topics Outline
Analysis of Variance (ANOVA)
F Test for the Slope
Calculating Confidence and Prediction Intervals
Normal Probability Plots
Analysis of Variance (ANOVA)
Question: If the least squares fit is the best fit, how good is it?
The answer to this question depends on the variability in the values of the response variable,
that is, on the deviations of the observed values $y_i$ from their mean $\bar{y}$:

Figure 1 Decomposition of total variation


The total variation $y_i - \bar{y}$ for an observed $y_i$ can be decomposed into two parts:

$$\underbrace{y_i - \bar{y}}_{\text{total variation}} \;=\; \underbrace{\hat{y}_i - \bar{y}}_{\text{variation explained by the regression line}} \;+\; \underbrace{y_i - \hat{y}_i}_{\text{unexplained variation}}$$

If we square these deviations and add them up, we obtain the following three sources of
variability:

$$\underbrace{\sum_{i=1}^{n}(y_i - \bar{y})^2}_{\text{Total Sum of Squares, } SST} \;=\; \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{\text{Regression Sum of Squares, } SSR} \;+\; \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{\text{Error Sum of Squares, } SSE}$$
If we divide SST by $n-1$, we get the sample variance of the observations $y_1, y_2, \ldots, y_n$:

$$s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{SST}{n-1}$$
So, the total sum of squares SST really is a measure of total variation.
It has n - 1 degrees of freedom.
The regression sum of squares SSR represents variation due to the relationship between x and y.
It has 1 degree of freedom.
The error sum of squares SSE measures the amount of variability in the response variable due
to factors other than the relationship between x and y.
It has n - 2 degrees of freedom.
For the car plant electricity usage example (see the Excel output in Example 1 below),

$$\underbrace{1.5115}_{SST} \;=\; \underbrace{1.2124}_{SSR} \;+\; \underbrace{0.2991}_{SSE}$$
By themselves, SSR, SSE, and SST provide little that can be directly interpreted.
However, a simple ratio of the regression sum of squares SSR to the total sum of squares SST
provides a measure of the goodness of fit for the estimated regression equation. This ratio is the
coefficient of determination:
$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
The coefficient of determination is the proportion of the total sum of squares that can be
explained by the sum of squares due to regression. In other words, $r^2$ measures the proportion of
variation in the response variable y that can be explained by y's linear dependence on x in the
regression model.
For our data,

$$r^2 = \frac{SSR}{SST} = \frac{1.2124}{1.5115} = 0.802$$

This high proportion of explained variation indicates that the estimated regression equation
provides a good fit and can be very useful for predictions of electricity usage.
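To see how these quantities arise from the data, here is a minimal Python sketch (an illustration, not part of the original Excel analysis) that fits the least squares line to the car plant data from Example 1 and computes SST, SSR, SSE, and r²; it assumes numpy is installed.

```python
import numpy as np

# Car plant data from Example 1: production (x) and electricity usage (y)
x = np.array([4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20])
y = np.array([2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53])

# Least squares fit: y_hat = a + b*x (np.polyfit returns slope first)
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

SST = np.sum((y - y.mean()) ** 2)       # total variation
SSR = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the regression line
SSE = np.sum((y - y_hat) ** 2)          # unexplained variation

print(SST, SSR, SSE)   # about 1.5115, 1.2124, 0.2991
print(SSR / SST)       # coefficient of determination, about 0.802
```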

If we divide the sums of squares for regression and error by their degrees of freedom,
we obtain the regression mean square MSR and the error mean square MSE:

$$MSR = \frac{SSR}{1} \;\; \text{(variance due to regression)} \qquad\qquad MSE = \frac{SSE}{n-2} \;\; \text{(variance due to error)}$$

Taking the square root of the variance due to error we obtain the regression standard error,
or standard error of estimate:

$$s_e = \sqrt{MSE} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}}$$

Note: Recall that $s_e$ is the standard deviation of the residuals and it estimates $\sigma$,
the standard deviation of the errors in the model.

For our example,

$$MSR = \frac{SSR}{1} = \frac{1.2124}{1} = 1.2124 \qquad\qquad MSE = \frac{SSE}{n-2} = \frac{0.2991}{12-2} = 0.02991$$
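A quick check of these mean squares, using the sums of squares from the ANOVA table in Example 1 (a small Python sketch):

```python
import math

SSR, SSE, n = 1.212382, 0.299110, 12   # from the ANOVA table in Example 1

MSR = SSR / 1          # variance due to regression (1 degree of freedom)
MSE = SSE / (n - 2)    # variance due to error (n - 2 degrees of freedom)
se = math.sqrt(MSE)    # regression standard error (standard error of estimate)

print(MSR, MSE, se)    # about 1.2124, 0.02991, 0.1729
```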

F Test for the Slope


The ratio of the mean squares provides the F-statistic:

$$F = \frac{MSR}{MSE}$$

For our example,

$$F = \frac{MSR}{MSE} = \frac{1.2124}{0.02991} = 40.53$$

It can be proved that the F-statistic follows an F-distribution with 1 and (n - 2) degrees of
freedom and can be used to test the hypothesis for a linear relationship between x and y.


Recall that $\beta$ represents the slope of the true unknown regression line

$$E(y) = \alpha + \beta x$$

The null and alternative hypotheses for the slope $\beta$ are stated as follows:

$$H_0:\ \beta = 0 \qquad\qquad H_a:\ \beta \neq 0$$
If the null hypothesis is rejected, we would conclude that there is evidence of a linear relationship.
For our example, the corresponding F distribution has df1 = 1 and df2 = n - 2 = 12 - 2 = 10
degrees of freedom. To calculate the P-value associated with the value F = 40.53 of the test
statistic, Excel can be used as follows:
P-value = FDIST(test statistic,df1,df2) = FDIST(40.53,1,10) = 0.000082
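Outside Excel, the same upper-tail probability can be obtained from the F distribution in other software; for instance, a small Python sketch using SciPy (assumed to be installed):

```python
from scipy import stats

F = 40.53           # MSR / MSE
df1, df2 = 1, 10    # 1 and n - 2 degrees of freedom

p_value = stats.f.sf(F, df1, df2)   # upper-tail area, same quantity as Excel's FDIST
print(p_value)                      # about 0.000082
```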

Figure 2 Testing for significance of slope using F distribution with 1 and 10 degrees of freedom
With a P-value this small, we reject the null hypothesis and conclude that there is a significant
linear relationship between the electricity usage and the production levels.
Notice that the P-value = 0.000082 for the F test of the slope is the same as the P-value for the
t test of the slope performed earlier. Moreover, it can be shown that the square of a t distribution
with n - 2 degrees of freedom equals the F distribution with 1 and n - 2 degrees of freedom:

$$t_{n-2}^2 = F_{1,\,n-2}$$

For our example,

$$t = \frac{b}{SE_b} = \frac{0.4988301}{0.078352} = 6.3665267 \qquad \text{and} \qquad t^2 = (6.3665267)^2 \approx 40.53$$

With only one explanatory variable, the F test will provide the same conclusion as the t test.
But with more than one explanatory variable, only the F test can be used to test for an overall
significant relationship.

Example 1 (Car plant electricity usage)


The manager of a car plant wishes to investigate how the plant's electricity usage depends upon
the plant's production, based on the data for each month of the previous year:
Month        Production, x ($ million)    Electricity usage, y (million kWh)
January           4.51                          2.48
February          3.58                          2.26
March             4.31                          2.47
April             5.06                          2.77
May               5.64                          2.99
June              4.99                          3.05
July              5.29                          3.18
August            5.83                          3.46
September         4.70                          3.03
October           5.61                          3.26
November          4.90                          2.67
December          4.20                          2.53

Regression Statistics
Multiple R            0.895606
R Square              0.802109
Adjusted R Square     0.782320
Standard Error        0.172948
Observations          12

ANOVA
              df     SS           MS           F            Significance F
Regression     1     1.212382     1.212382     40.532970    0.000082
Residual      10     0.299110     0.029911
Total         11     1.511492

              Coefficients    Standard Error    t Stat       P-value     Lower 95%    Upper 95%
Intercept     0.409048        0.385991          1.059736     0.314190    -0.450992    1.269089
Production    0.498830        0.078352          6.366551     0.000082     0.324252    0.673409

[Scatter plot "Car Plant Electricity Usage": Electricity usage (million kWh) versus Production ($ million), with fitted line y = 0.4988x + 0.409 and R² = 0.8021.]

Interpretation of Excel's Regression Output


Regression Statistics

Multiple R            0.895606     (sign of b) · √r²
R Square              0.802109     r² = SSR/SST = 1 - SSE/SST
Adjusted R Square     0.782320     r²_adj = 1 - [SSE/(n - 2)] / [SST/(n - 1)]
Standard Error        0.172948     s_e = √MSE = √(SSE/(n - 2))
Observations          12           n

ANOVA

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square (Variance)   F-statistic    Significance F
Regression            1                    SSR              MSR = SSR/1              F = MSR/MSE    P-value (Prob > F)
Error (Residual)      n - 2                SSE              MSE = SSE/(n - 2)
Total                 n - 1                SST

              df     SS           MS           F            Significance F
Regression     1     1.212382     1.212382     40.532970    0.000082
Residual      10     0.299110     0.029911
Total         11     1.511492

              Coefficients       Std Error    t Stat      P-value      Lower 95%       Upper 95%
Intercept     a (y-intercept)    SE_a         a / SE_a    Prob > |t|   a - t*·SE_a     a + t*·SE_a
Production    b (slope)          SE_b         b / SE_b    Prob > |t|   b - t*·SE_b     b + t*·SE_b

              Coefficients    Std Error    t Stat       P-value     Lower 95%    Upper 95%
Intercept     0.409048        0.385991     1.059736     0.314190    -0.450992    1.269089
Production    0.498830        0.078352     6.366551     0.000082     0.324252    0.673409

Calculating Confidence and Prediction Intervals


StatTools provides prediction intervals for individual values, but it does not provide confidence
intervals for the mean of y, given a set of x's.
The Excel file ConfPredInt.xlsx (located in the Materials folder on Blackboard) computes the
confidence interval and the prediction interval for the Car Plant Electricity Usage data.
The same file can be used to calculate the confidence and prediction intervals for any other data
after providing the information in the cells filled with yellow color.
Reminder: The prediction interval is always wider than the confidence interval because there is
much more variation in predicting an individual value than in estimating a mean value.
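For readers who want to reproduce ConfPredInt.xlsx outside Excel, the sketch below implements the standard simple-regression interval formulas in Python (numpy and scipy assumed); the function name and interface are illustrative, not taken from the workbook.

```python
import numpy as np
from scipy import stats

def conf_and_pred_intervals(x, y, x0, level=0.95):
    """Confidence interval for the mean of y at x0 and prediction interval
    for an individual y at x0, for a simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    b, a = np.polyfit(x, y, 1)                          # least squares slope and intercept
    y_hat = a + b * x
    se = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))    # regression standard error
    Sxx = np.sum((x - x.mean()) ** 2)
    t_star = stats.t.ppf(1 - (1 - level) / 2, n - 2)

    y0 = a + b * x0                                     # fitted value at x0
    ci_half = t_star * se * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)
    pi_half = t_star * se * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)
    return (y0 - ci_half, y0 + ci_half), (y0 - pi_half, y0 + pi_half)
```

Note the extra 1 under the square root for the prediction interval; it is exactly this term that makes the prediction interval wider than the confidence interval.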
Normal Probability Plots
The third regression assumption states that the error terms are normally distributed.
You can evaluate this assumption using a histogram or a normal probability plot of the residuals.
One common normal probability plot is called the quantile-quantile or Q-Q plot.
It is a scatter plot of the standardized values from the data set being evaluated versus the values that
would be expected if the data were perfectly normally distributed (with the same mean and standard
deviation as in the data set). If the data are, in fact, normally distributed, the points in this plot tend to
cluster around a 45° line. Any large deviation from a 45° line signals some type of non-normality.
The following figure illustrates the typical shape of normal probability plots.
If the data are left-skewed, the curve will rise more rapidly at first and then level off.
If the data are normally distributed, the points will plot along an approximately straight line.
If the data are right-skewed, the data will rise more slowly at first and then rise at a faster rate for
higher values of the variable being plotted.

Figure 3 Normal probability plots for (a) left-skewed, (b) normal, and (c) right-skewed distributions.
Regression analysis procedures are quite robust to modest departures from normality.
Unless the distribution of the residuals is severely non-normal, the inferences made from the
regression output are still approximately valid. In addition, some forms of non-normality can
often be remedied by transformations of the response variable.
Although the Regression tool in Excel offers a normal probability plot, this is not the appropriate
probability plot for the residuals, so do not use this option. With StatTools, select Q-Q Normal Plot
from the Normality Tests dropdown list and check both options at the bottom of the dialog box.
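If StatTools is not available, a comparable Q-Q plot of the residuals can be drawn in Python; a rough sketch, assuming numpy, scipy, and matplotlib are installed, using the car plant data from Example 1:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Car plant data from Example 1
x = np.array([4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20])
y = np.array([2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53])

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

# Q-Q plot: standardized residuals against standard normal quantiles
stats.probplot(residuals / residuals.std(ddof=1), dist="norm", plot=plt)
plt.title("Q-Q Normal Plot of Residuals")
plt.show()
```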

Example 2
Sunflowers Apparel
The sales for Sunflowers Apparel, a chain of upscale clothing stores for women, have increased
during the past 12 years as the chain has expanded the number of stores. Until now, Sunflowers
managers selected sites based on subjective factors, such as the availability of a good lease or the
perception that a location seemed ideal for an apparel store.
The new director of planning wants to develop a systematic approach that will lead to making
better decisions during the site selection process. He believes that the size of the store significantly
contributes to store sales, and wants to use this relationship in the decision-making process.
To examine the relationship between the store size and its annual sales, data were collected from
a sample of 14 stores. The data are stored in Sunflowers_Apparel.xlsx.
(a) Use Excel's Regression tool or StatTools to run a linear regression.
Does a straight line provide a useful mathematical model for this relationship?
Regression Statistics
Multiple R            0.950883
R Square              0.904179
Adjusted R Square     0.896194
Standard Error        0.966380
Observations          14

ANOVA
              df     SS             MS             F             Significance F
Regression     1     105.747610     105.747610     113.233513    0.00000018
Residual      12     11.206676      0.933890
Total         13     116.954286

              Coefficients    Standard Error    t Stat        P-value     Lower 95%    Upper 95%
Intercept     0.964474        0.526193          1.832927      0.091727    -0.182003    2.110950
Square Feet   1.669862        0.156925          10.641124     0.000000     1.327951    2.011773

[Scatter plot "Square Feet Line Fit Plot": Annual Sales ($ million) versus Square Feet (thousands), with fitted line y = 1.6699x + 0.9645 and R² = 0.9042.]

As the scatter plot and the high r = 0.95 show, annual sales increase approximately along a
straight line as the size of the store increases. Thus, we can assume that a straight line provides a
useful mathematical model for this relationship.
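As a cross-check outside Excel/StatTools, the same regression can be run in Python; the sketch below assumes the workbook contains columns named "Square Feet" and "Annual Sales" (the actual headers in Sunflowers_Apparel.xlsx may differ) and that pandas and scipy are installed.

```python
import pandas as pd
from scipy import stats

# Column names are assumed here; adjust them to match the actual workbook.
data = pd.read_excel("Sunflowers_Apparel.xlsx")
x = data["Square Feet"]
y = data["Annual Sales"]

result = stats.linregress(x, y)
print("slope:", result.slope)            # about 1.6699
print("intercept:", result.intercept)    # about 0.9645
print("r-squared:", result.rvalue ** 2)  # about 0.9042
print("p-value:", result.pvalue)         # t test for the slope, about 0.00000018
```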
(b) Can you safely predict the annual sales for a store whose size is 7 thousands of square feet?
The square footage in the sample varies from 1.1 to 5.8 thousands of square feet. Therefore, annual
sales should be predicted only for stores whose size is between 1.1 and 5.8 thousands of square feet.
It would be improper to use the prediction line to forecast the sales for a new store containing
7,000 square feet because the relationship between sales and store size may have a point of
diminishing returns. If that is true, as square footage increases beyond 5,800 square feet, the
effect on sales becomes smaller and smaller.
(c) What are the values of SST, SSR, and SSE? Please verify that SST = SSR + SSE.
SST = 116.954286
SSR = 105.747610
SSE = 11.206676
116.954286 = 105.747610 + 11.206676
(d) Use the above sums to calculate the coefficient of determination and interpret it.
$$r^2 = \frac{SSR}{SST} = \frac{105.747610}{116.954286} = 0.904179$$

Therefore, 90.42% of the variation in annual sales is explained by the variability in the size of
the store as measured by the square footage. This large r² indicates a strong linear relationship
between these two variables because the use of a regression model has reduced the variability in
predicting annual sales by 90.42%. Only 9.58% of the sample variability in annual sales is due to
factors other than what is accounted for by the linear regression model that uses square footage.
(e) Interpret the standard error of the estimate.
$s_e$ = 0.966380
Recall that the standard error of the estimate represents a measure of the variation around the
prediction line. It is measured in the same units as the dependent variable y.
Here, the typical difference between the actual annual sales at a store and the predicted annual sales
using the regression equation is approximately 0.966 million dollars, or $966,380.
(f) What are the expected annual sales and the residual value for the last data pair (x = 3, y = 4.1)?
Interpret both of these values in business terms.
From the regression output, the expected (or predicted) annual sales equal 5.974061,
indicating that we expect annual sales to be $5,974,061, on average, for a store with size of
3,000 square feet. The residual equals 1.874061, indicating that for the store corresponding
to the last pair in the data set the actual annual sales were $1,874,061 lower than expected.
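Both numbers follow directly from the fitted equation ŷ = 0.964474 + 1.669862x; a quick check in Python:

```python
a, b = 0.964474, 1.669862     # intercept and slope from the regression output
x_last, y_last = 3, 4.1       # last data pair

y_hat = a + b * x_last        # expected annual sales (millions of dollars)
residual = y_last - y_hat     # actual minus predicted

print(y_hat, residual)        # about 5.974 and -1.874
```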

(g) Construct a residual plot.


[Residual plot "Square Feet Residual Plot": residuals versus Square Feet.]

(h) Use the residual plot to evaluate the regression model assumptions about linearity
(mean of zero), independence, and equal spread of the residuals.
Linearity
Although there is widespread scatter in the residual plot, there is no clear pattern or
relationship between the residuals and the values of $x_i$. The residuals appear to be evenly spread
above and below 0 for different values of x.
Independence
You can evaluate the assumption of independence of the errors by plotting the residuals in
the order or sequence in which the data were collected. If the values of y are part of a time
series, one residual may sometimes be related to the previous residual.
If this relationship exists between consecutive residuals (which violates the assumption of
independence), the plot of the residuals versus the time in which the data were collected will
often show a cyclical pattern. Because the Sunflowers Apparel data were collected during the
same time period, you can assume that the independence assumption is satisfied for these data.
Equal spread
There do not appear to be major differences in the variability of the residuals for different xi
values. Thus, you can conclude that there is no apparent violation in the assumption of equal
spread at each level of x.


(i) Use StatTools to construct a normal probability plot and evaluate the regression model
assumption about normality of the residuals.
[StatTools chart "Q-Q Normal Plot of Residuals": Standardized Q-Value versus Z-Value.]

From the Q-Q plot of the residuals, the data do not appear to depart substantially from a
normal distribution. The robustness of regression analysis with modest departures from
normality enables you to conclude that you should not be overly concerned about departures
from this normality assumption in the Sunflowers Apparel data.
Note: Excel does not readily provide a normal probability plot of the residuals, but one can be
obtained in the following way. Run a regression with y = Residuals and x = any numbers (for example,
x = 1, 2, 3, ...) and check the Normal Probability Plots box. Here is the result for our example.

[Excel chart "Normal Probability Plot": Residuals versus Sample Percentile.]

(j) Find the standard error of the slope coefficient. What does this number indicate?
The standard error of the slope coefficient indicates the uncertainty in the estimated slope.
It measures roughly how far the estimated slope (the regression coefficient computed from the
sample) differs from the (idealized) true population slope, $\beta$, due to the randomness of
sampling. Here, the estimated slope b = 1.669862 (or $1,669,862) differs from the population
slope by about SE_b = 0.156925 (or $156,925).

(k) Find and interpret the 95% confidence interval for the slope coefficient.
(Note: In economics language for the slope, this question sounds like the following.
Find the 95% confidence interval for the expected marginal value of an additional 1,000
square feet to Sunflowers Apparel.)
The 95% confidence interval extends from 1.327951 to 2.011773 or $1,327,951 to $2,011,773.
We are 95% confident that the additional annual sales, for an additional 1,000 square feet in
store size, are between $1,327,951 and $2,011,773, on average.
(In economics language: We are 95% sure that the expected marginal value of an additional
1,000 square feet is between $1,327,951 and $2,011,773.)
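This interval is simply b ± t*·SE_b with n - 2 = 12 degrees of freedom; a quick check in Python (scipy assumed):

```python
from scipy import stats

b, SE_b, n = 1.669862, 0.156925, 14      # slope, its standard error, sample size

t_star = stats.t.ppf(0.975, n - 2)       # about 2.1788 for 12 degrees of freedom
lower, upper = b - t_star * SE_b, b + t_star * SE_b
print(lower, upper)                      # about 1.3280 and 2.0118
```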
(l) Use the F statistic from the regression output to determine whether the slope is statistically
significant.
The ANOVA table shows that the computed test statistic is F = 113.2335 and the P-value is
approximately 0. Therefore, you reject the null hypothesis $H_0: \beta = 0$ and conclude that the
size of the store is significantly related to annual sales.
(m) Use the t statistic from the regression output to determine whether the slope is statistically
significant.
The slope is significantly different from 0. This can be seen either from the P-value of the
test statistic (for t = 10.6411, the P-value is approximately 0) or from the confidence interval
for the slope (1.33 to 2.01), which does not include 0. This says that the size of the store
significantly contributes to store annual sales. Stores with larger size make larger annual sales,
on average.
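The t statistic and its two-sided P-value can likewise be verified directly; a short Python sketch (scipy assumed):

```python
from scipy import stats

b, SE_b, n = 1.669862, 0.156925, 14

t = b / SE_b                              # about 10.64
p_value = 2 * stats.t.sf(abs(t), n - 2)   # two-sided P-value, essentially 0
print(t, p_value)
```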
(n) Use the ConfPredInt.xlsx to construct a 95% confidence interval of the mean annual sales
for the entire population of stores that contain 4,000 square feet (x = 4).
The confidence interval is 6.971119 to 8.316727. Therefore, with 95% confidence, the mean annual
sales are between $6,971,119 and $8,316,727 for the population of stores with 4,000 square feet.
(o) Use the ConfPredInt.xlsx to construct a 95% prediction interval of the annual sales for an
individual store that contains 4,000 square feet (x = 4).
The prediction interval is 5.433482 to 9.854364. Therefore, with 95% confidence, you predict
that the annual sales for an individual store with 4,000 square feet will be between $5,433,482 and
$9,854,364.
(p) Compare the intervals constructed in (n) and (o).
If you compare the confidence interval estimate and the prediction interval estimate, you see
that the prediction interval for an individual store is much wider than the confidence interval
estimate for the mean.

