
STAT E-150 - Statistical Methods

Inference for Regression



We have discussed how to find a simple linear regression model to make predictions. Now
we will investigate our model further:

- How do we evaluate the effectiveness of the model?
- How do we know that the relationship is significant?
- How much of the variability in the response variable can be explained by its
relationship to the predictor?


The simple linear model is of the form $y = \beta_0 + \beta_1 x + \varepsilon$, where $\varepsilon \sim N(0, \sigma)$.
Recall that inference methods can address questions about the population based on our
sample data.


Consider our earlier example:

Medical researchers have noted that adolescent females are more likely to deliver low-birthweight (LBW) babies than are adult females. Because LBW babies tend to have higher mortality rates, studies have been conducted to examine the relationship between birthweight and the mother's age.

One such study is discussed in the article "Body Size and Intelligence in 6-Year-Olds: Are Offspring of Teenage Mothers at Risk?" (Maternal and Child Health Journal [2009], pp. 847-856). The following data are consistent with summary values given in the article and with data published by the National Center for Health Statistics:

Observation 1 2 3 4 5 6 7 8 9 10
Maternal Age 15 17 18 15 16 19 17 16 18 19
Birthweight (g) 2289 3393 3271 2648 2897 3327 2970 2535 3138 3573

We found the fitted regression model

$$\widehat{\text{weight}} = -1163.45 + 245.15(\text{age})$$
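Here is a minimal Python sketch (my addition, not part of the original handout; it assumes the numpy and scipy libraries) that reproduces these coefficients from the data above:

```python
import numpy as np
from scipy import stats

# Data from the table above
age = np.array([15, 17, 18, 15, 16, 19, 17, 16, 18, 19])
weight = np.array([2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573])

# Ordinary least squares fit: weight-hat = b0 + b1 * age
fit = stats.linregress(age, weight)
print(f"b1 (slope)     = {fit.slope:.2f}")      # about 245.15
print(f"b0 (intercept) = {fit.intercept:.2f}")  # about -1163.45
```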

How can we assess this model? If the slope $\beta_1$ is equal to zero, then there is no change in the response variable associated with a change in the predictor. We will therefore test the value of the slope to investigate whether there is a linear relationship between the two variables.




Assumptions for the model and the errors:

1. Linearity Assumption
Straight Enough Condition: does the scatterplot appear linear?
Check the residuals to see if they appear to be randomly scattered
Quantitative Data Condition: Is the data quantitative?

2. Independence Assumption: the errors must be mutually independent
Randomization Condition: the individuals are a random sample
Check the residuals for patterns, trends, clumping

3. Equal Variance Assumption: the variability of y should be about the same for all values
of x
Does The Plot Thicken? Condition:
Is the spread about the line nearly constant in the scatterplot?
Check the residuals for any patterns

4. Normal Population Assumption: the errors follow a Normal model at each value of x
Nearly Normal Condition:
Look at a histogram or NPP of the residuals


We can check the conditions to see whether all of the assumptions are reasonable. If so, the idealized regression model has a distribution of y-values at each x-value, and these distributions are approximately normal, with equal variation and with means that lie along the regression line.


T-test for the Slope of a Simple Linear Model

$H_0$: $\beta_1 = 0$
$H_a$: $\beta_1 \neq 0$

The test statistic is

$$t = \frac{b_1 - 0}{SE(b_1)} = \frac{b_1}{SE(b_1)}$$

with n - 2 degrees of freedom.
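As a rough illustration (again my addition, not part of the SPSS-based handout), the same test statistic and p-value can be computed in Python:

```python
import numpy as np
from scipy import stats

age = np.array([15, 17, 18, 15, 16, 19, 17, 16, 18, 19])
weight = np.array([2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573])

fit = stats.linregress(age, weight)      # fit.stderr is SE(b1)
t = (fit.slope - 0) / fit.stderr         # t = (b1 - 0) / SE(b1)
df = len(age) - 2                        # n - 2 degrees of freedom
p = 2 * stats.t.sf(abs(t), df)           # two-sided p-value
print(f"t = {t:.3f}, df = {df}, p = {p:.4f}")   # compare with the coefficient table later in the handout
```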


Here are the steps:


1. Create a scatterplot to see if the data are straight enough.
2. Fit a regression model and find the residuals (e) and predicted values ($\hat{y}$).
3. Draw a scatterplot of the residuals vs. x or $\hat{y}$; this plot should show no pattern, bend, thickening, thinning, or outliers.
4. If the scatterplot is straight enough, create a histogram and NPP of the residuals to check the nearly normal condition.
5. Continue with the inference if all conditions are reasonably satisfied.
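A minimal Python sketch of these diagnostic plots (my addition; it assumes numpy, scipy, and matplotlib, and the plot labels are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

age = np.array([15, 17, 18, 15, 16, 19, 17, 16, 18, 19])
weight = np.array([2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573])

fit = stats.linregress(age, weight)
predicted = fit.intercept + fit.slope * age   # fitted values (y-hat)
residuals = weight - predicted                # residuals (e)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].scatter(age, weight)                  # step 1: straight enough?
axes[0].set(title="Birthweight vs. Age", xlabel="Maternal Age", ylabel="Birthweight (g)")
axes[1].scatter(predicted, residuals)         # step 3: no pattern, bend, or thickening?
axes[1].axhline(0, color="gray")
axes[1].set(title="Residuals vs. Fitted", xlabel="Predicted", ylabel="Residual")
stats.probplot(residuals, dist="norm", plot=axes[2])  # step 4: nearly normal?
plt.tight_layout()
plt.show()
```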


Here are the results for our data:

- The scatterplot of the data indicates a positive linear relationship between the variables.

- The scatterplot of the residuals shows no particular pattern.

- The Normal Probability Plot for the residuals indicates a Normal distribution.
Here is our data:

Observation 1 2 3 4 5 6 7 8 9 10
Maternal Age 15 17 18 15 16 19 17 16 18 19
Birthweight (g) 2289 3393 3271 2648 2897 3327 2970 2535 3138 3573

$H_0$:

$H_a$:



The SPSS output included this table:

Coefficients(a)

Model           Unstandardized B    Std. Error    Standardized Beta    t        Sig.    95% CI for B (Lower, Upper)
1  (Constant)   -1163.450           783.138                            -1.486   .176    (-2969.369, 642.469)
   Age          245.150             45.908        .884                 5.340    .001    (139.285, 351.015)

a. Dependent Variable: Birthweight



$b_0$ =

What does this value represent?


$b_1$ =

What does this value represent?


$SE(b_1)$ =

What does this value represent?

Coefficients(a)

Model           Unstandardized B    Std. Error    Standardized Beta    t        Sig.    95% CI for B (Lower, Upper)
1  (Constant)   -1163.450           783.138                            -1.486   .176    (-2969.369, 642.469)
   Age          245.150             45.908        .884                 5.340    .001    (139.285, 351.015)

a. Dependent Variable: Birthweight


$$t = \frac{b_1}{SE(b_1)} =$$



What is the value of the test statistic?

What is the p-value?

What is your decision?


What can you conclude?
Confidence Interval for the Slope

The slope of the population regression line, $\beta_1$, is the rate of change of the mean response as the explanatory variable increases.

The slope of the least squares line, $b_1$, is an estimate of $\beta_1$. A confidence interval for the slope will show how accurate this estimate is.

Coefficients(a)

Model           Unstandardized B    Std. Error    Standardized Beta    t        Sig.    95% CI for B (Lower, Upper)
1  (Constant)   -1163.450           783.138                            -1.486   .176    (-2969.369, 642.469)
   Age          245.150             45.908        .884                 5.340    .001    (139.285, 351.015)

a. Dependent Variable: Birthweight


What is the confidence interval for the slope of the regression line?







Confidence Interval for the Slope of a Simple Linear Model

The confidence interval for $\beta_1$ has the form

$$b_1 \pm t^{*}\,SE(b_1)$$

where $t^{*}$ is the critical value from the $t_{n-2}$ density curve for the desired confidence level.


How can we construct this interval?

We know that $b_1 = 245.15$ and that $SE(b_1) = 45.908$, but how do we find $t^{*}$?

We know that df = n - 2 = 8. On the line for df = 8 of the t-table, the value of $t^{*}$ for a 95% confidence interval is 2.306.
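If a t-table is not at hand, the same critical value can be found in software; for example, a small Python check (my addition, assuming scipy):

```python
from scipy import stats

# 95% confidence leaves 2.5% in each tail; df = n - 2 = 8
t_star = stats.t.ppf(0.975, 8)
print(round(t_star, 3))   # 2.306
```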








Calculate:

$$b_1 \pm t^{*}\,SE(b_1) = 245.15 \pm 2.306(45.908) = 245.15 \pm 105.86 = (139.29, 351.01)$$


What is the confidence interval found by SPSS?

What does this interval tell us?




Partitioning Variability - ANOVA

ANOVA assesses the effectiveness of the model by measuring how much of the variability in the response variable y is explained by the predictions based on the fitted model.

We can partition this variability into two parts: the variability explained by the model, and
the unexplained variability due to error, as measured by the residuals.

In our SPSS output,

SS(Model) =

SS(Error) =

SS(Total) =

ANOVA(a)

Model           Sum of Squares    df    Mean Square     F         Sig.
1  Regression   1201970.450       1     1201970.450     28.515    .001(b)
   Residual     337212.450        8     42151.556
   Total        1539182.900       9

a. Dependent Variable: Birthweight
b. Predictors: (Constant), Age
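These sums of squares can also be reproduced directly from the data. Here is a minimal Python sketch (my addition, assuming numpy and scipy) that mirrors the SPSS ANOVA table:

```python
import numpy as np
from scipy import stats

age = np.array([15, 17, 18, 15, 16, 19, 17, 16, 18, 19])
weight = np.array([2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573])

fit = stats.linregress(age, weight)
predicted = fit.intercept + fit.slope * age

ss_model = np.sum((predicted - weight.mean()) ** 2)   # variability explained by the model
ss_error = np.sum((weight - predicted) ** 2)          # unexplained (residual) variability
ss_total = np.sum((weight - weight.mean()) ** 2)      # total variability in y

ms_model = ss_model / 1                  # model df = 1 for simple linear regression
ms_error = ss_error / (len(age) - 2)     # error df = n - 2
F = ms_model / ms_error                  # about 28.5; note that F equals t squared here
r_squared = ss_model / ss_total          # coefficient of determination, about .78

print(ss_model, ss_error, ss_total)
print(F, r_squared)
```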






Test for Simple Linear Regression


$H_0$: $\beta_1 = 0$
$H_a$: $\beta_1 \neq 0$

The test statistic is

$$F = \frac{MS(\text{Model})}{MS(\text{Error})}$$


ANOVA(a)

Model           Sum of Squares    df    Mean Square     F         Sig.
1  Regression   1201970.450       1     1201970.450     28.515    .001(b)
   Residual     337212.450        8     42151.556
   Total        1539182.900       9

a. Dependent Variable: Birthweight
b. Predictors: (Constant), Age


What is the value of the test statistic?

What is the p-value?

Decision:

Conclusion:





The Coefficient of Determination

This represents the proportion of variation in the response variable that can be explained by the model:

$$r^2 = \frac{SS(\text{model})}{SS(\text{total})}$$


In our example,

$$r^2 = \frac{SS(\text{model})}{SS(\text{total})} = \frac{1201970.45}{1539182.9} \approx .781$$

This tells us that about 78.1% of the variation in the birthweights can be explained by the model.

Inference for Correlation

The slope of the least squares line is related to the correlation between the variables:

$$b_1 = r\,\frac{s_Y}{s_X}$$

So testing whether the slope is zero is equivalent to testing that there is no correlation between the variables; that is, that the population correlation coefficient $\rho = 0$.









T-Test for Correlation

$H_0$: $\rho = 0$
$H_a$: $\rho \neq 0$


The three tests we have discussed are equivalent; it can be shown that the test statistic F is the square of the test statistic t (in our example, $5.340^2 \approx 28.52$, which matches F = 28.515 up to rounding).



















T-Test for Correlation

$H_0$: $\rho = 0$
$H_a$: $\rho \neq 0$

The test statistic is

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

If the conditions of the simple linear model hold, we find the p-value using the t-distribution with n - 2 degrees of freedom.
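A short Python sketch of this test (my addition, assuming scipy; pearsonr reports the same two-sided p-value):

```python
import numpy as np
from scipy import stats

age = np.array([15, 17, 18, 15, 16, 19, 17, 16, 18, 19])
weight = np.array([2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573])

r, p = stats.pearsonr(age, weight)              # correlation and two-sided p-value
n = len(age)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)      # same t as the test for the slope
print(f"r = {r:.3f}, t = {t:.3f}, p = {p:.4f}")
```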

Confidence Intervals and Prediction Intervals

Let $x_\nu$ be a specific value of x. The predicted value of y at $x_\nu$ is

$$\hat{y}_\nu = b_0 + b_1 x_\nu$$

We can create two different intervals:

- a prediction interval for an individual y-value at $x_\nu$
- a confidence interval for the mean predicted value at $x_\nu$

The basic format for an interval is

$$\hat{y}_\nu \pm t^{*}_{n-2} \times SE$$

When we want to find a mean predicted value,

$$SE(\hat{\mu}_\nu) = \sqrt{SE^2(b_1)\,(x_\nu - \bar{x})^2 + \frac{s_e^2}{n}}$$

When we want to find an individual predicted value,

$$SE(\hat{y}_\nu) = \sqrt{SE^2(b_1)\,(x_\nu - \bar{x})^2 + \frac{s_e^2}{n} + s_e^2}$$

Since individual values vary more than means,

$$SE(\hat{\mu}_\nu) < SE(\hat{y}_\nu),$$

and the prediction interval will be wider than the confidence interval.
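As an illustration (my addition, not part of the original handout), a Python sketch that applies these two standard-error formulas at a hypothetical maternal age of 18:

```python
import numpy as np
from scipy import stats

age = np.array([15, 17, 18, 15, 16, 19, 17, 16, 18, 19])
weight = np.array([2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573])

fit = stats.linregress(age, weight)
n = len(age)
residuals = weight - (fit.intercept + fit.slope * age)
s_e = np.sqrt(np.sum(residuals**2) / (n - 2))      # residual standard error

x_new = 18                                         # hypothetical age of interest
y_hat = fit.intercept + fit.slope * x_new          # predicted birthweight at x_new

# SE for the mean response and for an individual response (formulas above)
se_mean = np.sqrt(fit.stderr**2 * (x_new - age.mean())**2 + s_e**2 / n)
se_indiv = np.sqrt(fit.stderr**2 * (x_new - age.mean())**2 + s_e**2 / n + s_e**2)

t_star = stats.t.ppf(0.975, n - 2)
print("95% CI for the mean response:", (y_hat - t_star * se_mean, y_hat + t_star * se_mean))
print("95% prediction interval:     ", (y_hat - t_star * se_indiv, y_hat + t_star * se_indiv))
```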
