Chapter 14
1519T_c14 03/27/2006 07:28 AM Page 615
A multiple regression model with \(y\) as a dependent variable and \(x_1, x_2, x_3, \ldots, x_k\) as independent variables is written as
\[ y = A + B_1 x_1 + B_2 x_2 + B_3 x_3 + \cdots + B_k x_k + \epsilon \qquad (1) \]
where \(A\) represents the constant term, \(B_1, B_2, B_3, \ldots, B_k\) are the regression coefficients of independent variables \(x_1, x_2, x_3, \ldots, x_k\), respectively, and \(\epsilon\) represents the random error term. This
model contains \(k\) independent variables \(x_1, x_2, x_3, \ldots,\) and \(x_k\). From model (1), it would seem
that multiple regression models can only be used when the relationship between the depend-
ent variable and each independent variable is linear. Furthermore, it also appears as if there
can be no interaction between two or more of the independent variables. This is far from the
truth. In the real world, a multiple regression model can be much more complex. Discussion
of such models is outside the scope of this book. When each term contains a single inde-
pendent variable raised to the first power as in model (1), we call it a first-order multiple
regression model. This is the only type of multiple regression model we will discuss in this
chapter.
In regression model (1), \(A\) represents the constant term, which gives the value of \(y\) when
all independent variables assume zero values. The coefficients \(B_1, B_2, B_3, \ldots,\) and \(B_k\) are called
the partial regression coefficients. For example, \(B_1\) is a partial regression coefficient of \(x_1\). It
gives the change in \(y\) due to a one-unit change in \(x_1\) when all other independent variables included
in the model are held constant. In other words, if we change \(x_1\) by one unit but keep
\(x_2, x_3, \ldots,\) and \(x_k\) unchanged, then the resulting change in \(y\) is measured by \(B_1\). Similarly, the
value of \(B_2\) gives the change in \(y\) due to a one-unit change in \(x_2\) when all other independent
variables are held constant. In model (1) above, \(A, B_1, B_2, B_3, \ldots,\) and \(B_k\) are called the true
regression coefficients or population parameters.
A positive value for a particular Bi in model (1) will indicate a positive relationship between
y and the corresponding xi variable. A negative value for a particular Bi in that model will in-
dicate a negative relationship between y and the corresponding xi variable.
Remember that in a first-order regression model such as model (1), the relationship between
each \(x_i\) and \(y\) is a straight-line relationship. In model (1), \(A + B_1 x_1 + B_2 x_2 + B_3 x_3 + \cdots + B_k x_k\)
is called the deterministic portion and \(\epsilon\) is the stochastic portion of the model.
When we use the t distribution to make inferences about a single parameter of a multiple
regression model, the degrees of freedom are calculated as
\[ df = n - k - 1 \]
where n represents the sample size and k is the number of independent variables in the model.
Definition
Multiple Regression Model A regression model that includes two or more independent variables
is called a multiple regression model. It is written as
\[ y = A + B_1 x_1 + B_2 x_2 + B_3 x_3 + \cdots + B_k x_k + \epsilon \]
where \(y\) is the dependent variable, \(x_1, x_2, x_3, \ldots, x_k\) are the \(k\) independent variables, and \(\epsilon\) is the
random error term. When each of the \(x_i\) variables represents a single variable raised to the first
power as in the above model, this model is referred to as a first-order multiple regression model.
For such a model with a sample size of \(n\) and \(k\) independent variables, the degrees of freedom are:
\[ df = n - k - 1 \]
When a multiple regression model includes only two independent variables (with \(k = 2\)),
model (1) reduces to
\[ y = A + B_1 x_1 + B_2 x_2 + \epsilon \]
A multiple regression model with three independent variables (with \(k = 3\)) is written as
\[ y = A + B_1 x_1 + B_2 x_2 + B_3 x_3 + \epsilon \]
If model (1) is estimated using sample data, which is usually the case, the estimated re-
gression equation is written as
\[ \hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + b_k x_k \qquad (2) \]
In equation (2), \(a, b_1, b_2, b_3, \ldots,\) and \(b_k\) are the sample statistics, which are the point estimators
of the population parameters \(A, B_1, B_2, B_3, \ldots,\) and \(B_k\), respectively.
In model (1), y denotes the actual values of the dependent variable for members of the sam-
ple. In the estimated model (2), ŷ denotes the predicted or estimated values of the dependent
variable. The difference between any pair of y and ŷ values gives the error of prediction. For a
multiple regression model,
\[ SSE = \sum (y - \hat{y})^2 \]
Assumption 2: The errors associated with different sets of values of independent variables are
independent. Furthermore, these errors are normally distributed and have a constant standard
deviation, which is denoted by \(\sigma_\epsilon\).
Assumption 3: The independent variables are not linearly related. However, they can have a
nonlinear relationship. When independent variables are highly linearly correlated, it is referred
to as multicollinearity. This assumption is about the nonexistence of the multicollinearity prob-
lem. For example, consider the following multiple regression model:
\[ y = A + B_1 x_1 + B_2 x_2 + B_3 x_3 + \epsilon \]
None of the following linear relationships (or other such linear relationships) between \(x_1\), \(x_2\), and
\(x_3\) may hold for this model.
\[ x_1 = x_2 + 4x_3 \]
\[ x_2 = 5x_1 - 2x_3 \]
\[ x_1 = 3.5x_2 \]
If any such linear relationship exists, we can substitute one variable for another, which will reduce
the number of independent variables to two. However, nonlinear relationships, such as \(x_1 = 4x_2^2\)
and \(x_2 = 2x_1 + 6x_3^2\), between \(x_1\), \(x_2\), and \(x_3\) are permissible.
In practice, multicollinearity is a major issue. Examining the correlation for each pair of
independent variables is a good way to determine if multicollinearity exists.
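The pairwise check described above can be sketched in a few lines of Python. This is only an illustration, not part of the MINITAB workflow: it uses the two independent variables of Table 14.1 (Example 14–1), and the function name pearson_r is a hypothetical helper. The text gives no cutoff for "highly correlated," so the sketch only reports the correlation.

```python
from math import sqrt

# Independent variables from Table 14.1 (Example 14-1)
experience = [5, 14, 6, 10, 4, 8, 11, 16, 3, 9, 19, 13]   # years
violations = [2, 0, 1, 3, 6, 2, 3, 1, 5, 1, 0, 3]         # past 3 years


def pearson_r(u, v):
    """Sample linear correlation coefficient of two equal-length lists."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)


# Correlation for the single pair of independent variables (x1, x2)
r_x1_x2 = pearson_r(experience, violations)
print(round(r_x1_x2, 3))
```

With more than two independent variables, the same function would be applied to every pair.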
Assumption 4: There is no linear association between the random error term and each inde-
pendent variable xi.
The coefficient of multiple determination \(R^2\) has one major shortcoming. The value of \(R^2\) generally increases as we add more and more explanatory variables to the regression model (even if
they do not belong in the model). Just because we can increase the value of \(R^2\) does not imply
that the regression equation with a higher value of \(R^2\) does a better job of predicting the dependent
variable. Such a value of \(R^2\) will be misleading, and it will not represent the true explanatory
power of the regression model. To eliminate this shortcoming of \(R^2\), it is preferable to use the adjusted
coefficient of multiple determination, which is denoted by \(\bar{R}^2\). Note that \(\bar{R}^2\) is the coefficient
of multiple determination adjusted for degrees of freedom. The value of \(\bar{R}^2\) may increase,
decrease, or stay the same as we add more explanatory variables to our regression model. If a new
variable added to the regression model contributes significantly to explaining the variation in \(y\), then
\(\bar{R}^2\) increases; otherwise it decreases. The value of \(\bar{R}^2\) is calculated as follows.
\[ \bar{R}^2 = 1 - (1 - R^2)\left(\frac{n - 1}{n - k - 1}\right) \quad \text{or} \quad 1 - \frac{SSE/(n - k - 1)}{SST/(n - 1)} \]
Thus, if we know \(R^2\), we can find the value of \(\bar{R}^2\). Almost all statistical software packages give
the values of both \(R^2\) and \(\bar{R}^2\) for a regression model.
Another property of \(\bar{R}^2\) to remember is that whereas \(R^2\) can never be negative, \(\bar{R}^2\) can be
negative.
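As a quick numerical illustration of the two equivalent forms of the adjusted coefficient of multiple determination, the following Python sketch may help. The function names are hypothetical; the first call uses the values reported later in Example 14–2 (an unadjusted value of 93.1% with n = 12 and k = 2), and the second uses made-up inputs chosen only to show that the adjusted value can be negative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 from the unadjusted R^2, sample size n, and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def adjusted_r2_from_sums(sse, sst, n, k):
    """Same quantity computed from the error and total sums of squares."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


# Values from Example 14-2: R^2 = 93.1%, n = 12 drivers, k = 2 variables
print(round(adjusted_r2(0.931, 12, 2), 3))

# Illustrative (made-up) weak model: the adjusted value drops below zero
print(round(adjusted_r2(0.05, 12, 2), 3))
```

The two functions agree whenever the unadjusted value equals 1 − SSE/SST, which is how the second form of the formula is obtained from the first.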
While a general rule of thumb is that a higher value of \(R^2\) implies that a specific set of independent
variables does a better job of predicting a specific dependent variable, it is important to
recognize that some dependent variables have a great deal more variability than others. Therefore,
\(R^2 = .30\) could imply that a specific model is not a very strong model, but it could be the best
possible model in a certain scenario. Many good financial models have values of \(R^2\) below .50.
EXAMPLE 14–1
Using MINITAB to find a multiple regression equation.
A researcher wanted to find the effect of driving experience and the number of driving violations
on auto insurance premiums. A random sample of 12 drivers insured with the same company
and having similar auto insurance policies was selected from a large city. Table 14.1 lists
the monthly auto insurance premiums (in dollars) paid by these drivers, their driving experience
(in years), and the numbers of driving violations committed by them during the past
three years.

Table 14.1
Monthly Premium    Driving Experience    Number of Driving Violations
(dollars)          (years)               (past 3 years)
148                5                     2
76                 14                    0
100                6                     1
126                10                    3
194                4                     6
110                8                     2
114                11                    3
86                 16                    1
198                3                     5
92                 9                     1
70                 19                    0
120                13                    3

Using MINITAB, find the regression equation of monthly premiums paid by drivers on the
driving experiences and the numbers of driving violations.
Solution Let
\(y\) = the monthly auto insurance premium (in dollars) paid by a driver
\(x_1\) = the driving experience (in years) of a driver
\(x_2\) = the number of driving violations committed by a driver during the past three years
We are to estimate the regression model
\[ y = A + B_1 x_1 + B_2 x_2 + \epsilon \qquad (3) \]
The first step is to enter the data of Table 14.1 into a MINITAB spreadsheet as shown in Screen
14.1. Here we have entered the given data in columns C1, C2, and C3 and named them Monthly
Premium, Driving Experience and Driving Violations, respectively.
Screen 14.1
Screen 14.2
Screen 14.3
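The MINITAB screens are not reproduced in this text. As a rough cross-check outside MINITAB, the least squares estimates for model (3) can be sketched in plain Python from the data of Table 14.1. This is an illustrative sketch, not the MINITAB procedure: the function name fit_two_predictors is hypothetical, and it uses the closed-form least squares solution that applies when there are exactly two independent variables.

```python
# Data of Table 14.1
premium = [148, 76, 100, 126, 194, 110, 114, 86, 198, 92, 70, 120]   # y, dollars
experience = [5, 14, 6, 10, 4, 8, 11, 16, 3, 9, 19, 13]              # x1, years
violations = [2, 0, 1, 3, 6, 2, 3, 1, 5, 1, 0, 3]                    # x2, past 3 years


def fit_two_predictors(y, x1, x2):
    """Least squares estimates (a, b1, b2) for y-hat = a + b1*x1 + b2*x2."""
    n = len(y)
    y_bar, x1_bar, x2_bar = sum(y) / n, sum(x1) / n, sum(x2) / n
    d1 = [v - x1_bar for v in x1]   # deviations of x1 from its mean
    d2 = [v - x2_bar for v in x2]   # deviations of x2 from its mean
    dy = [v - y_bar for v in y]     # deviations of y from its mean
    s11 = sum(u * u for u in d1)
    s22 = sum(u * u for u in d2)
    s12 = sum(u * v for u, v in zip(d1, d2))
    s1y = sum(u * v for u, v in zip(d1, dy))
    s2y = sum(u * v for u, v in zip(d2, dy))
    # det is zero when x1 and x2 are perfectly linearly related
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = y_bar - b1 * x1_bar - b2 * x2_bar
    return a, b1, b2


a, b1, b2 = fit_two_predictors(premium, experience, violations)
print(f"y-hat = {a:.2f} + ({b1:.4f}) x1 + ({b2:.3f}) x2")
```

The estimates should agree, up to rounding, with the coefficients read from the MINITAB output in Example 14–2.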
EXAMPLE 14–2
Interpreting parts of the MINITAB solution of multiple regression.
Refer to Example 14–1 and the MINITAB solution given in Screen 14.3.
(a) Explain the meaning of the estimated regression coefficients.
(b) What are the values of the standard deviation of errors, the coefficient of multiple
determination, and the adjusted coefficient of multiple determination?
(c) What is the predicted auto insurance premium paid per month by a driver with seven
years of driving experience and three driving violations committed in the past three
years?
(d) What is the point estimate of the expected (or mean) auto insurance premium paid
per month by all drivers with 12 years of driving experience and 4 driving violations
committed in the past three years?
Solution
(a) From the portion of the MINITAB solution that is marked I in Screen 14.3, the esti-
mated regression equation is
\[ \hat{y} = 110 - 2.75x_1 + 16.1x_2 \qquad (4) \]
From this equation,
\[ a = 110, \quad b_1 = -2.75, \quad \text{and} \quad b_2 = 16.1 \]
We can also read the values of these coefficients from the column labeled Coef in the
portion of the output marked II in the MINITAB solution of Screen 14.3. From this
column we obtain
\[ a = 110.28, \quad b_1 = -2.7473, \quad \text{and} \quad b_2 = 16.106 \]
Notice that in this column the coefficients of the regression equation appear with more
digits after the decimal point. With these coefficient values, we can write the estimated
regression equation as
\[ \hat{y} = 110.28 - 2.7473x_1 + 16.106x_2 \qquad (5) \]
The value of \(a = 110.28\) in the estimated regression equation (5) gives the value of \(\hat{y}\)
for \(x_1 = 0\) and \(x_2 = 0\). Thus, a driver with no driving experience and no driving
violations committed in the past three years is expected to pay an auto insurance
premium of $110.28 per month. Again, this is the technical interpretation of a. In
reality, that may not be true because none of the drivers in our sample has both zero
experience and zero driving violations. As all of us know, some of the highest premiums
are paid by teenagers just after obtaining their driver's licenses.
The value of \(b_1 = -2.7473\) in the estimated regression model gives the change in
ŷ for a one-unit change in x1 when x2 is held constant. Thus, we can state that a driver
with one extra year of experience but the same number of driving violations is ex-
pected to pay $2.7473 (or $2.75) less per month for the auto insurance premium. Note
that because b1 is negative, an increase in driving experience decreases the premium
paid. In other words, y and x1 have a negative relationship.
The value of \(b_2 = 16.106\) in the estimated regression model gives the change in \(\hat{y}\)
for a one-unit change in x2 when x1 is held constant. Thus, a driver with one extra
driving violation during the past three years but with the same years of driving expe-
rience is expected to pay $16.106 (or $16.11) more per month for the auto insurance
premium.
(b) The values of the standard deviation of errors, the coefficient of multiple determina-
tion, and the adjusted coefficient of multiple determination are given in part III of the
MINITAB solution of Screen 14.3. From this part of the solution,
\[ s_e = 12.1459, \quad R^2 = 93.1\%, \quad \text{and} \quad \bar{R}^2 = 91.6\% \]
Thus, the standard deviation of errors is 12.1459. The value of \(R^2 = 93.1\%\) tells us
that the two independent variables, years of driving experience and number of
driving violations, explain 93.1% of the variation in the auto insurance premiums. The
value of \(\bar{R}^2 = 91.6\%\) is the value of the coefficient of multiple determination adjusted
for degrees of freedom. It states that when adjusted for degrees of freedom, the two
independent variables explain 91.6% of the variation in the dependent variable.
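The three summary values in part (b) can be approximated from the data of Table 14.1 and the rounded coefficients of equation (5). The Python sketch below is an illustration under that assumption, so its results may differ from the MINITAB output in the last digit or so; all variable names are hypothetical.

```python
from math import sqrt

# Data of Table 14.1
premium = [148, 76, 100, 126, 194, 110, 114, 86, 198, 92, 70, 120]
experience = [5, 14, 6, 10, 4, 8, 11, 16, 3, 9, 19, 13]
violations = [2, 0, 1, 3, 6, 2, 3, 1, 5, 1, 0, 3]

# Rounded coefficients of estimated regression equation (5)
a, b1, b2 = 110.28, -2.7473, 16.106
n, k = len(premium), 2

# Predicted premiums and the two sums of squares
y_hat = [a + b1 * x1 + b2 * x2 for x1, x2 in zip(experience, violations)]
y_bar = sum(premium) / n
sse = sum((y - yh) ** 2 for y, yh in zip(premium, y_hat))   # error sum of squares
sst = sum((y - y_bar) ** 2 for y in premium)                # total sum of squares

s_e = sqrt(sse / (n - k - 1))                               # standard deviation of errors
r2 = 1 - sse / sst                                          # coefficient of multiple determination
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))          # adjusted for degrees of freedom

print(round(s_e, 4), round(r2, 3), round(adj_r2, 3))
```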
(c) To find the predicted auto insurance premium paid per month by a driver with seven
years of driving experience and three driving violations during the past three years,
we substitute \(x_1 = 7\) and \(x_2 = 3\) in the estimated regression model (5). Thus,
\[ \hat{y} = 110.28 - 2.7473x_1 + 16.106x_2 = 110.28 - 2.7473(7) + 16.106(3) = \$139.37 \]
Note that this value of \(\hat{y}\) is a point estimate of the predicted value of \(y\), which is denoted
by \(y_p\). The concept of the predicted value of \(y\) is the same as that for a simple
linear regression model discussed in Section 13.8.2 of Chapter 13.
(d) To obtain the point estimate of the expected (mean) auto insurance premium paid per
month by all drivers with 12 years of driving experience and four driving violations
during the past three years, we substitute \(x_1 = 12\) and \(x_2 = 4\) in the estimated
regression equation (5). Thus,
\[ \hat{y} = 110.28 - 2.7473x_1 + 16.106x_2 = 110.28 - 2.7473(12) + 16.106(4) = \$141.74 \]
This value of \(\hat{y}\) is a point estimate of the mean value of \(y\), which is denoted by \(E(y)\)
or \(\mu_{y|x_1 x_2}\). The concept of the mean value of \(y\) is the same as that for a simple linear
regression model discussed in Section 13.8.1 of Chapter 13.
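The substitutions in parts (c) and (d) amount to evaluating equation (5) at given values of the two independent variables. A minimal Python sketch, with the hypothetical function name predict_premium and the rounded coefficients of equation (5) hard-coded:

```python
def predict_premium(experience_years, violations):
    """Evaluate estimated regression equation (5) for one driver profile."""
    a, b1, b2 = 110.28, -2.7473, 16.106   # coefficients from equation (5)
    return a + b1 * experience_years + b2 * violations


# Part (c): 7 years of experience, 3 violations
print(round(predict_premium(7, 3), 2))

# Part (d): 12 years of experience, 4 violations
print(round(predict_premium(12, 4), 2))
```

The same number serves as the point estimate of both the predicted value and the mean value of y; the two interpretations differ only when interval estimates are made.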
Example 14–3 describes the procedure to make a confidence interval for an individ-
ual regression coefficient Bi.
EXAMPLE 14–3
Making a confidence interval for an individual coefficient of a multiple regression model.
Determine a 95% confidence interval for \(B_1\) (the coefficient of experience) for the multiple
regression of auto insurance premium on driving experience and the number of driving violations.
Use the MINITAB solution of Screen 14.3.
Solution To make a confidence interval for \(B_1\), we use the portion marked II in the MINITAB
solution of Screen 14.3. From that portion of the MINITAB solution,
\[ b_1 = -2.7473 \quad \text{and} \quad s_{b_1} = .9770 \]
The sample size is 12, which gives \(n = 12\). Because there are two independent variables,
\(k = 2\). Therefore,
\[ \text{Degrees of freedom} = n - k - 1 = 12 - 2 - 1 = 9 \]
From the t distribution table (Table V of Appendix C), the value of t for .025 area in the right
tail of the t distribution curve and 9 degrees of freedom is 2.262. Then, the 95% confidence
interval for \(B_1\) is
\[ b_1 \pm t\,s_{b_1} = -2.7473 \pm 2.262(.9770) = -2.7473 \pm 2.2100 = -4.9573 \text{ to } -.5373 \]
Thus, the 95% confidence interval for \(B_1\) is −4.96 to −.54. That is, we can state with 95%
confidence that for one extra year of driving experience, the monthly auto insurance premium
changes by an amount between −$4.96 and −$.54. Note that since both limits of the
confidence interval have negative signs, we can also state that for each extra year of driving
experience, the monthly auto insurance premium decreases by an amount between $.54
and $4.96.
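The interval arithmetic of Example 14–3 can be sketched in Python as follows. The function name coefficient_ci is hypothetical, and the critical value 2.262 is taken from the t table as in the example rather than computed; the estimate −2.7473 and its standard deviation .9770 are the values the chapter reads from the MINITAB output.

```python
def coefficient_ci(b, s_b, t_critical):
    """Confidence interval b +/- t * s_b for one regression coefficient."""
    margin = t_critical * s_b
    return b - margin, b + margin


n, k = 12, 2
df = n - k - 1            # 9 degrees of freedom
t_critical = 2.262        # t table value for .025 in the right tail, df = 9

lower, upper = coefficient_ci(-2.7473, 0.9770, t_critical)
print(df, round(lower, 2), round(upper, 2))
```

The same function applies to the constant term or to any other coefficient once its estimate and standard deviation are read from the output.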
By applying the procedure used in Example 14–3, we can make a confidence interval for any
of the coefficients (including the constant term) of a multiple regression model, such as A and
B2 in model (3). For example, the 95% confidence intervals for A and B2, respectively, are
Test Statistic for \(b_i\) The value of the test statistic t for \(b_i\) is calculated as
\[ t = \frac{b_i - B_i}{s_{b_i}} \]
The value of \(B_i\) is substituted from the null hypothesis. Usually, but not always, the null hypothesis
is \(H_0: B_i = 0\). The MINITAB solution contains this value of the t statistic.
Example 14–4 illustrates the procedure for testing a hypothesis about a single coefficient.
EXAMPLE 14–4
Testing a hypothesis about a coefficient of a multiple regression model.
Using the 2.5% significance level, can you conclude that the coefficient of the number of years
of driving experience in regression model (3) is negative? Use the MINITAB output obtained
in Example 14–1 and shown in Screen 14.3 to perform this test.
[Figure: t distribution curve with the rejection region of area \(\alpha = .025\) in the left tail; critical value of t = −2.262.]
Note that the observed value of t in Step 4 of Example 14–4 is obtained from the MINITAB
solution only if the null hypothesis is \(H_0: B_1 = 0\). However, if the null hypothesis is that \(B_1\) is
equal to a number other than zero, then the t value obtained from the MINITAB solution is no
longer valid. For example, suppose the null hypothesis in Example 14–4 is
\[ H_0: B_1 = -2 \]
and the alternative hypothesis is
\[ H_1: B_1 < -2 \]
In this case the observed value of t will be calculated as
\[ t = \frac{b_1 - B_1}{s_{b_1}} = \frac{-2.7473 - (-2)}{.9770} = -.765 \]
To calculate this value of t, the values of \(b_1\) and \(s_{b_1}\) are obtained from the MINITAB solution
of Screen 14.3. The value of \(B_1\) is substituted from \(H_0\).
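The effect of the hypothesized value on the test statistic can be illustrated with a short Python sketch. The function name t_statistic is hypothetical; the estimate, its standard deviation, and the left-tail critical value −2.262 (df = 9, 2.5% significance level) are the values used in this chapter.

```python
def t_statistic(b, hypothesized_b, s_b):
    """Observed t for testing H0: B = hypothesized_b against a one- or two-sided alternative."""
    return (b - hypothesized_b) / s_b


b1, s_b1 = -2.7473, 0.9770
critical_t = -2.262            # left-tailed test, .025 area, 9 degrees of freedom

t_zero = t_statistic(b1, 0, s_b1)       # H0: B1 = 0 (the value software reports)
t_minus2 = t_statistic(b1, -2, s_b1)    # H0: B1 = -2 (must be computed by hand)

# A left-tailed test rejects H0 when the observed t falls below the critical value
print(round(t_zero, 3), t_zero < critical_t)
print(round(t_minus2, 3), t_minus2 < critical_t)
```

With these inputs the first statistic falls in the rejection region and the second does not, which shows why the hypothesized value must be substituted explicitly when it is not zero.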
EXERCISES
CONCEPTS AND PROCEDURES
14.1 How are the coefficients of independent variables in a multiple regression model interpreted? Explain.
14.2 What are the degrees of freedom for a multiple regression model to make inferences about individual
parameters?
14.3 What kinds of relationships among independent variables are permissible and which ones are not
permissible in a linear multiple regression model?
14.4 Explain the meaning of the coefficient of multiple determination and the adjusted coefficient of mul-
tiple determination for a multiple regression model. What is the difference between the two?
14.5 What are the assumptions of a multiple regression model?
14.6 The following table gives data on variables y, x1, x2, and x3.
y     x1    x2    x3
8     18    38    74
11    26    25    64
19    34    24    47
21    38    44    31
7     13    12    79
23    49    48    35
16    28    38    42
27    59    52    18
9     14    17    71
13    21    39    57

y     x1    x2
24    98    52
14    51    69
18    74    63
31    108   35
10    33    88
29    119   54
26    99    51
33    141   31
13    47    67
27    103   41
26    111   46
Using MINITAB, find the regression of y on x1 and x2. Using the solution obtained, answer the follow-
ing questions.
a. Write the estimated regression equation.
b. Explain the meaning of the estimated regression coefficients of the independent variables.
c. What are the values of the standard deviation of errors, the coefficient of multiple determination,
and the adjusted coefficient of multiple determination?
d. What is the predicted value of \(y\) for \(x_1 = 87\) and \(x_2 = 54\)?
e. What is the point estimate of the expected (mean) value of \(y\) for all elements given that \(x_1 = 95\)
and \(x_2 = 49\)?
f. Construct a 99% confidence interval for the coefficient of x1.
g. Using the 1% significance level, test if the coefficient of x2 in the population regression model is
negative.
APPLICATIONS
14.8 The salaries of workers are expected to depend, among other factors, on the number of years
they have spent in school and their work experience. The following table gives information on the annual
salaries (in thousands of dollars) for 12 persons, the number of years each of them spent in school, and
the total number of years of work experience.
Salary 52 44 48 77 68 48 59 83 28 61 27 69
Schooling 16 12 13 20 18 16 14 18 12 16 12 16
Experience 6 10 15 8 11 2 12 4 6 9 2 18
Using MINITAB, find the regression of salary on schooling and experience. Using the solution obtained,
answer the following questions.
a. Write the estimated regression equation.
b. Explain the meaning of the estimates of the constant term and the regression coefficients of inde-
pendent variables.
c. What are the values of the standard deviation of errors, the coefficient of multiple determination,
and the adjusted coefficient of multiple determination?
d. How much salary is a person with 18 years of schooling and 7 years of work experience expected
to earn?
e. What is the point estimate of the expected (mean) salary for all people with 16 years of schooling
and 10 years of work experience?
f. Determine a 99% confidence interval for the coefficient of schooling.
g. Using the 1% significance level, test whether or not the coefficient of experience is positive.
14.9 The CTO Corporation has a large number of chain restaurants throughout the United States. The
research department at the company wanted to find if the sales of restaurants depend on the size of the
population within a certain area surrounding the restaurants and the mean income of households in those
areas. The company collected information on these variables for 11 restaurants. The following table gives
information on the weekly sales (in thousands of dollars) of these restaurants, the population (in thou-
sands) within five miles of the restaurants, and the mean annual income (in thousands of dollars) of the
households for those areas.
Sales 19 29 17 21 14 30 33 22 18 27 24
Population 21 15 32 18 47 69 29 43 75 39 53
Income 58 69 49 52 67 76 81 46 39 64 28
Using MINITAB, find the regression of sales on population and income. Using the solution obtained, answer
the following questions.
a. Write the estimated regression equation.
b. Explain the meaning of the estimates of the constant term and the regression coefficients of pop-
ulation and income.
c. What are the values of the standard deviation of errors, the coefficient of multiple determination,
and the adjusted coefficient of multiple determination?
d. What are the predicted sales for a restaurant with 50 thousand people living within a five-mile
area surrounding it and $55 thousand mean annual income of households in that area?
e. What is the point estimate of the expected (mean) sales for all restaurants with 45 thousand people
living within a five-mile area surrounding them and $46 thousand mean annual income of house-
holds living in those areas?
f. Determine a 95% confidence interval for the coefficient of income.
g. Using the 1% significance level, test whether or not the coefficient of population is different from
zero.
Glossary
Adjusted coefficient of multiple determination Denoted by \(\bar{R}^2\), it gives the proportion of SST that is explained by the multiple regression model and is adjusted for the degrees of freedom.
Coefficient of multiple determination Denoted by \(R^2\), it gives the proportion of SST that is explained by the multiple regression model.
First-order multiple regression model When each term in a regression model contains a single independent variable raised to the first power.
Least squares regression model The estimated regression model obtained by minimizing the sum of squared errors.
Multicollinearity When two or more independent variables in a regression model are highly correlated.
Multiple regression model A regression model that contains two or more independent variables.
Partial regression coefficients The coefficients of independent variables in a multiple regression model are called the partial regression coefficients because each of them gives the effect of the corresponding independent variable on the dependent variable when all other independent variables are held constant.
SSE (error sum of squares) The sum of the squared differences between the actual and predicted values of \(y\). It is the portion of SST that is not explained by the regression model.
SSR (regression sum of squares) The portion of SST that is explained by the regression model.
SST (total sum of squares) The sum of squared differences between actual \(y\) values and \(\bar{y}\).
Standard deviation of errors Also called the standard deviation of estimate, it is a measure of the variation among errors.
Self-Review Test
1. When using the t distribution to make inferences about a single parameter, the degrees of freedom
for a multiple regression model with k independent variables and a sample size of n are equal to
a. n k 1 b. n k 1 c. n k 1
2. The value of \(R^2\) is always in the range
a. zero to 1 b. −1 to 1 c. −1 to zero
3. The value of \(\bar{R}^2\) is
a. always positive b. always nonnegative c. can be positive, zero, or negative
4. What is the difference between the population multiple regression model and the estimated multiple
regression model?
5. Why are the regression coefficients in a multiple regression model called the partial regression
coefficients?
6. What is the difference between \(R^2\) and \(\bar{R}^2\)? Explain.
7. A real estate expert wanted to find the relationship between the sale price of houses and various
characteristics of the houses. She collected data on four variables, recorded in the table, for 13 houses that
were sold recently. The four variables are
Price = Sale price of a house in thousands of dollars
Lot size = Size of the lot in acres
Living area = Living area in square feet
Age = Age of a house in years
Using MINITAB, find the regression of price on lot size, living area, and age. Using the solution ob-
tained, answer the following questions.
a. Indicate whether you expect a positive or a negative relationship between the dependent variable
and each of the independent variables.
b. Write the estimated regression equation. Are the signs of the coefficients of independent vari-
ables obtained in the solution consistent with your expectations of part a?
c. Explain the meaning of the estimated regression coefficients of all independent variables.
d. What are the values of the standard deviation of errors, the coefficient of multiple determina-
tion, and the adjusted coefficient of multiple determination?
e. What is the predicted sale price of a house that has a lot size of 2.5 acres, a living area of 3000
square feet, and is 14 years old?
f. What is the point estimate of the mean sale price of all houses that have a lot size of 2.2 acres, a
living area of 2500 square feet, and are 7 years old?
g. Determine a 99% confidence interval for each of the coefficients of the independent variables.
h. Construct a 98% confidence interval for the constant term in the population regression model.
i. Using the 1% significance level, test whether or not the coefficient of lot size is positive.
j. At the 2.5% significance level, test if the coefficient of living area is positive.
k. At the 5% significance level, test if the coefficient of age is negative.
Mini-Project 14–1
Refer to the McDonald’s data set explained in Appendix B and given on the Web site of this text. Use
MINITAB to estimate the following regression model for that data set.
\[ y = A + B_1 x_1 + B_2 x_2 + B_3 x_3 + \epsilon \]
where \(y\) = calories
\(x_1\) = fat (measured in grams)
\(x_2\) = carbohydrate (measured in grams)
and \(x_3\) = protein (measured in grams)
Now research on the Internet or in a book to find the number of calories in one gram of fat, one gram of
carbohydrate, and one gram of protein.
a. Based on the information you obtain, write what the estimated regression equation should be.
b. Are the differences between your expectation in part a and the regression equation that you obtained
from MINITAB small or large?
c. Since each gram of fat is worth a specific number of calories, and the same is true for a gram of
carbohydrate, and for a gram of protein, one would expect that the predicted and observed values
of y would be the same for each food item, but that is not the case. The quantities of fat, carbo-
hydrates, and protein are reported in whole numbers. Explain why this causes the differences dis-
cussed in part b.