Topics Outline
Dummy Variables
Interaction Terms
Nonlinear Transformations
Quadratic Transformations
Logarithmic Transformations
Dummy Variables
Thus far, the examples we have considered involved quantitative explanatory variables such as
machine hours, production runs, price, and expenditures. In many situations, however, we must work
with categorical explanatory variables such as gender (male, female), method of payment
(cash, credit card, check), and so on. The way to include a categorical variable in the regression
model is to represent it by a dummy variable.
A dummy variable (also called an indicator or 0/1 variable) is a variable with two possible values, 0 and 1.
It equals 1 if a given observation is in a particular category and 0 if it is not.
If a given categorical explanatory variable has only two categories, then you can define one
dummy variable xd to represent the two categories as
    xd = 1 if the observation is in category 1
         0 otherwise
Example 1
Data collected from a sample of 15 houses are stored in Houses.xlsx.
    House   Value ($ thousands)   Size (thousands of square feet)   Presence of Fireplace
    1       234.4                 2.00                              Yes
    2       227.4                 1.71                              No
    ...     ...                   ...                               ...
    14      233.8                 1.89                              Yes
    15      226.8                 1.59                              No
(a) Develop a regression model for predicting the assessed value y of houses, based on the size x1
of the house and whether the house has a fireplace.
To include the categorical variable for the presence of a fireplace, the dummy variable is
defined as
    x2 = 1 if the house has a fireplace
         0 if the house does not have a fireplace
    Value   Size   Fireplace
    234.4   2.00   1
    227.4   1.71   0
    ...     ...    ...
    233.8   1.89   1
    226.8   1.59   0
Assuming that the slope of assessed value with the size of the house is the same for houses
that have and do not have a fireplace, the multiple regression model is
    y = β0 + β1x1 + β2x2 + ε
Here are the regression results for this model.
Regression Statistics
Multiple R            0.9006
R Square              0.8111
Adjusted R Square     0.7796
Standard Error        2.2626
Observations          15

ANOVA
             df   SS         MS         F         Significance F
Regression    2   263.7039   131.8520   25.7557   0.0000
Residual     12    61.4321     5.1193
Total        14   325.1360

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    200.0905        4.3517          45.9803   0.0000    190.6090    209.5719
Size          16.1858        2.5744           6.2871   0.0000     10.5766     21.7951
Fireplace      3.8530        1.2412           3.1042   0.0091      1.1486      6.5574
For houses with a fireplace, you substitute x2 = 1 into the regression equation:

    y = 200.0905 + 16.1858x1 + 3.8530(1)
    y = 203.9435 + 16.1858x1        (1)
    y = (a + b2) + b1x1

For houses without a fireplace, you substitute x2 = 0 into the regression equation:

    y = 200.0905 + 16.1858x1 + 3.8530(0)
    y = 200.0905 + 16.1858x1        (2)
    y = a + b1x1

Interpretation of a
The intercept a = 200.0905 is the predicted value of a house with 0 square feet and no
fireplace, $200,091, which obviously does not make sense in this context.
Interpretation of b1
The effect of x1 on y is the same for houses with or without a fireplace. When x1 increases
by one unit, y is expected to change by b1 units for houses with or without a fireplace.
Thus, holding constant whether a house has a fireplace, for each increase of 1 thousand
square feet in the size of the house, the predicted assessed value is estimated to increase by
16.1858 thousand dollars (i.e., $16,185.80).
Interpretation of b2
The slope of equations (1) and (2) is the same ( b1 = 16.1858 ), but the intercepts differ by an
amount b2 = 3.8530 . Geometrically, the two equations correspond to two parallel lines that
are a vertical distance b2 = 3.8530 apart. Therefore, the interpretation of b2 is that it
indicates the difference between the two intercepts 203.9435 and 200.0905.
Thus, holding constant the size of the house, the presence of a fireplace is estimated to
increase the predicted assessed value of the house by 3.8530 thousand dollars (i.e. $3,853).
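As a quick numerical check (a minimal Python sketch; the coefficient values are taken from the regression output above), the two parallel prediction lines differ by exactly b2 at any house size:

```python
# Predicted assessed value (in $ thousands) from the fitted model
# y = a + b1*size + b2*fireplace, using the coefficients reported above.
a, b1, b2 = 200.0905, 16.1858, 3.8530

def predicted_value(size, fireplace):
    """size in thousands of square feet; fireplace is 1 or 0."""
    return a + b1 * size + b2 * fireplace

with_fp = predicted_value(2.00, 1)     # 2,000 sq ft house with a fireplace
without_fp = predicted_value(2.00, 0)  # same size, no fireplace

print(f"{with_fp:.4f}")                # 236.3151
print(f"{without_fp:.4f}")             # 232.4621
print(f"{with_fp - without_fp:.4f}")   # 3.8530, i.e. b2, regardless of size
```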
(c) Does the regression equation provide a good fit for the observed data?
The test statistic for the slope of the size of the house with assessed value is 6.2871,
and the P-value is approximately zero.
The test statistic for presence of a fireplace is 3.1042, and the P-value is 0.0091.
Thus, each of the two variables makes a significant contribution to the model.
In addition, the coefficient of determination indicates that 81.11% of the variation in assessed
value is explained by variation in the size of the house and whether the house has a fireplace.
When a categorical variable has two categories (fireplace, no fireplace), one dummy variable is used.
When a categorical variable has m categories, m − 1 dummy variables are required, with each
dummy variable coded as 0 or 1.
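For instance, a three-category variable such as method of payment (cash, credit card, check) needs m − 1 = 2 dummy variables. A minimal Python sketch (the sample values are made up):

```python
# Two dummy variables encode a 3-category variable; the omitted
# category ("check") is the baseline, identified by both dummies being 0.
payment = ["cash", "credit card", "check", "cash", "check"]

x_cash = [1 if p == "cash" else 0 for p in payment]
x_credit = [1 if p == "credit card" else 0 for p in payment]

print(x_cash)    # [1, 0, 0, 1, 0]
print(x_credit)  # [0, 1, 0, 0, 0]
```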
Example 2
Define a multiple regression model using sales (y) as the response variable and price ( x1 ) and
package design as explanatory variables. Package design is a three-level categorical variable
with designs A, B, or C.
Solution:
To model the m = 3-level categorical variable package design, m − 1 = 3 − 1 = 2 dummy
variables are needed:

    x2 = 1 if package design A is used
         0 otherwise

    x3 = 1 if package design B is used
         0 otherwise
Therefore, the regression model is
    y = β0 + β1x1 + β2x2 + β3x3 + ε
Here the package design is coded as:

    Package design   A (x2)   B (x3)
    A                1        0
    B                0        1
    C                0        0
Interaction Terms
In the regression models discussed so far, the effect an explanatory variable has on the response
variable has been assumed to be independent of the other explanatory variables in the model.
An interaction occurs if the effect of an explanatory variable on the response variable changes
according to the value of a second explanatory variable.
For example, it is possible for advertising to have a large effect on the sales of a product when
the price of a product is low. However, if the price of the product is too high, increases in
advertising will not dramatically change sales. In other words, you cannot make general
statements about the effect of advertising on sales. The effect that advertising has on sales is
dependent on the price. Therefore, price and advertising are said to interact.
When interaction between two variables is present, we cannot study the effect of one variable on
the response y independently of the other variable. Meaningful conclusions can be developed
only if we consider the joint effect that both variables have on the response.
To account for the effect of two explanatory variables xi and xj acting together, an interaction
term (sometimes referred to as a cross-product term) xi·xj is added to the model.
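Constructing the cross-product term is mechanical: it is the elementwise product of the two explanatory variables. A minimal Python sketch using the four sample rows of the house data from Example 1:

```python
# Interaction (cross-product) term: xi * xj, row by row.
x1 = [2.00, 1.71, 1.89, 1.59]  # size (thousands of sq ft)
x2 = [1, 0, 1, 0]              # fireplace dummy
x1x2 = [size * fp for size, fp in zip(x1, x2)]
print(x1x2)  # [2.0, 0.0, 1.89, 0.0]
```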
Example 1 (Continued)
(d) Formulate a regression model to evaluate whether an interaction exists.
In the regression model, we assumed that the effect the size of the home has on the assessed
value is independent of whether the house has a fireplace. In other words, we assumed that the
slope of assessed value with size is the same for houses with fireplaces as it is for houses
without fireplaces. If these two slopes are different, an interaction exists between the size of
the home and the fireplace.
To evaluate whether an interaction exists, the following model is considered:
    y = β0 + β1x1 + β2x2 + β3x1x2 + ε

where x1x2 is the interaction term. With this new variable x3 = x1x2 in the model, the value
of x2 changes how x1 affects y.

If we factor out x1 we get:

    y = β0 + (β1 + β3x2)x1 + β2x2 + ε

Thus, each value of x2 yields a different slope in the relationship between y and x1.
Expressed in other words, the parameter β3 of the interaction term gives an adjustment to the
slope of x1 for the possible values of x2.
    Value   Size   Fireplace   Size*Fireplace
    234.4   2.00   1           2.00
    227.4   1.71   0           0.00
    ...     ...    ...         ...
    233.8   1.89   1           1.89
    226.8   1.59   0           0.00
Regression Statistics
Multiple R            0.9179
R Square              0.8426
Adjusted R Square     0.7996
Standard Error        2.1573
Observations          15

ANOVA
                 df   SS         MS        F         Significance F
Regression        3   273.9441   91.3147   19.6215   0.0001
Residual         11    51.1919    4.6538
Total            14   325.1360

                 Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept        212.9522        9.6122          22.1544   0.0000    191.7959    234.1084
Size               8.3624        5.8173           1.4375   0.1784     -4.4414     21.1662
Fireplace        -11.8404       10.6455          -1.1122   0.2898    -35.2710     11.5902
Size*Fireplace     9.5180        6.4165           1.4834   0.1661     -4.6046     23.6406
For houses with a fireplace, x2 = 1 and the regression equation is:

    y = (a + b2) + (b1 + b3)x1        (3)

For houses without a fireplace, x2 = 0 and the regression equation is:

    y = a + b1x1        (4)
Interpretation of b2
The coefficient of the indicator variable, b2 = −11.8404, provides a different intercept to
separate the houses with and without a fireplace at the origin (where Size = 0 sq ft).
Here it does not make much sense. Literally, it says that for houses with a size of 0 sq ft, the
value of a house without a fireplace is about $11,840 higher than the value of a house with a
fireplace.
Interpretation of b3
The coefficient of the interaction term b3 = 9.5180 says that the slope relating the size of the house
to its value is steeper by $9,518 for houses with a fireplace than for houses without a fireplace.
The two lines, (3) and (4), meet at (size, value) = (1.2440, 223.3550). Thus, the value of a
house with a size greater than 1,244 sq ft is higher when the house has a fireplace.
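Both of these facts can be reproduced from the reported coefficients (a quick Python check; values taken from the output above):

```python
# Interaction model: y = a + b1*x1 + b2*x2 + b3*x1*x2
a, b1, b2, b3 = 212.9522, 8.3624, -11.8404, 9.5180

slope_no_fireplace = b1    # slope when x2 = 0
slope_fireplace = b1 + b3  # slope when x2 = 1
print(f"{slope_fireplace:.4f}")  # 17.8804, steeper by b3 = 9.5180

# The lines cross where a + b1*x = (a + b2) + (b1 + b3)*x, i.e. x = -b2/b3.
x_meet = -b2 / b3
y_meet = a + b1 * x_meet
print(f"{x_meet:.4f}, {y_meet:.4f}")  # 1.2440, 223.3550
```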
(f) Does the interaction term make a significant contribution to the regression model?
To test for the existence of an interaction, the null and alternative hypotheses are:
    H0: β3 = 0
    Ha: β3 ≠ 0
The test statistic for the interaction of size and fireplace is 1.4834 with a P-value = 0.1661.
Because the P-value is large, you do not reject the null hypothesis.
Thus, although the slope adjustment b3 = 9.5180 implies that the value gap between houses with
and without a fireplace increases with house size, this effect is not statistically significant.
In other words, the interaction term does not make a significant contribution to the model,
given that size and presence of a fireplace are already included. Therefore, you can conclude
that the slope of assessed value with size is the same for houses with fireplaces and without
fireplaces.
Note:
If the correlation between interaction terms and the original variables in the regression is high,
collinearity problems can result. In a regression with several variables, the number of interaction
variables that could be created is very large and the likelihood of collinearity problems is high.
Therefore, it is wise not to use interaction variables indiscriminately. There should be some good
reason to suspect that two variables might be related or some specific question that can be
answered by an interaction variable before this type of variable is used.
Nonlinear Transformations
The general linear model has the form
    y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ε
It is linear in the sense that the right side of the equation is a constant plus a sum of products of
constants and variables. However, there is no requirement that the response variable y or the
explanatory variables x1 through x k be the original variables in the data set. Most often they are,
but they can also be transformations of original variables. You can transform the response
variable y or any of the explanatory variables, the xs. You can also do both.
The purpose of nonlinear transformations is usually to straighten out the points in a
scatterplot in order to overcome violations of the assumptions of regression or to make the form
of a model linear. They can also arise because of economic considerations. Among the many
transformations available are the square root, the reciprocal, the square, and transformations
involving the common logarithm (base 10) and the natural logarithm (base e).
The type of transformation to correct for curvilinearity is not always obvious. Different
transformations may be tried and the one that appears to do the best job chosen.
There may be theoretical results as well to support the use of certain transformations in certain cases.
As always, subject matter expertise is important in any analysis. If several different transformations
straighten out the data equally well, the one that is easiest to interpret is preferred.
The most frequently used nonlinear transformations in business and economic applications are
the quadratic and logarithmic transformations.
Quadratic Transformations
One of the most common nonlinear relationships between the response variable y and an explanatory
variable x is a curvilinear relationship in which y increases (or decreases) at a changing rate for
various values of x. The quadratic regression model defined below can be used to analyze this type
of relationship between x and y.
    y = β0 + β1x1 + β2x1² + ε
This model is similar to the multiple regression model except that the second explanatory
variable is the square of the first explanatory variable. Once again, the least squares method can
be used to compute sample regression coefficients a, b1, and b2 as estimates of the population
parameters β0, β1, and β2. The estimated regression equation for the quadratic model is

    y = a + b1x1 + b2x1²
In this equation, the first regression coefficient a represents the y intercept; the second
regression coefficient b1 represents the linear effect; and the third regression coefficient b2
represents the quadratic effect.
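The quadratic model can be fit by ordinary least squares, treating x and x² as two explanatory variables. A self-contained sketch using only NumPy, on made-up noise-free data (so the known coefficients are recovered exactly):

```python
import numpy as np

# Made-up data generated from y = 2 + 3x - 0.5x^2 (no noise, for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 2.0 + 3.0 * x - 0.5 * x**2

# Design matrix with columns [1, x, x^2]; least squares returns a, b1, b2.
X = np.column_stack([np.ones_like(x), x, x**2])
a, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

print(round(a, 4), round(b1, 4), round(b2, 4))  # 2.0 3.0 -0.5
```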
Example 3
Fly Ash
Fly ash is an inexpensive industrial waste by-product that can be used as a substitute for Portland cement,
a more expensive ingredient of concrete. How does adding fly ash affect the strength of concrete?
Batches of concrete were prepared in which the percentage of fly ash ranged from 0% to 60%.
Data were collected from a sample of 18 batches and stored in FlyAsh.xlsx.
    Batch   Strength (psi)   Fly Ash %
    1       4779             0
    2       4706             0
    ...     ...              ...
    17      5030             60
    18      4648             60
(a) A linear model has been fit to these data. Below is the regression output.
What do these results show?
Regression Statistics
Multiple R            0.4275
R Square              0.1827
Adjusted R Square     0.1317
Standard Error        460.7787
Observations          18

ANOVA
             df   SS             MS            F        Significance F
Regression    1    759618.0571   759618.0571   3.5778   0.0768
Residual     16   3397072.4429   212317.0277
Total        17   4156690.5000

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    4924.5952      213.2991         23.0877   0.0000    4472.4213   5376.7691
Fly Ash%       10.4171        5.5074          1.8915   0.0768      -1.2579     22.0922

[Scatterplot of Strength (psi) versus Fly Ash %, and plot of residuals versus predicted strength]
The t test indicates that the linear term is significant at the 0.10 (but not at the 0.05) level of
significance (P-value = 0.0768). The extremely low coefficient of determination
(r² = 0.1827) shows that the linear model explains only about 18% of the variation in strength.
Moreover, the scatterplot of the data and the plot of residuals versus fitted values indicate that
a linear model is not appropriate for these data. For example, the scatterplot of Strength versus
Fly Ash % indicates an initial increase in the strength of the concrete as the percentage of fly
ash increases. The strength appears to level off and then drop after achieving maximum
strength at about 40% fly ash. Strength for 50% fly ash is slightly below strength at 40%,
but strength at 60% is substantially below strength at 50%.
Therefore, to estimate strength based on fly ash percentage, a quadratic model seems more
appropriate for these data, not a linear model.
(b) The data for the quadratic model are:

    Batch   Strength (psi)   Fly Ash %
    1       4779             0
    2       4706             0
    ...     ...              ...
    17      5030             60
    18      4648             60
[Plot of residuals versus predicted strength for the quadratic model]
ANOVA
             df   SS             MS             F         Significance F
Regression    2   2695473.4897   1347736.7448   13.8351   0.0004
Residual     15   1461217.0103     97414.4674
Total        17   4156690.5000

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    4486.3611      174.7531         25.6726   0.0000    4113.8836   4858.8386
Fly Ash%       63.0052       12.3725          5.0923   0.0001      36.6338     89.3767
Fly Ash%^2     -0.8765        0.1966         -4.4578   0.0005      -1.2955     -0.4574
The curved pattern in the residual plot is gone. The points in the residual plot show a random
scatter with an approximately equal spread above and below the horizontal 0 line.
From the regression output, a = 4,486.3611, b1 = 63.0052, and b2 = −0.8765.
Therefore, the quadratic regression equation is

    Predicted Strength = 4,486.3611 + 63.0052(Fly Ash %) − 0.8765(Fly Ash %)²

[Scatterplot of Strength (psi) versus Fly Ash % with the fitted quadratic curve]
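One practical payoff of the fitted curve: because b2 is negative, the parabola opens downward, so predicted strength peaks where the derivative b1 + 2·b2·x equals zero. A quick Python check with the estimated coefficients (this calculation is added here as an interpretation; it is not part of the original output):

```python
# Vertex of the fitted parabola y = a + b1*x + b2*x^2 is at x = -b1 / (2*b2).
a, b1, b2 = 4486.3611, 63.0052, -0.8765

x_max = -b1 / (2 * b2)
strength_max = a + b1 * x_max + b2 * x_max**2
print(f"{x_max:.1f}")         # 35.9 (% fly ash)
print(f"{strength_max:.0f}")  # 5619 (psi)
```

The fitted maximum near 36% fly ash is broadly consistent with the maximum of about 40% read off the scatterplot earlier.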
    Model                          r²     se
    y = β0 + β1x1 + ε              18%    461
    y = β0 + β1x1 + β2x1² + ε      65%    312
The coefficient of determination r² = 0.6485 shows that about 65% of the variation in strength is
explained by the quadratic relationship between strength and the percentage of fly ash.
The percentage of variation explained by the linear model is much smaller: about 18%.
Another indicator that the regression has been improved by adding the quadratic term is the reduction
in the standard error se from about 461 in the linear model to about 312 in the quadratic model.
Thus, based on our findings in (a) through (f), we can conclude that the quadratic model is
significantly better than the linear model for representing the relationship between strength
and fly ash percentage.
Note:
Although this was not the case in this example, it can happen that in a quadratic model the
quadratic term is significant and the linear term is not. In such situations (for statistical reasons
not discussed here), the general rule is to keep the linear term despite its insignificance.
Logarithmic Transformations
If scatterplots suggest nonlinear relationships, there are many nonlinear transformations of y
and/or the xs that could be tried in a regression analysis. The reason that logarithmic
transformations are arguably the most frequently used nonlinear transformations, besides the fact
that they often produce good fits, is that they can be interpreted naturally in terms of percentage
changes. In real studies, this interpretability is an important advantage over other potential
nonlinear transformations.
The log transformations put values on a different scale that compresses large distances so that
they are more comparable to smaller distances.
It is common in business and economic applications to use natural (base e) logarithms,
although the base used is usually not important.
Interpretation of a slope coefficient b when log is used
Case 1: Predicted y = a + ⋯ + b·log x + ⋯

The expected change in y (increase or decrease depending on the sign of b) when x increases by
1% is approximately 0.01b.

Example: Predicted y = 5.67 + 0.34 log x
This regression equation implies that every 1% increase in x (for example, from 200 to 202) is
accompanied by about a (0.01)(0.34) = 0.0034 increase in y.

Case 2: Predicted log y = a + ⋯ + b·log x + ⋯

The expected change in y (increase or decrease depending on the sign of b) when x increases by
1% is approximately b%.

Example: Predicted log y = 5.67 + 0.34 log x
For every 1% increase in x, y is expected to increase by approximately 0.34%.
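The "approximately 0.01b per 1%" rule of Case 1 can be verified numerically (a minimal Python check, assuming natural logarithms):

```python
import math

# Case 1 example: predicted y = 5.67 + 0.34*log(x). Compare x = 200 and x = 202.
b = 0.34
y_at_200 = 5.67 + b * math.log(200)
y_at_202 = 5.67 + b * math.log(202)

change = y_at_202 - y_at_200  # exact change: b * log(202/200)
approx = 0.01 * b             # rule of thumb: 0.01b
print(f"{change:.5f}")  # 0.00338
print(f"{approx:.5f}")  # 0.00340
```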
Example 4
Fuel Consumption
The file Fuel_Consumption.xlsx contains data on the fuel consumption in gallons per capita for
each of the 50 states and Washington, DC. Here is part of the data.
    State             FuelCon   Population   Area     Density
    Alabama           547.92    4486508       50750     88.4041
    Alaska            440.38     643786      570374      1.1287
    ...               ...       ...          ...       ...
    Wyoming           715.55     498703       97105      5.1357
    Washington D.C.   289.99     570898          61   9358.9836
The goal is to develop a regression equation to predict fuel consumption based on the population
density (defined as population/area).
The scatterplots of FuelCon versus Density and of FuelCon versus LogDensity are shown below.

[Scatterplot of FuelCon versus Density]

[Scatterplot of FuelCon versus LogDensity]
    Model                     r²      se
    y = β0 + β1x + ε          20.6%   65.17
    y = β0 + β1·log x + ε     27.8%   62.16
The regression results indicate that using LogDensity as the explanatory variable produces a
better model fit than the regression using Density.
The estimated regression equation for the logarithmic model is
    Predicted FuelCon = 597.1867 − 24.5308 LogDensity
This equation shows that if the population density increases by 1%, the average fuel
consumption will decrease by (0.01)(24.5308) = 0.2453 gallons per capita.
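The same rule can be checked against the fuel consumption equation (a sketch that assumes LogDensity is the natural logarithm of Density; the density value used is illustrative):

```python
import math

def predicted_fuelcon(density):
    # Estimated equation from the regression output above.
    return 597.1867 - 24.5308 * math.log(density)

d = 100.0  # an illustrative density value
change = predicted_fuelcon(1.01 * d) - predicted_fuelcon(d)
print(f"{change:.4f}")  # -0.2441, close to the rule-of-thumb -0.2453
```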
Regression of FuelCon on Density:

ANOVA
             df   SS            MS           F         Significance F
Regression    1    53960.7466   53960.7466   12.7062   0.0008
Residual     49   208093.4001    4246.8041
Total        50   262054.1466

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    495.6283       9.4811           52.2752   0.0000    476.5752    514.6814
Density       -0.0251       0.0070           -3.5646   0.0008     -0.0392     -0.0109

Regression of FuelCon on LogDensity:

ANOVA
             df   SS            MS           F         Significance F
Regression    1    72748.3136   72748.3136   18.8302   0.0001
Residual     49   189305.8330    3863.3843
Total        50   262054.1466

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    597.1867       26.9612          22.1499   0.0000    543.0062    651.3671
LogDensity   -24.5308        5.6531          -4.3394   0.0001    -35.8911    -13.1705
Example 5
Imports and GDP
The gross domestic product (GDP) and the dollar amount of total imports (Imports), both in
billions of dollars, for 25 countries are saved in Imports_and_GDP.xlsx.
    Country            Imports      GDP
    Argentina            20.300     391.000
    Australia            68.000     528.000
    ...                 ...         ...
    United Kingdom      330.100    1520.000
    United States      1148.000   10082.000
The objective is to find an equation showing the relationship between Imports (y) and GDP (x).
The scatterplot of Imports versus GDP shows that this is not a linear relationship.
[Scatterplot of Imports versus GDP]
At the left-hand side of the x axis and the bottom of the y axis, the values are clumped together.
Moving from left to right on the x axis, the values become more spread out. The same thing
happens moving up the y axis: the values become progressively more spread out.
This suggests the use of a log transformation for both the x and y variables.
As the scatterplot of LogImports versus LogGDP below shows, the relationship appears much
closer to linear.
[Scatterplot of LogImports versus LogGDP]
The results for the regression of LogImports on LogGDP are shown below.
The regression of Imports on GDP is not shown for comparison, because the response
variable y has been transformed to log y and the usual comparisons are not valid. In particular,
the interpretations of se and r² change because the units of the response variable are
completely different. For example, an increase in r² when the natural logarithm transformation is
applied to y does not necessarily indicate an improved model. Because of this, it is difficult
to compare this regression to any model that uses y as the response variable.
Note that transformations of the explanatory variables do not create this type of problem.
It is only when the y variable is transformed that comparison becomes more difficult.
Regression Statistics
Multiple R            0.9168
R Square              0.8404
Adjusted R Square     0.8335
Standard Error        0.9142
Observations          25

ANOVA
             df   SS         MS         F          Significance F
Regression    1   101.2551   101.2551   121.1527   0.0000
Residual     23    19.2226     0.8358
Total        24   120.4777

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    -1.1275        0.4346           -2.5941   0.0162    -2.0265     -0.2284
LogGDP        0.8670        0.0788           11.0069   0.0000     0.7041      1.0300
The fitted model has the form

    log y = β0 + β1·log x + ε

and the estimated regression equation is

    Predicted LogImports = −1.1275 + 0.8670 LogGDP

Following Case 2, the slope indicates that for every 1% increase in GDP, Imports are expected
to increase by approximately 0.87%.
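In this log-log model the slope behaves like an elasticity. A quick numerical check of the "b% per 1%" interpretation from Case 2, using the estimated coefficients and assuming natural logarithms:

```python
import math

# Estimated log-log model: log(Imports) = -1.1275 + 0.8670*log(GDP)
def predicted_imports(gdp):
    return math.exp(-1.1275 + 0.8670 * math.log(gdp))

gdp = 1000.0  # an illustrative GDP value (billions of dollars)
pct_change = (predicted_imports(1.01 * gdp) / predicted_imports(gdp) - 1) * 100
print(f"{pct_change:.3f}")  # 0.866 -- a 1% rise in GDP raises predicted Imports by about 0.87%
```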