Topics Outline
Dummy Variables
Interaction Terms
Nonlinear Transformations
Quadratic Transformations
Logarithmic Transformations
Dummy Variables
Thus far, the examples we have considered involved quantitative explanatory variables such as
machine hours, production runs, price, and expenditures. In many situations, however, we must work
with categorical explanatory variables such as gender (male, female), method of payment
(cash, credit card, check), and so on. The way to include a categorical variable in the regression
model is to represent it by a dummy variable.
A dummy variable (also called an indicator or 0/1 variable) is a variable with two possible values, 0 and 1.
It equals 1 if a given observation is in a particular category and 0 if it is not.
If a given categorical explanatory variable has only two categories, then you can define one
dummy variable xd to represent the two categories as
    xd = 1 if the observation is in category 1
         0 otherwise
Example 1
Data collected from a sample of 15 houses are stored in Houses.xlsx.
    House   Value ($ thousands)   Size (thousands of square feet)   Presence of Fireplace
    1       234.4                 2.00                              Yes
    2       227.4                 1.71                              No
    ...     ...                   ...                               ...
    14      233.8                 1.89                              Yes
    15      226.8                 1.59                              No
(a) Develop a regression model for predicting the assessed value y of houses, based on the size x1
of the house and whether the house has a fireplace.
To include the categorical variable for the presence of a fireplace, the dummy variable is
defined as
    x2 = 1 if the house has a fireplace
         0 if the house does not have a fireplace
    Value   Size   Fireplace
    234.4   2.00   1
    227.4   1.71   0
    ...     ...    ...
    233.8   1.89   1
    226.8   1.59   0
Assuming that the slope of assessed value with the size of the house is the same for houses
that have and do not have a fireplace, the multiple regression model is
    y = β0 + β1x1 + β2x2 + ε
Here are the regression results for this model.
Regression Statistics
Multiple R            0.9006
R Square              0.8111
Adjusted R Square     0.7796
Standard Error        2.2626
Observations          15

ANOVA
             df   SS         MS         F         Significance F
Regression    2   263.7039   131.8520   25.7557   0.0000
Residual     12    61.4321     5.1193
Total        14   325.1360

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    200.0905        4.3517          45.9803   0.0000    190.6090    209.5719
Size          16.1858        2.5744           6.2871   0.0000     10.5766     21.7951
Fireplace      3.8530        1.2412           3.1042   0.0091      1.1486      6.5574
For houses with a fireplace, you substitute x2 = 1 into the regression equation:

    y = 200.0905 + 16.1858x1 + 3.8530(1)
    y = 203.9435 + 16.1858x1        (1)
    y = (a + b2) + b1x1

For houses without a fireplace, you substitute x2 = 0 into the regression equation:

    y = 200.0905 + 16.1858x1 + 3.8530(0)
    y = 200.0905 + 16.1858x1        (2)
    y = a + b1x1

Interpretation of a
The intercept a = 200.0905 is the predicted value of a house with 0 square feet and no
fireplace, $200,091, which obviously does not make sense in this context.
Interpretation of b1
The effect of x1 on y is the same for houses with or without a fireplace. When x1 increases
by one unit, y is expected to change by b1 units for houses with or without a fireplace.
Thus, holding constant whether a house has a fireplace, for each increase of 1 thousand
square feet in the size of the house, the predicted assessed value is estimated to increase by
16.1858 thousand dollars (i.e., $16,185.80).
Interpretation of b2
The slope of equations (1) and (2) is the same ( b1 = 16.1858 ), but the intercepts differ by an
amount b2 = 3.8530 . Geometrically, the two equations correspond to two parallel lines that
are a vertical distance b2 = 3.8530 apart. Therefore, the interpretation of b2 is that it
indicates the difference between the two intercepts 203.9435 and 200.0905.
Thus, holding constant the size of the house, the presence of a fireplace is estimated to
increase the predicted assessed value of the house by 3.8530 thousand dollars (i.e. $3,853).
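As a quick numerical check (a minimal Python sketch; the coefficient values are taken from the regression output above), the two parallel prediction lines differ by exactly b2 at any house size:

```python
# Predicted assessed value (in $ thousands) from the fitted model
# y = a + b1*size + b2*fireplace, using the coefficients reported above.
a, b1, b2 = 200.0905, 16.1858, 3.8530

def predicted_value(size, fireplace):
    """size in thousands of square feet; fireplace is 1 or 0."""
    return a + b1 * size + b2 * fireplace

with_fp = predicted_value(2.00, 1)     # 2,000 sq ft house with a fireplace
without_fp = predicted_value(2.00, 0)  # same size, no fireplace

print(f"{with_fp:.4f}")                # 236.3151
print(f"{without_fp:.4f}")             # 232.4621
print(f"{with_fp - without_fp:.4f}")   # 3.8530, i.e. b2, regardless of size
```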
(c) Does the regression equation provide a good fit for the observed data?
The test statistic for the slope of the size of the house with assessed value is 6.2871,
and the P-value is approximately zero.
The test statistic for presence of a fireplace is 3.1042, and the P-value is 0.0091.
Thus, each of the two variables makes a significant contribution to the model.
In addition, the coefficient of determination indicates that 81.11% of the variation in assessed
value is explained by variation in the size of the house and whether the house has a fireplace.
When a categorical variable has two categories (fireplace, no fireplace), one dummy variable is used.
When a categorical variable has m categories, m − 1 dummy variables are required, with each
dummy variable coded as 0 or 1.
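For instance, a three-category variable such as method of payment (cash, credit card, check) needs m − 1 = 2 dummy variables. A minimal Python sketch (the sample values are made up):

```python
# Two dummy variables encode a 3-category variable; the omitted
# category ("check") is the baseline, identified by both dummies being 0.
payment = ["cash", "credit card", "check", "cash", "check"]

x_cash = [1 if p == "cash" else 0 for p in payment]
x_credit = [1 if p == "credit card" else 0 for p in payment]

print(x_cash)    # [1, 0, 0, 1, 0]
print(x_credit)  # [0, 1, 0, 0, 0]
```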
Example 2
Define a multiple regression model using sales (y) as the response variable and price ( x1 ) and
package design as explanatory variables. Package design is a three-level categorical variable
with designs A, B, or C.
Solution:
To model the m = 3-level categorical variable package design, m − 1 = 3 − 1 = 2 dummy
variables are needed:

    x2 = 1 if package design A is used
         0 otherwise

    x3 = 1 if package design B is used
         0 otherwise
Therefore, the regression model is
    y = β0 + β1x1 + β2x2 + β3x3 + ε
Here the package design is coded as:

    Package design   A (x2)   B (x3)
    A                1        0
    B                0        1
    C                0        0
Interaction Terms
In the regression models discussed so far, the effect an explanatory variable has on the response
variable has been assumed to be independent of the other explanatory variables in the model.
An interaction occurs if the effect of an explanatory variable on the response variable changes
according to the value of a second explanatory variable.
For example, it is possible for advertising to have a large effect on the sales of a product when
the price of a product is low. However, if the price of the product is too high, increases in
advertising will not dramatically change sales. In other words, you cannot make general
statements about the effect of advertising on sales. The effect that advertising has on sales is
dependent on the price. Therefore, price and advertising are said to interact.
When interaction between two variables is present, we cannot study the effect of one variable on
the response y independently of the other variable. Meaningful conclusions can be developed
only if we consider the joint effect that both variables have on the response.
To account for the effect of two explanatory variables xi and xj acting together, an interaction
term (sometimes referred to as a cross-product term) xi·xj is added to the model.
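Constructing the cross-product term is mechanical: it is the elementwise product of the two explanatory variables. A minimal Python sketch using the four sample rows of the house data from Example 1:

```python
# Interaction (cross-product) term: xi * xj, row by row.
x1 = [2.00, 1.71, 1.89, 1.59]  # size (thousands of sq ft)
x2 = [1, 0, 1, 0]              # fireplace dummy
x1x2 = [size * fp for size, fp in zip(x1, x2)]
print(x1x2)  # [2.0, 0.0, 1.89, 0.0]
```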
Example 1 (Continued)
(d) Formulate a regression model to evaluate whether an interaction exists.
In the regression model, we assumed that the effect the size of the home has on the assessed
value is independent of whether the house has a fireplace. In other words, we assumed that the
slope of assessed value with size is the same for houses with fireplaces as it is for houses
without fireplaces. If these two slopes are different, an interaction exists between the size of
the home and the fireplace.
To evaluate whether an interaction exists, the following model is considered:
    y = β0 + β1x1 + β2x2 + β3x1x2 + ε

where x1x2 is the interaction term. With this new variable x3 = x1x2 in the model, the value
of x2 changes how x1 affects y.

If we factor out x1 we get:

    y = β0 + (β1 + β3x2)x1 + β2x2 + ε

Thus, each value of x2 yields a different slope in the relationship between y and x1.
Expressed in other words, the parameter β3 of the interaction term gives an adjustment to the
slope of x1 for the possible values of x2.
    Value   Size   Fireplace   Size*Fireplace
    234.4   2.00   1           2.00
    227.4   1.71   0           0.00
    ...     ...    ...         ...
    233.8   1.89   1           1.89
    226.8   1.59   0           0.00
Regression Statistics
Multiple R            0.9179
R Square              0.8426
Adjusted R Square     0.7996
Standard Error        2.1573
Observations          15

ANOVA
                 df   SS         MS        F         Significance F
Regression        3   273.9441   91.3147   19.6215   0.0001
Residual         11    51.1919    4.6538
Total            14   325.1360

                 Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept        212.9522        9.6122          22.1544   0.0000    191.7959    234.1084
Size               8.3624        5.8173           1.4375   0.1784     -4.4414     21.1662
Fireplace        -11.8404       10.6455          -1.1122   0.2898    -35.2710     11.5902
Size*Fireplace     9.5180        6.4165           1.4834   0.1661     -4.6046     23.6406
For houses with a fireplace, x2 = 1 and the regression equation is:

    y = (a + b2) + (b1 + b3)x1        (3)

For houses without a fireplace, x2 = 0 and the regression equation is:

    y = a + b1x1        (4)
Interpretation of b2
The coefficient of the indicator variable, b2 = −11.8404, provides a different intercept to
separate the houses with and without a fireplace at the origin (where Size = 0 sq ft).
Here it does not make much sense. Literally, it says that for houses with a size of 0 sq ft, the
value of a house without a fireplace is about $11,840 higher than the value of a house with a
fireplace.
Interpretation of b3
The coefficient of the interaction term b3 = 9.5180 says that the slope relating the size of the house
to its value is steeper by $9,518 for houses with a fireplace than for houses without a fireplace.
The two lines, (3) and (4), meet at (size, value) = (1.2440, 223.3550). Thus, the value of a
house with a size greater than 1,244 sq ft is higher when the house has a fireplace.
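Both of these facts can be reproduced from the reported coefficients (a quick Python check; values taken from the output above):

```python
# Interaction model: y = a + b1*x1 + b2*x2 + b3*x1*x2
a, b1, b2, b3 = 212.9522, 8.3624, -11.8404, 9.5180

slope_no_fireplace = b1    # slope when x2 = 0
slope_fireplace = b1 + b3  # slope when x2 = 1
print(f"{slope_fireplace:.4f}")  # 17.8804, steeper by b3 = 9.5180

# The lines cross where a + b1*x = (a + b2) + (b1 + b3)*x, i.e. x = -b2/b3.
x_meet = -b2 / b3
y_meet = a + b1 * x_meet
print(f"{x_meet:.4f}, {y_meet:.4f}")  # 1.2440, 223.3550
```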
(f) Does the interaction term make a significant contribution to the regression model?
To test for the existence of an interaction, the null and alternative hypotheses are:
    H0: β3 = 0
    Ha: β3 ≠ 0
The test statistic for the interaction of size and fireplace is 1.4834 with a P-value = 0.1661.
Because the P-value is large, you do not reject the null hypothesis.
Thus, although the slope adjustment b3 = 9.5180 implies that the value gap between houses with
and without a fireplace increases with house size, this effect is not statistically significant.
In other words, the interaction term does not make a significant contribution to the model,
given that size and presence of a fireplace are already included. Therefore, you can conclude
that the slope of assessed value with size is the same for houses with fireplaces and without
fireplaces.
Note:
If the correlation between interaction terms and the original variables in the regression is high,
collinearity problems can result. In a regression with several variables, the number of interaction
variables that could be created is very large and the likelihood of collinearity problems is high.
Therefore, it is wise not to use interaction variables indiscriminately. There should be some good
reason to suspect that two variables might be related or some specific question that can be
answered by an interaction variable before this type of variable is used.
Nonlinear Transformations
The general linear model has the form
    y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ε
It is linear in the sense that the right side of the equation is a constant plus a sum of products of
constants and variables. However, there is no requirement that the response variable y or the
explanatory variables x1 through x k be the original variables in the data set. Most often they are,
but they can also be transformations of original variables. You can transform the response
variable y or any of the explanatory variables, the xs. You can also do both.
The purpose of nonlinear transformations is usually to straighten out the points in a
scatterplot in order to overcome violations of the assumptions of regression or to make the form
of a model linear. They can also arise because of economic considerations. Among the many
transformations available are the square root, the reciprocal, the square, and transformations
involving the common logarithm (base 10) and the natural logarithm (base e).
The type of transformation to correct for curvilinearity is not always obvious. Different
transformations may be tried and the one that appears to do the best job chosen.
There may be theoretical results as well to support the use of certain transformations in certain cases.
As always, subject matter expertise is important in any analysis. If several different transformations
straighten out the data equally well, the one that is easiest to interpret is preferred.
The most frequently used nonlinear transformations in business and economic applications are
the quadratic and logarithmic transformations.
Quadratic Transformations
One of the most common nonlinear relationships between the response variable y and an explanatory
variable x is a curvilinear relationship in which y increases (or decreases) at a changing rate for
various values of x. The quadratic regression model defined below can be used to analyze this type
of relationship between x and y.
    y = β0 + β1x1 + β2x1² + ε
This model is similar to the multiple regression model except that the second explanatory
variable is the square of the first explanatory variable. Once again, the least squares method can
be used to compute sample regression coefficients a, b1, and b2 as estimates of the population
parameters β0, β1, and β2. The estimated regression equation for the quadratic model is

    y = a + b1x1 + b2x1²
In this equation, the first regression coefficient a represents the y intercept; the second
regression coefficient b1 represents the linear effect; and the third regression coefficient b2
represents the quadratic effect.
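The quadratic model can be fit by ordinary least squares, treating x and x² as two explanatory variables. A self-contained sketch using only NumPy, on made-up noise-free data (so the known coefficients are recovered exactly):

```python
import numpy as np

# Made-up data generated from y = 2 + 3x - 0.5x^2 (no noise, for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 2.0 + 3.0 * x - 0.5 * x**2

# Design matrix with columns [1, x, x^2]; least squares returns a, b1, b2.
X = np.column_stack([np.ones_like(x), x, x**2])
a, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

print(round(a, 4), round(b1, 4), round(b2, 4))  # 2.0 3.0 -0.5
```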
Example 3
Fly Ash
Fly ash is an inexpensive industrial waste by-product that can be used as a substitute for Portland cement,
a more expensive ingredient of concrete. How does adding fly ash affect the strength of concrete?
Batches of concrete were prepared in which the percentage of fly ash ranged from 0% to 60%.
Data were collected from a sample of 18 batches and stored in FlyAsh.xlsx.
    Batch   Strength (psi)   Fly Ash %
    1       4779             0
    2       4706             0
    ...     ...              ...
    17      5030             60
    18      4648             60
(a) A linear model has been fit to these data. Below is the regression output.
What do these results show?
Regression Statistics
Multiple R            0.4275
R Square              0.1827
Adjusted R Square     0.1317
Standard Error        460.7787
Observations          18

ANOVA
             df   SS             MS            F        Significance F
Regression    1    759618.0571   759618.0571   3.5778   0.0768
Residual     16   3397072.4429   212317.0277
Total        17   4156690.5000

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    4924.5952      213.2991         23.0877   0.0000    4472.4213   5376.7691
Fly Ash%       10.4171        5.5074          1.8915   0.0768      -1.2579     22.0922

[Scatterplot of Strength (psi) versus Fly Ash %, and plot of residuals versus predicted strength]
The t test indicates that the linear term is significant at the 0.10 (but not at the 0.05) level of
significance (P-value = 0.0768). The extremely low coefficient of determination
(r² = 0.1827) shows that the linear model explains only about 18% of the variation in strength.
Moreover, the scatterplot of the data and the plot of residuals versus fitted values indicate that
a linear model is not appropriate for these data. For example, the scatterplot of Strength versus
Fly Ash % indicates an initial increase in the strength of the concrete as the percentage of fly
ash increases. The strength appears to level off and then drop after achieving maximum
strength at about 40% fly ash. Strength for 50% fly ash is slightly below strength at 40%,
but strength at 60% is substantially below strength at 50%.
Therefore, to estimate strength based on fly ash percentage, a quadratic model seems more
appropriate for these data, not a linear model.
(b) The data for the quadratic model are:

    Batch   Strength (psi)   Fly Ash %
    1       4779             0
    2       4706             0
    ...     ...              ...
    17      5030             60
    18      4648             60
[Plot of residuals versus predicted strength for the quadratic model]
ANOVA
             df   SS             MS             F         Significance F
Regression    2   2695473.4897   1347736.7448   13.8351   0.0004
Residual     15   1461217.0103     97414.4674
Total        17   4156690.5000

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    4486.3611      174.7531         25.6726   0.0000    4113.8836   4858.8386
Fly Ash%       63.0052       12.3725          5.0923   0.0001      36.6338     89.3767
Fly Ash%^2     -0.8765        0.1966         -4.4578   0.0005      -1.2955     -0.4574
The curved pattern in the residual plot is gone. The points in the residual plot show a random
scatter with an approximately equal spread above and below the horizontal 0 line.
From the regression output, a = 4,486.3611, b1 = 63.0052, and b2 = −0.8765.
Therefore, the quadratic regression equation is

    Predicted Strength = 4,486.3611 + 63.0052(Fly Ash %) − 0.8765(Fly Ash %)²

[Scatterplot of Strength (psi) versus Fly Ash % with the fitted quadratic curve]
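One practical payoff of the fitted curve: because b2 is negative, the parabola opens downward, so predicted strength peaks where the derivative b1 + 2·b2·x equals zero. A quick Python check with the estimated coefficients (this calculation is added here as an interpretation; it is not part of the original output):

```python
# Vertex of the fitted parabola y = a + b1*x + b2*x^2 is at x = -b1 / (2*b2).
a, b1, b2 = 4486.3611, 63.0052, -0.8765

x_max = -b1 / (2 * b2)
strength_max = a + b1 * x_max + b2 * x_max**2
print(f"{x_max:.1f}")         # 35.9 (% fly ash)
print(f"{strength_max:.0f}")  # 5619 (psi)
```

The fitted maximum near 36% fly ash is broadly consistent with the maximum of about 40% read off the scatterplot earlier.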
    Model                          r²     se
    y = β0 + β1x1 + ε              18%    461
    y = β0 + β1x1 + β2x1² + ε      65%    312
The coefficient of determination r² = 0.6485 shows that about 65% of the variation in strength is
explained by the quadratic relationship between strength and the percentage of fly ash.
The percentage of variation explained by the linear model is much smaller: about 18%.
Another indicator that the regression has been improved by adding the quadratic term is the reduction
in the standard error se from about 461 in the linear model to about 312 in the quadratic model.
Thus, based on our findings in (a) through (f), we can conclude that the quadratic model is
significantly better than the linear model for representing the relationship between strength
and fly ash percentage.
Note:
Although this was not the case in this example, it can happen that in a quadratic model the
quadratic term is significant and the linear term is not. In such situations (for statistical reasons
not discussed here), the general rule is to keep the linear term despite its insignificance.
Logarithmic Transformations
If scatterplots suggest nonlinear relationships, there are many nonlinear transformations of y
and/or the xs that could be tried in a regression analysis. The reason that logarithmic
transformations are arguably the most frequently used nonlinear transformations, besides the fact
that they often produce good fits, is that they can be interpreted naturally in terms of percentage
changes. In real studies, this interpretability is an important advantage over other potential
nonlinear transformations.
The log transformations put values on a different scale that compresses large distances so that
they are more comparable to smaller distances.
It is common in business and economic applications to use natural (base e) logarithms,
although the base used is usually not important.
Interpretation of a slope coefficient b when log is used
Case 1: Predicted y = a + ⋯ + b·log x + ⋯

The expected change in y (increase or decrease depending on the sign of b) when x increases by
1% is approximately 0.01b.

Example: Predicted y = 5.67 + 0.34 log x
This regression equation implies that every 1% increase in x (for example, from 200 to 202) is
accompanied by about a (0.01)(0.34) = 0.0034 increase in y.

Case 2: Predicted log y = a + ⋯ + b·log x + ⋯

The expected change in y (increase or decrease depending on the sign of b) when x increases by
1% is approximately b%.

Example: Predicted log y = 5.67 + 0.34 log x
For every 1% increase in x, y is expected to increase by approximately 0.34%.
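The "approximately 0.01b per 1%" rule of Case 1 can be verified numerically (a minimal Python check, assuming natural logarithms):

```python
import math

# Case 1 example: predicted y = 5.67 + 0.34*log(x). Compare x = 200 and x = 202.
b = 0.34
y_at_200 = 5.67 + b * math.log(200)
y_at_202 = 5.67 + b * math.log(202)

change = y_at_202 - y_at_200  # exact change: b * log(202/200)
approx = 0.01 * b             # rule of thumb: 0.01b
print(f"{change:.5f}")  # 0.00338
print(f"{approx:.5f}")  # 0.00340
```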
Example 4
Fuel Consumption
The file Fuel_Consumption.xlsx contains data on the fuel consumption in gallons per capita for
each of the 50 states and Washington, DC. Here is part of the data.
    State             FuelCon   Population   Area     Density
    Alabama           547.92    4486508       50750     88.4041
    Alaska            440.38     643786      570374      1.1287
    ...               ...       ...          ...       ...
    Wyoming           715.55     498703       97105      5.1357
    Washington D.C.   289.99     570898          61   9358.9836
The goal is to develop a regression equation to predict fuel consumption based on the population
density (defined as population/area).
The scatterplots of FuelCon versus Density and of FuelCon versus LogDensity are shown below.

[Scatterplot of FuelCon versus Density]

[Scatterplot of FuelCon versus LogDensity]
    Model                     r²      se
    y = β0 + β1x + ε          20.6%   65.17
    y = β0 + β1·log x + ε     27.8%   62.16
The regression results indicate that using LogDensity as the explanatory variable produces a
better model fit than the regression using Density.
The estimated regression equation for the logarithmic model is
    Predicted FuelCon = 597.1867 − 24.5308 LogDensity
This equation shows that if the population density increases by 1%, the average fuel
consumption will decrease by (0.01)(24.5308) = 0.2453 gallons per capita.
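The same rule can be checked against the fuel consumption equation (a sketch that assumes LogDensity is the natural logarithm of Density; the density value used is illustrative):

```python
import math

def predicted_fuelcon(density):
    # Estimated equation from the regression output above.
    return 597.1867 - 24.5308 * math.log(density)

d = 100.0  # an illustrative density value
change = predicted_fuelcon(1.01 * d) - predicted_fuelcon(d)
print(f"{change:.4f}")  # -0.2441, close to the rule-of-thumb -0.2453
```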
Regression of FuelCon on Density:

ANOVA
             df   SS            MS           F         Significance F
Regression    1    53960.7466   53960.7466   12.7062   0.0008
Residual     49   208093.4001    4246.8041
Total        50   262054.1466

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    495.6283       9.4811           52.2752   0.0000    476.5752    514.6814
Density       -0.0251       0.0070           -3.5646   0.0008     -0.0392     -0.0109

Regression of FuelCon on LogDensity:

ANOVA
             df   SS            MS           F         Significance F
Regression    1    72748.3136   72748.3136   18.8302   0.0001
Residual     49   189305.8330    3863.3843
Total        50   262054.1466

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    597.1867       26.9612          22.1499   0.0000    543.0062    651.3671
LogDensity   -24.5308        5.6531          -4.3394   0.0001    -35.8911    -13.1705
Example 5
Imports and GDP
The gross domestic product (GDP) and the dollar amount of total imports (Imports), both in
billions of dollars, for 25 countries are saved in Imports_and_GDP.xlsx.
    Country            Imports      GDP
    Argentina            20.300     391.000
    Australia            68.000     528.000
    ...                 ...         ...
    United Kingdom      330.100    1520.000
    United States      1148.000   10082.000
The objective is to find an equation showing the relationship between Imports (y) and GDP (x).
The scatterplot of Imports versus GDP shows that this is not a linear relationship.
[Scatterplot of Imports versus GDP]
At the left-hand side of the x axis and the bottom of the y axis, the values are clumped together.
Moving from left to right on the x axis, the values become more spread out. The same thing
happens moving up the y axis: the values become progressively more spread out.
This suggests the use of a log transformation for both the x and y variables.
As the scatterplot of LogImports versus LogGDP below shows, the relationship appears much
closer to linear.
[Scatterplot of LogImports versus LogGDP]
The results for the regression of LogImports on LogGDP are shown below.
The regression of Imports on GDP is not shown for comparison, because the response
variable y has been transformed to log y and the usual comparisons are not valid. In particular,
the interpretations of se and r² change because the units of the response variable are
completely different. For example, an increase in r² when the natural logarithm transformation is
applied to y does not necessarily indicate an improved model. Because of this, it is difficult
to compare this regression to any model that uses y as the response variable.
Note that transformations of the explanatory variables do not create this type of problem.
It is only when the y variable is transformed that comparison becomes more difficult.
Regression Statistics
Multiple R            0.9168
R Square              0.8404
Adjusted R Square     0.8335
Standard Error        0.9142
Observations          25

ANOVA
             df   SS         MS         F          Significance F
Regression    1   101.2551   101.2551   121.1527   0.0000
Residual     23    19.2226     0.8358
Total        24   120.4777

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    -1.1275        0.4346           -2.5941   0.0162    -2.0265     -0.2284
LogGDP        0.8670        0.0788           11.0069   0.0000     0.7041      1.0300
The fitted model has the form

    log y = β0 + β1·log x + ε

and the estimated regression equation is

    Predicted LogImports = −1.1275 + 0.8670 LogGDP

Following Case 2, the slope indicates that for every 1% increase in GDP, Imports are expected
to increase by approximately 0.87%.
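In this log-log model the slope behaves like an elasticity. A quick numerical check of the "b% per 1%" interpretation from Case 2, using the estimated coefficients and assuming natural logarithms:

```python
import math

# Estimated log-log model: log(Imports) = -1.1275 + 0.8670*log(GDP)
def predicted_imports(gdp):
    return math.exp(-1.1275 + 0.8670 * math.log(gdp))

gdp = 1000.0  # an illustrative GDP value (billions of dollars)
pct_change = (predicted_imports(1.01 * gdp) / predicted_imports(gdp) - 1) * 100
print(f"{pct_change:.3f}")  # 0.866 -- a 1% rise in GDP raises predicted Imports by about 0.87%
```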