
ISyE4031 Regression and Forecasting

Practice Problems 2
Fall 2014
1. In mobile ad hoc computer networks, messages must be forwarded from computer to computer
until they reach their destinations. The data overhead is the number of bytes of information that
must be transmitted, and it measures the success of a protocol. A regression study was
performed to predict the data overhead (kB) by using the independent variables average speed
(m/s), pause time (s), and link change rate (LCR, 100/s). The studied model was
y = β0 + β1x1 + β2x2 + β3x3 + β4x1x3 + β5x2x3 + β6x1^2 + ε   (x1 = speed, x2 = pause time, x3 = LCR)
The regression equation is
Overhead = 406 - 1.93 Speed - 0.387 Pause + 1.02 LCR - 0.0618 Speed*LCR
+ 0.290 Pause*LCR + 0.0401 Speed**2
Predictor   Coef      SE Coef   T      P      Significant? (Y/N)   Keep?   Why?
Constant    406.491   9.032     45.01  0.000
Speed       -1.932    1.137     -1.70  0.107
Pause       -0.3866   0.2202    -1.76  0.096
LCR         1.0153    0.9208     1.10  0.285
Speed*LCR   -0.06182  0.02102   -2.94  0.009
Pause*LCR   0.28985   0.04259    6.81  0.000
Speed**2    0.04013   0.02056    1.95  0.067

a. By looking at the output and considering the p values, state whether each predictor should be
kept in the model at α = 0.05, and why. Fill in the table.
b. What would the expected data overhead be if the average speed is 15 m/s, the pause time is 10
s, and the LCR is 20?
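One way to carry out the part (b) calculation is simply to plug the reported coefficients into the fitted equation; the short Python sketch below does exactly that (using the coefficients as printed in the output above, so the result is only as precise as the rounding).

# Sketch: evaluate the fitted overhead equation at a given design point.
# Coefficients are copied from the regression output above.
def predicted_overhead(speed, pause, lcr):
    return (406.491
            - 1.932 * speed
            - 0.3866 * pause
            + 1.0153 * lcr
            - 0.06182 * speed * lcr
            + 0.28985 * pause * lcr
            + 0.04013 * speed ** 2)

# Part (b): speed = 15 m/s, pause time = 10 s, LCR = 20
print(predicted_overhead(15, 10, 20))  # approximately 442 kB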
2. The weekly sales (in $1000 per week) for fast-food outlets in each of four cities were
collected. The objective is to model sales (y) as a function of traffic flow (in 1,000s of cars),
adjusting for city-to-city variations that might be due to size or other market conditions. The
linear regression model is therefore:
y = β0 + β1x + β2C1 + β3C2 + β4C3 + ε, where x is traffic flow,

C1 = 1 if city 1, 0 otherwise;   C2 = 1 if city 2, 0 otherwise;   C3 = 1 if city 3, 0 otherwise,

and City 4 is the base level.
The following output was obtained.


SALES = 1.08 - 1.22 CITY_1 - 0.531 CITY_2 - 1.08 CITY_3 + 0.104 TRAFFIC
Predictor   Coef      SE Coef   T      P
Constant    1.0834    0.3210     3.37  0.003
CITY_1      -1.2158   0.2054    -5.92  0.000
CITY_2      -0.5308   0.2848    -1.86  0.078
CITY_3      -1.0765   0.2265    -4.75  0.000
TRAFFIC     0.103673  0.004094  25.32  0.000

S = 0.362307   R-Sq = 97.9%   R-Sq(adj) = 97.5%

Analysis of Variance
Source          DF  SS       MS      F       P
Regression       4  116.656  29.164  222.17  0.000
Residual Error  19    2.494   0.131
Total           23  119.150

Answer the following questions by using α = 0.05.


a. Are the mean weekly sales identical for all four cities? Why? That is, support your answer by
hypothesis testing and expected values. Use α = 0.05.
b. Does City 2 have more expected sales than City 4? State the hypothesis that you consider
explicitly.
c. Which city has the most expected sales? Explain by referring to the coefficients, hypothesis
testing, and expected values.
d. What are the expected sales in City 3 when the traffic flow is recorded as 60,000 cars?
e. Suppose that you modeled and solved the problem as a simple linear regression model:
y = β0 + β1x + ε, where y: Sales and x: Traffic flow. In order to decide on the significance of
the cities, compare this reduced model and the complete model by using a partial (nested) F test.
State the hypothesis explicitly.
SALES = 0.018 + 0.108 TRAFFIC
Analysis of Variance
Source          DF  SS      MS      F       P
Regression       1  111.34  111.34  313.75  0.000
Residual Error  22    7.81    0.35
Total           23  119.15
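For part (e), the partial (nested) F statistic can be assembled directly from the two ANOVA tables; the Python sketch below uses the printed SSE and degrees-of-freedom values, so it is a numerical illustration rather than output from the original analysis.

# Sketch: partial (nested) F test comparing the reduced and complete models.
# SSE and df values are read off the two ANOVA tables above.
sse_reduced, df_reduced = 7.81, 22      # SALES ~ TRAFFIC only
sse_complete, df_complete = 2.494, 19   # SALES ~ TRAFFIC + city dummies

f_stat = ((sse_reduced - sse_complete) / (df_reduced - df_complete)) / (
    sse_complete / df_complete)
print(f_stat)  # roughly 13.5

# The critical value F(0.05; 3, 19) is about 3.13, so a statistic this large
# would reject H0: the three city coefficients are all zero.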

3. Screening Techniques.
a. In a linear regression study, five predictors are being evaluated by using stepwise regression.
By considering the quantities given below, perform the first two steps of the stepwise regression,
i.e., write down the selected variable(s) in Step 1 and Step 2 (a sketch of the entry rule is given
after the candidate tables). Assume α_entry = α_remove = 0.10.
Step 1 candidates (one predictor):
Predictors   t-stat for each Xj   p-value for each Xj
X1           11.21                0.000
X2           22.90                0.000
X3           2.75                 0.015
X4           22.41                0.000
X5           10.71                0.000

Step 2 candidates (two predictors):
Predictors   t-stat for each Xj   p-value for each Xj
X1, X2       3.91, 9.92           0.002, 0.000
X1, X3       10.41, 2.37          0.000, 0.033
X1, X4       3.84, 9.67           0.002, 0.000
X1, X5       3.06, 2.74           0.009, 0.016
X2, X3       24.43, -3.40         0.000, 0.004
X2, X4       0.72, -0.41          0.486, 0.690
X2, X5       7.19, 1.34           0.000, 0.200
X3, X4       -3.31, 23.86         0.005, 0.000
X3, X5       2.02, 9.48           0.063, 0.000
X4, X5       6.97, 1.39           0.000, 0.250

Step 1:
Step 2:
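As referenced in part (a), here is a minimal Python sketch of the entry decision applied at each step. It is a generic illustration, not Minitab's exact algorithm: a candidate enters only if its p-value is below α_entry, and among eligible candidates the one with the largest |t| is chosen.

# Sketch of the entry decision used at each stepwise step.
alpha_enter = 0.10

def choose_entry(candidates):
    """candidates: list of (predictor, t_stat, p_value) for the term whose
    entry is being considered at the current step."""
    eligible = [c for c in candidates if c[2] < alpha_enter]
    if not eligible:
        return None  # nothing qualifies, so the procedure stops
    return max(eligible, key=lambda c: abs(c[1]))

# Step 1: each predictor tried on its own (values from the table above)
step1 = [("X1", 11.21, 0.000), ("X2", 22.90, 0.000), ("X3", 2.75, 0.015),
         ("X4", 22.41, 0.000), ("X5", 10.71, 0.000)]
print(choose_entry(step1))  # the eligible predictor with the largest |t|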

b. Consider the best subset output given below. What is the best subset of variables? State your
reasons.
Response is y

Vars  R-Sq  R-Sq(adj)  Mallows Cp  S        x1 x2 x3 x4 x5 x6
1     51.2  45.1       144.9       2496.8
1     47.6  41.1       156.0       2587.1
2     93.9  92.1        14.9        945.09
2     79.6  73.8        59.1       1726.6
3     97.2  95.9         6.6        686.37
3     95.8  93.7        11.1        848.68
4     98.8  97.9         3.7        492.34
4     97.6  95.8         7.3        694.63
5     99.0  97.7         5.2        510.32
5     98.8  97.3         5.7        549.94
6     99.0  97.1         7.0        574.97   X  X  X  X  X  X
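As a reminder when reading the table: for a subset model with p estimated coefficients (including the intercept), Mallows Cp = SSEp / MSE_full - (n - 2p), and a good subset has Cp close to p together with a high R-Sq(adj) and a small S. By construction the full six-variable model has Cp = p exactly, which matches the Cp = 7.0 shown in the last row.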

4. Residual analysis and diagnostics.


a. A linear regression model, y = β0 + β1x1 + β2x2 + ε, was fitted to 24 heat treatment data
points, and several diagnostic statistics were produced. Consider the following four of those 24
observations and state whether they are unusual or not. If an observation is unusual, is it an
outlier, a high-leverage point, and/or influential? State your reasons explicitly by referring to the
diagnostic statistics (rule-of-thumb cutoffs are sketched after the answer blanks below).
Observation  y      SRES1     TRES1     HI1       COOK1
1            0.013  -1.47200  -1.50366  0.053974  0.04121
2            0.068  -3.17206  -3.33242  0.689127  3.48608
3            0.056  -0.12900  -0.12679  0.269057  1.02204
4            0.014  -2.87807  -2.90441  0.058732  0.11762

Observation #1:
Observation #2:
Observation #3:
Observation #4:
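As mentioned in part (a), one common set of rule-of-thumb cutoffs follows directly from the problem dimensions (n = 24 observations, k = 2 predictors); the Python sketch below applies the usual textbook conventions, which are a reasonable default rather than the only possible choices.

# Sketch: rule-of-thumb cutoffs for the diagnostic statistics in Problem 4.
n, k = 24, 2                        # observations and predictors

leverage_cutoff = 2 * (k + 1) / n   # 2(k+1)/n = 0.25 here
residual_cutoff = 2.0               # |SRES| or |TRES| beyond about 2 is suspicious
cooks_cutoff = 1.0                  # Cook's D near or above 1 suggests influence

def flag(sres, tres, hi, cook):
    return {
        "outlier": abs(sres) > residual_cutoff or abs(tres) > residual_cutoff,
        "high_leverage": hi > leverage_cutoff,
        "influential": cook > cooks_cutoff,
    }

# Example: observation 2 from the table above
print(flag(-3.17206, -3.33242, 0.689127, 3.48608))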
5. The data on annual sales revenues (in billions of dollars) of the Eastman Kodak Company over
a 25-year period were considered. The following time series plot depicts the actual annual
revenues (y) over the 25 years.
[Time Series Plot of Y: annual revenues (roughly 5 to 20 billion dollars) plotted against the year index, 1-25]
A quadratic trend model, yt = β0 + β1t + β2t^2 + εt, was studied and the following results were
obtained.
The regression equation is
Y = 3.15 + 1.46 t - 0.0411 t^2
Predictor   Coef       SE Coef    T      P
Constant    3.152      1.137       2.77  0.011
t           1.4617     0.2194      6.66  0.000
t^2         -0.041122  0.008832   -4.66  0.000

S = 2.04889   R-Sq = 80.6%   R-Sq(adj) = 78.9%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression       2  384.04  192.02  45.74  0.000
Residual Error  22   92.36    4.20
Total           24  476.39

a. Would you consider this model and the predictors useful (significant)? Support your answer
by looking at the p values and the corresponding tests. Assume α = 0.10.
b. Would you change your answer in part (a) after seeing the residual plots given below, by
considering the assumptions on the errors? In other words, check whether the results justify the
basic error term assumptions, i.e., εi ~ i.i.d. Normal(0, σ²). State the hypotheses that you
are testing and the residual plot that you are referring to explicitly, and use α = 0.10.
[Residual Plots for Y: Normal Probability Plot, Residuals Versus Fits, Histogram of the Residuals, Residuals Versus Observation Order]
A-D = 0.397, p-value = 0.343


Durbin-Watson statistic = 0.532552
- Each εi has a normal distribution:
- Each εi has an identical distribution:
- Each εi is independent:
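For the independence check, it may help to see how the Durbin-Watson statistic is actually computed; the Python sketch below uses purely illustrative residuals, since the problem reports only the summary value DW = 0.533 (values far below 2 generally point to positive autocorrelation), alongside the Anderson-Darling normality result above.

# Sketch: Durbin-Watson statistic for a residual series (illustrative data only).
import numpy as np

def durbin_watson(e):
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residuals just to show the calculation; they are NOT the
# residuals from the Kodak model.
example_residuals = np.array([1.2, 0.9, 0.7, -0.1, -0.8, -1.0, -0.5, 0.4])
print(durbin_watson(example_residuals))  # values near 2 suggest independence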
6. In a chemical experiment, the true relationship between yield (y) and reaction time (x) is
assumed to be: y = β1 e^(β2 x) ε.

a. First, apply a transformation to the equation so that a simple linear regression solution can be
found.
b. Then, consider the solution to the transformed model: ŷ* = -2.3 + 0.6x.
What are the estimates of β1 and β2, and the predicted value of y (i.e., ŷ) when x = 5?
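A sketch of the back-transformation used here: taking natural logarithms of the assumed model gives ln(y) = ln(β1) + β2 x + ln(ε), so fitting y* = ln(y) against x yields an intercept that estimates ln(β1) and a slope that estimates β2; the estimate of β1 is then recovered by exponentiating the fitted intercept, and the prediction on the original scale is ŷ = e^(ŷ*) at the chosen x.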
7. Answer the following short-answer questions.
a. If the variance inflation factor (VIF) is not less than 1, we can say that there exists
multicollinearity between independent variables. True or False?
b. An unusual data point can be both an outlier and a high leverage point. True or False?
c. What can we detect when the t-tests for all (or nearly all) parameters are non-significant,
whereas the F-test for overall model adequacy (H0: β1 = β2 = ... = βk = 0) is significant?
d. Suppose a fitted linear regression model is ŷ = 15 + 2x1 - 3x2 + x2^2 + 4x1x2. What is the
amount of change in the expected value of y for every one-unit increase in x1, holding x2 fixed
at 2?
