For students of the Executive Program in Business Analytics and Business Intelligence, organized by IIM Ranchi
Edited by: Dr. K. Maddulety, NITIE, Mumbai. Mail: koila@rediffmail.com
11-1 Using Statistics

[Figure: lines and planes. In simple regression the model is a line with slope $\beta_1$ and intercept $\beta_0$; in multiple regression it becomes a plane (and, with more variables, a surface) over the $(x_1, x_2)$ plane.]
The population regression model of Y on two explanatory variables:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

where $\beta_0$ is the Y-intercept of the regression surface and each $\beta_i$, $i = 1, 2, \ldots, k$, is the slope of the regression surface (sometimes called the response surface) with respect to $X_i$.

Model assumptions:
1. $\varepsilon \sim N(0, \sigma^2)$, independent of other errors.
2. The variables $X_i$ are uncorrelated with the error term.
[Figure: data points with a fitted line (simple regression) and a fitted plane (multiple regression).]

In a simple regression model, the least-squares estimators minimize the sum of squared errors from the estimated regression line:

$\hat{y} = b_0 + b_1 x$

In a multiple regression model, the least-squares estimators minimize the sum of squared errors from the estimated regression plane:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$
The estimated regression relationship:

$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$

where $\hat{Y}$ is the predicted value of Y, the value lying on the estimated regression surface. The terms $b_0, \ldots, b_k$ are the least-squares estimates of the population regression parameters $\beta_i$.

The actual, observed value of Y is the predicted value plus an error:

$y_j = b_0 + b_1 x_{1j} + b_2 x_{2j} + \cdots + b_k x_{kj} + e_j$
Least-Squares Estimation: The 2-Variable Normal Equations
Minimizing the sum of squared errors with respect to the estimated coefficients $b_0$, $b_1$, and $b_2$ yields the following normal equations:

$\sum y = n b_0 + b_1 \sum x_1 + b_2 \sum x_2$
$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2$
$\sum x_2 y = b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2$
Example 11-1
  Y   X1   X2
 72   12    5
 76   11    8
 78   15    6
 70   10    5
 68   11    3
 80   16    9
 82   14   12
 65    8    4
 62    8    3
 90   18   10
---  ---  ---
743  123   65
Normal equations:

$743 = 10 b_0 + 123 b_1 + 65 b_2$
$9382 = 123 b_0 + 1615 b_1 + 869 b_2$
$5040 = 65 b_0 + 869 b_1 + 509 b_2$

Solving:

$b_0 = 47.164942$
$b_1 = 1.5990404$
$b_2 = 1.1487479$

Estimated regression equation:

$\hat{Y} = 47.164942 + 1.5990404 X_1 + 1.1487479 X_2$
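Since the normal equations form a 3×3 linear system, they can be solved directly; a minimal sketch in Python (NumPy assumed available):

```python
# Solve the Example 11-1 normal equations as the 3x3 system A b = c.
import numpy as np

A = np.array([[10.0, 123.0, 65.0],
              [123.0, 1615.0, 869.0],
              [65.0, 869.0, 509.0]])
c = np.array([743.0, 9382.0, 5040.0])

b = np.linalg.solve(A, c)
print(b)  # ≈ [47.164942, 1.5990404, 1.1487479]
```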
Decomposition of the total deviation:

Total deviation: $Y - \bar{Y}$
Regression deviation: $\hat{Y} - \bar{Y}$
Error deviation: $Y - \hat{Y}$

Total Deviation = Regression Deviation + Error Deviation
SST = SSR + SSE
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square
Regression            SSR              k                    MSR = SSR/k
Error                 SSE              n - (k+1)            MSE = SSE/(n-(k+1))
Total                 SST              n - 1                MST = SST/(n-1)

F Ratio: F = MSR/MSE

[Figure: F(2, 7) density; with $\alpha$ = 0.01 the critical point is $F_{0.01}$ = 9.55.]
The test statistic, F = 86.34, is greater than the critical point of F(2, 7) for any common level of significance (p-value ≈ 0), so the null hypothesis is rejected, and we may conclude that the dependent variable is related to one or more of the independent variables.
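The quantities behind this F test can be reproduced from the Example 11-1 data. A minimal sketch in Python (NumPy assumed available), reusing the coefficients from the normal equations; it also reproduces the s, R-sq, and R-sq(adj) values reported for Example 11-1 below:

```python
# ANOVA quantities for Example 11-1: SST, SSR, SSE, F, R-sq, R-sq(adj), s.
import numpy as np

y = np.array([72, 76, 78, 70, 68, 80, 82, 65, 62, 90], dtype=float)
x1 = np.array([12, 11, 15, 10, 11, 16, 14, 8, 8, 18], dtype=float)
x2 = np.array([5, 8, 6, 5, 3, 9, 12, 4, 3, 10], dtype=float)

b0, b1, b2 = 47.164942, 1.5990404, 1.1487479   # from the normal equations
y_hat = b0 + b1 * x1 + b2 * x2

n, k = len(y), 2
sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
ssr = sst - sse

mse = sse / (n - (k + 1))
print((ssr / k) / mse)             # F ratio ≈ 86.34
print(ssr / sst)                   # R-sq ≈ 0.961
print(1 - mse / (sst / (n - 1)))   # R-sq(adj) ≈ 0.950
print(np.sqrt(mse))                # s ≈ 1.911
```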
[Figure: errors $y - \hat{y}$ about the estimated regression surface.]

The mean square error and the standard error of estimate:

$MSE = \dfrac{SSE}{n-(k+1)} = \dfrac{\sum (y - \hat{y})^2}{n-(k+1)}, \qquad s = \sqrt{MSE}$
The multiple coefficient of determination, $R^2$, measures the proportion of the variation in the dependent variable that is explained by the combination of the independent variables in the multiple regression model:

$R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
2
The adjusted multiple coefficient of determination , R , is the coefficient of
determination with the SSE and SST divided by their respective degrees of freedom:
SSE
R
= 1-
(n - (k + 1))
SST
(n - 1)
Example 11-1:

s = 1.911
R-sq = 96.1%
R-sq(adj) = 95.0%
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square
Regression            SSR              k                    MSR = SSR/k
Error                 SSE              n-(k+1) = n-k-1      MSE = SSE/(n-(k+1))
Total                 SST              n-1                  MST = SST/(n-1)

F Ratio: $F = \dfrac{MSR}{MSE} = \dfrac{R^2/k}{(1-R^2)/(n-(k+1))}$

$R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$

$\bar{R}^2 = 1 - \dfrac{SSE/(n-(k+1))}{SST/(n-1)} = 1 - \dfrac{MSE}{MST}$
Test statistic for the significance of a single regression slope parameter:

$t = \dfrac{b_i - 0}{s(b_i)}$, with $n - (k+1)$ degrees of freedom.

Coefficient   Estimate   Standard Error   t-Statistic
Constant      53.12      5.43               9.783 *
X1             2.03      0.22               9.227 *
X2             5.60      1.30               4.308 *
X3            10.35      6.88               1.504
X4             3.45      2.70               1.259
X5            -4.25      0.38             -11.184 *

n = 150, $t_{0.025}$ = 1.96; * denotes a t-statistic exceeding the critical point in absolute value.
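A minimal sketch recomputing the t-statistics in this table (t = estimate / standard error) and flagging those exceeding $t_{0.025}$ = 1.96 in absolute value:

```python
# t = estimate / standard error; flag |t| > 1.96 (n = 150, two-sided 5% test).
estimates = {
    "Constant": (53.12, 5.43),
    "X1": (2.03, 0.22),
    "X2": (5.60, 1.30),
    "X3": (10.35, 6.88),
    "X4": (3.45, 2.70),
    "X5": (-4.25, 0.38),
}
for name, (b, se) in estimates.items():
    t = b / se
    flag = " *" if abs(t) > 1.96 else ""
    print(f"{name}: t = {t:.3f}{flag}")
```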
Outliers and Influential Observations

[Figure: an outlier pulls the fitted regression line away from the line fitted without the outlier.]

[Figure: a point with a large value of $x_i$ can be an influential observation: the regression line when all data are included suggests a relationship, even though there is no relationship in the cluster of the remaining points.]
Fit      Stdev.Fit   Residual   St.Resid
2.6420   0.1288      -0.0420    -0.14 X
2.6438   0.1234      -0.0438    -0.14 X
4.5949   0.0676       0.9051     2.80 R
4.6311   0.0651      -0.9311    -2.87 R
5.1317   0.0648      -0.8317    -2.57 R
4.9474   0.0668       0.6526     2.02 R

R denotes an obs. with a large st. resid.
X denotes an obs. whose X value gives it large influence.
Estimated Regression Plane for Example 11-1

[Figure: the fitted plane over Advertising (8.00 to 18.00) and Promotions (up to 12); fitted values range from 63.42 to 89.76.]
Prediction in Multiple Regression
A (1 - α)100% prediction interval for a value of Y given values of $X_i$:

$\hat{y} \pm t_{(\alpha/2,\, n-(k+1))}\sqrt{s^2(\hat{y}) + MSE}$

A (1 - α)100% confidence interval for the conditional mean of Y given values of $X_i$:

$\hat{y} \pm t_{(\alpha/2,\, n-(k+1))}\, s[\hat{E}(Y)]$
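Both intervals can be computed with statsmodels (assumed installed); a minimal sketch on the Example 11-1 data, predicting at $X_1$ = 10, $X_2$ = 5:

```python
# Confidence interval for E(Y) and prediction interval for Y, Example 11-1.
import numpy as np
import statsmodels.api as sm

y = np.array([72, 76, 78, 70, 68, 80, 82, 65, 62, 90], dtype=float)
X = np.column_stack([
    [12, 11, 15, 10, 11, 16, 14, 8, 8, 18],
    [5, 8, 6, 5, 3, 9, 12, 4, 3, 10],
]).astype(float)

model = sm.OLS(y, sm.add_constant(X)).fit()
pred = model.get_prediction([[1.0, 10.0, 5.0]])  # leading 1 for the constant
# mean_ci_* columns: confidence interval for E(Y); obs_ci_*: prediction interval
print(pred.summary_frame(alpha=0.05))
```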
EXAMPLE 11-3
Picturing Qualitative Variables in Regression

[Figure: two parallel regression lines plotted against $X_1$: for $X_2 = 1$ (intercept $b_0 + b_2$) and for $X_2 = 0$ (intercept $b_0$).]

A regression with one quantitative variable ($X_1$) and one qualitative variable ($X_2$):

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$

A multiple regression with two quantitative variables ($X_1$ and $X_2$) and one qualitative variable ($X_3$):

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$
[Figure: three parallel regression lines plotted against $X_1$, with intercepts $b_0$, $b_0 + b_2$, and $b_0 + b_3$.]

A regression with one quantitative variable ($X_1$) and two qualitative variables ($X_2$ and $X_3$):

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$
A qualitative variable with r levels or categories is represented with (r - 1) 0/1 (dummy) variables.

Category    X2   X3
Adventure    0    0
Drama        0    1
Romance      1    0
Example: a gender dummy variable coded 0 if male. Its estimated coefficient indicates that female salaries average $3256 below male salaries, other variables held constant.
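Producing the (r - 1) dummy variables is mechanical; a minimal sketch with pandas (assumed installed) for the three-category table above:

```python
# (r - 1) dummy variables for a 3-level qualitative variable.
import pandas as pd

categories = pd.Series(["Adventure", "Drama", "Romance", "Drama"])
dummies = pd.get_dummies(categories, drop_first=True)  # r - 1 = 2 columns
print(dummies)  # Adventure is the baseline: both dummies equal 0
```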
Interactions between Quantitative and Qualitative Variables: Shifting Slopes

[Figure: for $X_2 = 0$ the line has intercept $b_0$ and slope $b_1$; for $X_2 = 1$ it has intercept $b_0 + b_2$ and slope $b_1 + b_3$.]
A regression with interaction between a quantitative variable ($X_1$) and a qualitative variable ($X_2$):

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$
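The interaction term is simply an extra column, the product $x_1 x_2$, in the design matrix. A minimal sketch with synthetic, illustrative data (statsmodels assumed installed):

```python
# The slope on x1 shifts by b3 when the dummy x2 equals 1.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 100)
x2 = rng.integers(0, 2, 100).astype(float)            # 0/1 dummy
y = 2 + 1.5 * x1 + 3 * x2 + 0.8 * x1 * x2 + rng.normal(0, 1, 100)

X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # ≈ [2, 1.5, 3, 0.8]
```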
[Figure: one-variable polynomial regression curves plotted against $X_1$:
$\hat{y} = b_0 + b_1 X$ (straight line);
$\hat{y} = b_0 + b_1 X + b_2 X^2$ (parabola, opening according to the sign of $b_2$);
$\hat{y} = b_0 + b_1 X + b_2 X^2 + b_3 X^3$.]
Polynomial Regression: Example 11-5
Polynomial Regression: Other Variables and Cross-Product Terms
Variable   Estimate   Standard Error   T-statistic
X1         2.34       0.92             2.54
X2         3.11       1.05             2.96
X1^2       4.22       1.00             4.22
X2^2       3.57       2.12             1.68
X1X2       2.77       2.30             1.20
Transformations: Exponential Model
The exponential model:

$Y = \beta_0 e^{\beta_1 X} \varepsilon$

Taking logarithms gives a model that is linear in X:

$\log Y = \log \beta_0 + \beta_1 X + \log \varepsilon$
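Because the log transform linearizes the model, it can be estimated by ordinary least squares on log(Y). A minimal sketch with synthetic, illustrative data (statsmodels assumed installed):

```python
# Fit the exponential model by OLS on log(Y) with multiplicative error.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.3 * x) * np.exp(rng.normal(0, 0.1, 50))

fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()
log_b0, b1 = fit.params
print(np.exp(log_b0), b1)  # ≈ 2.0 and 0.3
```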
Plots of Transformed Variables

[Figure: four panels. Simple Regression of Sales on Advertising: SALES vs ADVERT with fitted line Y = 3.66825 + 6.784X, R-Squared = 0.978; LOGSALE vs LOGADV with fitted line Y = 1.70082 + 0.553136X, R-Squared = 0.947; residual plots of RESIDS against Y-HAT and against LOGADV.]
Variance Stabilizing Transformations

Square root transformation, $Y' = \sqrt{Y}$: useful when the variance of the regression errors is approximately proportional to the conditional mean of Y.

Logarithmic transformation, $Y' = \log(Y)$: useful when the variance of the regression errors is approximately proportional to the square of the conditional mean of Y.

Reciprocal transformation, $Y' = 1/Y$: useful when the variance of the regression errors is approximately proportional to the fourth power of the conditional mean of Y.
Logistic Function: the logit transformation, $\log\dfrac{p}{1-p}$.
11-11: Multicollinearity
[Figure: three diagrams of the information carried by $x_1$ and $x_2$.]

Orthogonal X variables provide information from independent sources. No multicollinearity.

Some degree of collinearity. Problems with regression depend on the degree of collinearity.

Perfectly collinear X variables provide identical information content. No regression.
Effects of Multicollinearity
- Variances of regression coefficients are inflated.
- Magnitudes of regression coefficients may be different from what is expected.
- Signs of regression coefficients may not be as expected.
- Adding or removing variables produces large changes in coefficients.
- Removing a data point may cause large changes in coefficient estimates or signs.
- In some cases, the F ratio may be significant while the t ratios are not.
[Figure: the variance inflation factor, $VIF = \dfrac{1}{1 - R_h^2}$, plotted against $R_h^2$; it climbs from 1 toward 100 as $R_h^2$ rises from 0.0 to 1.0, where $R_h^2$ is the $R^2$ of regressing $X_h$ on the other independent variables.]
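A minimal sketch computing VIFs from this definition, by regressing each explanatory variable on the others (data are illustrative):

```python
# VIF for each column of X by auxiliary regressions (no constant column in X).
import numpy as np

def vif(X):
    n, p = X.shape
    out = []
    for h in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, h, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, h], rcond=None)
        resid = X[:, h] - others @ beta
        r2 = 1 - resid.var() / X[:, h].var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.3, size=100)  # collinear with x1
print(vif(np.column_stack([x1, x2])))      # both VIFs well above 1
```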
Solutions to the Multicollinearity Problem
- Drop a collinear variable from the regression.
- Change the sampling plan to include elements outside the multicollinearity range.
- Transformations of variables.
- Ridge regression.
11-12 Residual Autocorrelation and the Durbin-Watson Test
An autocorrelation is a correlation of the values of a variable with values of the same variable lagged one or more periods back.
Lagged Residuals

 i    e_i    e_(i-1)   e_(i-2)   e_(i-3)   e_(i-4)
 1    1.0    *         *         *         *
 2    0.0    1.0       *         *         *
 3   -1.0    0.0       1.0       *         *
 4    2.0   -1.0       0.0       1.0       *
 5    3.0    2.0      -1.0       0.0       1.0
 6   -2.0    3.0       2.0      -1.0       0.0
 7    1.0   -2.0       3.0       2.0      -1.0
 8    1.5    1.0      -2.0       3.0       2.0
 9    1.0    1.5       1.0      -2.0       3.0
10   -2.5    1.0       1.5       1.0      -2.0
The Durbin-Watson test (first-order autocorrelation):

$H_0: \rho_1 = 0$
$H_1: \rho_1 \neq 0$

The Durbin-Watson test statistic:

$d = \dfrac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$
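A minimal sketch computing d for the ten residuals in the lagged-residuals table above:

```python
# Durbin-Watson statistic from a residual series.
import numpy as np

e = np.array([1.0, 0.0, -1.0, 2.0, 3.0, -2.0, 1.0, 1.5, 1.0, -2.5])
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(d)  # values near 2 suggest no first-order autocorrelation
```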
Critical points of the Durbin-Watson statistic (dL and dU for k = 1, ..., 5):

  n    k=1: dL  dU    k=2: dL  dU    k=3: dL  dU    k=4: dL  dU    k=5: dL  dU
 15       1.08 1.36      0.95 1.54      0.82 1.75      0.69 1.97      0.56 2.21
 16       1.10 1.37      0.98 1.54      0.86 1.73      0.74 1.93      0.62 2.15
 17       1.13 1.38      1.02 1.54      0.90 1.71      0.78 1.90      0.67 2.10
 18       1.16 1.39      1.05 1.53      0.93 1.69      0.82 1.87      0.71 2.06
  .        ...            ...            ...            ...            ...
 65       1.57 1.63      1.54 1.66      1.50 1.70      1.47 1.73      1.44 1.77
 70       1.58 1.64      1.55 1.67      1.52 1.70      1.49 1.74      1.46 1.77
 75       1.60 1.65      1.57 1.68      1.54 1.71      1.51 1.74      1.49 1.77
 80       1.61 1.66      1.59 1.69      1.56 1.72      1.53 1.74      1.51 1.77
 85       1.62 1.67      1.60 1.70      1.57 1.72      1.55 1.75      1.52 1.77
 90       1.63 1.68      1.61 1.70      1.59 1.73      1.57 1.75      1.54 1.78
 95       1.64 1.69      1.62 1.71      1.60 1.73      1.58 1.75      1.56 1.78
100       1.65 1.69      1.63 1.72      1.61 1.74      1.59 1.76      1.57 1.78
Regions for the Durbin-Watson test:

0 to dL: Positive Autocorrelation
dL to dU: Test is Inconclusive
dU to 4-dU: No Autocorrelation
4-dU to 4-dL: Test is Inconclusive
4-dL to 4: Negative Autocorrelation

For n = 67, k = 4: dU ≈ 1.73, so 4 - dU ≈ 2.27; dL ≈ 1.47, so 4 - dL ≈ 2.53 < 2.58. Since d = 2.58 exceeds 4 - dL, H0 is rejected, and we conclude there is negative first-order autocorrelation.
Full model:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$

Reduced model:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$

Partial F test:

$H_0: \beta_3 = \beta_4 = 0$
$H_1: \beta_3$ and $\beta_4$ not both 0
Partial F statistic:

$F_{(r,\, n-(k+1))} = \dfrac{(SSE_R - SSE_F)/r}{MSE_F}$

where $SSE_R$ is the sum of squared errors of the reduced model, $SSE_F$ is the sum of squared errors of the full model, $MSE_F$ is the mean square error of the full model [$MSE_F = SSE_F/(n-(k+1))$], and r is the number of variables dropped from the full model.
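A minimal sketch of the computation; the SSE values passed in are placeholders, not figures from the text:

```python
# Partial F statistic with r and n-(k+1) degrees of freedom;
# k counts the explanatory variables in the full model.
def partial_f(sse_reduced, sse_full, r, n, k):
    mse_full = sse_full / (n - (k + 1))
    return ((sse_reduced - sse_full) / r) / mse_full

print(partial_f(sse_reduced=3.1, sse_full=2.2, r=2, n=67, k=4))
```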
All possible regressions: run regressions with all possible combinations of independent variables and select the best model.
A p-value of 0.001
indicates that we
should reject the null
hypothesis H0: the
slopes for Lend and
Exch. are zero.
Forward selection: add one variable at a time to the model, on the basis of its F statistic.

Backward elimination: remove one variable at a time, on the basis of its F statistic.

Stepwise regression: adds variables to the model and subtracts variables from the model, on the basis of the F statistic.
Stepwise Regression
[Flowchart:]
1. Compute the F statistic for each variable not in the model.
2. Is there at least one variable with p-value < P_in? If no, stop.
3. If yes, enter the most significant (smallest p-value) variable into the model, then check the variables already in the model and remove any that no longer meet the removal criterion; return to step 1.

(The Minitab run below uses F-to-Enter: 4.00 and F-to-Remove: 4.00.)
Response is EXPORTS on 4 predictors, with N = 67

Step         1        2
Constant     0.9348  -3.4230
M1           0.520    0.361
T-Ratio      9.89     9.21
PRICE                 0.0370
T-Ratio               9.05
S            0.495    0.331
R-Sq        60.08    82.48
Predictor   Coef       Stdev       t-ratio   p
Constant    -4.015     2.766       -1.45     0.152
M1           0.36846   0.06385      5.77     0.000
LEND         0.00470   0.04922      0.10     0.924
PRICE        0.036511  0.009326     3.91     0.000
EXCHANGE     0.268     1.175        0.23     0.820

R-sq = 82.5%   R-sq(adj) = 81.4%
Analysis of Variance

SOURCE       DF   SS        MS       F       p
Regression    4   32.9463   8.2366   73.06   0.000
Variable    DF   Parameter Estimate   Standard Error   T for H0: Parameter=0   Prob > |T|   Variance Inflation
INTERCEP     1   -4.015461            2.76640057       -1.452                  0.1517       0.00000000
M1           1    0.368456            0.06384841        5.771                  0.0001       3.20719533
LEND         1    0.004702            0.04922186        0.096                  0.9242       5.35391367
PRICE        1    0.036511            0.00932601        3.915                  0.0002       6.28873181
EXCHANGE     1    0.267896            1.17544016        0.228                  0.8205       1.38570639

Durbin-Watson D: 2.583   (For Number of Obs.): 67   1st Order Autocorrelation: -0.321
The data in matrix form:

$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ 1 & x_{31} & x_{32} & \cdots & x_{3k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}$
The estimated regression model:

$\mathbf{Y} = \mathbf{X}\mathbf{b} + \mathbf{e}$

Predicted values:

$\hat{\mathbf{Y}} = \mathbf{X}\mathbf{b} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{H}\mathbf{Y}$

$V(\mathbf{b}) = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}$, estimated by $s^2(\mathbf{b}) = MSE\,(\mathbf{X}'\mathbf{X})^{-1}$
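A minimal sketch evaluating these matrix formulas on the Example 11-1 data (NumPy assumed available):

```python
# b = (X'X)^{-1} X'y, the hat matrix H, and the estimated covariance of b.
import numpy as np

y = np.array([72, 76, 78, 70, 68, 80, 82, 65, 62, 90], dtype=float)
X = np.column_stack([
    np.ones(10),
    [12, 11, 15, 10, 11, 16, 14, 8, 8, 18],
    [5, 8, 6, 5, 3, 9, 12, 4, 3, 10],
])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y          # ≈ [47.1649, 1.5990, 1.1487]
H = X @ XtX_inv @ X.T          # hat matrix: y_hat = H @ y
e = y - H @ y
mse = (e @ e) / (len(y) - X.shape[1])
s2_b = mse * XtX_inv           # estimated covariance matrix of b
print(b)
```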