
Regression Models Notes

Ramesh Kadambi
June 26, 2014
Contents
1 Some Statistical Terms
1.1 Data and the Mean
1.2 Data and the Variance
1.3 Normalization
1.4 Covariance
1.5 Correlation
1.5.1 Facts About Correlation
2 Linear Regression
2.1 Regression Through Origin (RTO)
2.2 General Regression: Fitting the Best Line
2.3 Consequences of Linear Regression
2.4 Regression to the mean
2.5 Regression Model with Additive Gaussian Noise
2.6 Interpretation of Regression Coefficients
2.6.1 Interpreting the Intercept
2.6.2 Interpreting the Slope
2.7 Example: Working with Diamond Prices
2.8 Residuals and Residual Variation
2.8.1 Properties of Residuals
2.9 Nonlinear Data and Linear Regression
2.10 Data with changing variance
Chapter 1
Some Statistical Terms
1.1 Data and the Mean
Given a set of random data points $X = \{x_i : i = 1 \text{ to } n\}$, one would like to find a number corresponding to the middle of the data points. The middle, as intuition would tell us, is the point that is at the shortest distance from all the points in our data set, i.e. a point $x$ that minimizes $\sum_{i=1}^{n}(x_i - x)^2$. Minimizing the error function we have,
\[
\frac{d}{dx}\sum_{i=1}^{n}(x_i - x)^2 = 0
\;\Rightarrow\; -2\sum_{i=1}^{n}(x_i - x) = 0
\;\Rightarrow\; nx = \sum_{i=1}^{n} x_i
\;\Rightarrow\; x = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}
\]
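As a quick numerical check, here is a minimal R sketch (the data are made up for illustration) verifying that the sample mean minimizes the sum of squared deviations:

set.seed(1)
x <- rnorm(50, mean = 5, sd = 2)          # illustrative data
sse <- function(m) sum((x - m)^2)         # the error function being minimized
opt <- optimize(sse, interval = range(x)) # numerical minimization over the data range
c(opt$minimum, mean(x))                   # the minimizer agrees with the sample mean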
1.2 Data and the Variance
Once we know where the center of the data set lies, we would like to know how the points are distributed around the center. The distribution/dispersion of the data around the mean is given by the variance, or by the square root of the variance, called the standard deviation ($\sigma$). Note that the mean is the value that minimizes the variance. The unbiased estimate of the variance is given by,
\[
S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
\;\Rightarrow\;
S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
  = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right)}
  = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2\right)}
  = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2\right)}
  = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)}
\]
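A short R check (again with made-up data) that the built-in var and sd use this unbiased 1/(n-1) form:

x <- c(2.1, 3.4, 1.8, 4.0, 2.7)           # illustrative data
n <- length(x)
s2 <- sum((x - mean(x))^2) / (n - 1)      # unbiased variance computed by hand
c(s2, var(x))                             # matches var()
c(sqrt(s2), sd(x))                        # matches sd()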
1.3 Normalization
Transforming the given data by subtracting the mean and dividing by the standard deviation is called normalization. Normalizing leaves the data with a mean of 0 and a standard deviation of 1. Normalized data are in units of standard deviations from the mean: the value of a data point represents the number of standard deviations from the mean at which the point occurred.
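In R this is what scale() does by default; a minimal illustration with made-up data:

x <- c(2.1, 3.4, 1.8, 4.0, 2.7)           # illustrative data
z <- (x - mean(x)) / sd(x)                # normalize by hand
c(mean(z), sd(z))                         # approximately 0 and exactly 1
all.equal(z, as.vector(scale(x)))         # scale() centers and rescales the same way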
1.4 Covariance
Given two sets of random data, $X = \{x_i : i = 1 \text{ to } n\}$ and $Y = \{y_i : i = 1 \text{ to } n\}$, the covariance is defined to measure the relationship between the two random variables.
\[
\mathrm{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})
 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i y_i - x_i\bar{y} - \bar{x}y_i + \bar{x}\bar{y}\right)
 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i y_i - 2n\bar{x}\bar{y} + n\bar{x}\bar{y}\right)
 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right)
\]
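A brief R illustration (made-up data) of the shortcut identity above:

x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 2.9, 4.2, 4.8, 6.1)                           # illustrative data
n <- length(x)
by_hand  <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # definition
shortcut <- (sum(x * y) - n * mean(x) * mean(y)) / (n - 1)
c(by_hand, shortcut, cov(x, y))                           # all three agree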
1.5 Correlation
Correlation is a dimensionless quantity defined as the ratio of the covariance to the product of the standard deviations of the two variables.
\[
\mathrm{Cor}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}
\]
1.5.1 Facts About Correlation
1. $\mathrm{Cor}(X, Y) = \mathrm{Cor}(Y, X)$; this follows from the fact that $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$.
2. $-1 \le \mathrm{Cor}(X, Y) \le 1$.
3. $\mathrm{Cor}(X, Y) = 1$ or $\mathrm{Cor}(X, Y) = -1$ only when the data $X$ and $Y$ fall perfectly on a positively sloped or a negatively sloped line, respectively.
4. $\mathrm{Cor}(X, Y)$ measures the strength of the linear relationship between $X$ and $Y$; the relationship is stronger as $\mathrm{Cor}(X, Y)$ approaches $-1$ or $1$.
5. $\mathrm{Cor}(X, Y) = 0$ implies no linear relationship.
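A quick R illustration (made-up data) of these facts:

x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 2.9, 4.2, 4.8, 6.1)           # illustrative data
cov(x, y) / (sd(x) * sd(y))               # same value as cor(x, y)
c(cor(x, y), cor(y, x))                   # symmetric (fact 1)
cor(x, 2 * x + 3)                         # equals 1 for points on a positively sloped line (fact 3)
cor(x, -2 * x + 3)                        # equals -1 for points on a negatively sloped line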
Chapter 2
Linear Regression
2.1 Regression Through Origin (RTO)
Given two sets of random data $X = \{x_i : i = 1 \text{ to } n\}$ and $Y = \{y_i : i = 1 \text{ to } n\}$, we would like to use $x_i$ to predict the value of $y_i$. The idea is to find out whether there is a relationship between the given random variables. We will find that the relationship depends on the correlation of the two random variates. Our objective is to find the $\beta$ that minimizes
\[
\sum_{i=1}^{n}(y_i - \beta x_i)^2.
\]
Minimizing the error function we have,
\[
\frac{d}{d\beta}\sum_{i=1}^{n}(y_i - \beta x_i)^2 = 0
\;\Rightarrow\; -2\sum_{i=1}^{n}(y_i - \beta x_i)x_i = 0
\;\Rightarrow\; \beta\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i
\;\Rightarrow\; \beta = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}
\]
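A minimal R sketch (made-up data) of regression through the origin; the explicit ratio matches lm() with the intercept suppressed:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)          # illustrative data, roughly y = 2x
beta_rto <- sum(x * y) / sum(x^2)         # closed-form slope through the origin
beta_rto
coef(lm(y ~ x - 1))                       # lm() without an intercept gives the same slope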
2.2 General Regression: Fitting the Best Line
Given two sets of random variates $X = \{x_i : i = 1 \text{ to } n\}$ and $Y = \{y_i : i = 1 \text{ to } n\}$, we would like to use $x_i$ to predict the value of $y_i$ as before. Unlike RTO, here we fit a straight line with an intercept. Our model function is of the form $y = \beta_0 + \beta_1 x$. As before we would like to minimize the error,
\[
\sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2.
\]
Solving the multivariate minimization problem we have,
\[
\frac{\partial}{\partial\beta_0}\sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2 = 0,
\qquad
\frac{\partial}{\partial\beta_1}\sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2 = 0.
\]
Carrying out the minimization with respect to $\beta_0$ we have,
\[
-2\sum_{i=1}^{n}\bigl(y_i - \beta_0 - \beta_1 x_i\bigr) = 0
\;\Rightarrow\;
n\beta_0 = \sum_{i=1}^{n} y_i - \beta_1\sum_{i=1}^{n} x_i
\;\Rightarrow\;
\beta_0 = \frac{1}{n}\left(\sum_{i=1}^{n} y_i - \beta_1\sum_{i=1}^{n} x_i\right)
\;\Rightarrow\;
\beta_0 = \bar{y} - \beta_1\bar{x}.
\]
Similarly, solving for $\beta_1$ we have,
\[
-2\sum_{i=1}^{n}\bigl(y_i - \beta_0 - \beta_1 x_i\bigr)x_i = 0
\;\Rightarrow\;
\sum_{i=1}^{n} x_i y_i - \beta_0\sum_{i=1}^{n} x_i - \beta_1\sum_{i=1}^{n} x_i^2 = 0.
\]
Multiplying by $\frac{1}{n}$ and substituting for $\beta_0$ we have
\[
\frac{1}{n}\left(\sum_{i=1}^{n} x_i y_i - (\bar{y} - \beta_1\bar{x})\sum_{i=1}^{n} x_i - \beta_1\sum_{i=1}^{n} x_i^2\right) = 0
\]
\[
\frac{1}{n}\left(\sum_{i=1}^{n} x_i y_i - (\bar{y} - \beta_1\bar{x})\,n\bar{x} - \beta_1\sum_{i=1}^{n} x_i^2\right) = 0
\]
\[
\frac{1}{n}\left(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right) - \beta_1\,\frac{1}{n}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) = 0
\]
\[
\beta_1 = \frac{\mathrm{Cov}(X, Y)}{\sigma_x^2}
        = \frac{\mathrm{Cov}(X, Y)}{\sigma_x\sigma_y}\cdot\frac{\sigma_y}{\sigma_x}
        = \rho_{xy}\,\frac{\sigma_y}{\sigma_x}
\]

Table 1: Summary of Regression
Regression through Origin. Model: $y = \beta x$, with $\beta = \sum_{i=1}^{n} x_i y_i \big/ \sum_{i=1}^{n} x_i^2$.
Regression with Intercept. Model: $y = \beta_0 + \beta_1 x$, with $\beta_0 = \bar{y} - \beta_1\bar{x}$ and $\beta_1 = \frac{\sigma_y}{\sigma_x}\,\rho_{xy}$.
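A short R sketch (made-up data) confirming the closed-form coefficients against lm():

x <- c(1, 2, 3, 4, 5)
y <- c(2.3, 2.9, 4.4, 4.9, 6.0)           # illustrative data
b1 <- cor(x, y) * sd(y) / sd(x)           # slope: rho_xy * sigma_y / sigma_x
b0 <- mean(y) - b1 * mean(x)              # intercept: ybar - b1 * xbar
c(b0, b1)
coef(lm(y ~ x))                           # lm() returns the same values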
2.3 Consequences of Linear Regression
Given the linear model $y = \beta_0 + \beta_1 x$ subject to the minimization of $\sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$, we have:

1. $\beta_1$ has the units of $\frac{Y}{X}$, and $\beta_0$ has the units of $Y$.
2. The regression line passes through $(\bar{x}, \bar{y})$, which is clear from the fact that if $x = \bar{x}$ then $y = \bar{y} - \beta_1\bar{x} + \beta_1\bar{x} = \bar{y}$.
3. The slope is the same as that obtained by fitting the line through the origin using centered (demeaned) data, i.e. using $x_i - \bar{x}$ as the predictor.
4. Flipping the predictor and the response just changes the slope by switching the ratio of the standard deviations, i.e. $\frac{\sigma_y}{\sigma_x}$ to $\frac{\sigma_x}{\sigma_y}$, or vice versa.
5. If the data are normalized, $\left\{\frac{x_i - \bar{x}}{\sigma_x}, \frac{y_i - \bar{y}}{\sigma_y}\right\}$, the slope is the correlation between the random variates (see the sketch after this list).
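A minimal R check (made-up data) of facts 3 and 5:

x <- c(1, 2, 3, 4, 5)
y <- c(2.3, 2.9, 4.4, 4.9, 6.0)                  # illustrative data
coef(lm(y ~ x))[2]                               # slope of the full model with intercept
coef(lm(y ~ I(x - mean(x)) - 1))                 # same slope: origin fit with the centered predictor (fact 3)
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)
c(coef(lm(zy ~ zx))[2], cor(x, y))               # slope on normalized data equals the correlation (fact 5)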
2.4 Regression to the mean
Regression to the mean is a phenomenon observed in linear regression, and it can be generalized to any data with a bivariate distribution. (The hat notation indicates estimated values: the observed values are taken as given, while the model coefficients are estimates.) Given the model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, we have
\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
\;\Rightarrow\;
\hat{y} = \bar{y} - \hat{\beta}_1\bar{x} + \hat{\beta}_1 x
\;\Rightarrow\;
\hat{y} - \bar{y} = \hat{\beta}_1(x - \bar{x})
\;\Rightarrow\;
\frac{\hat{y} - \bar{y}}{\sigma_y} = \rho_{xy}\,\frac{x - \bar{x}}{\sigma_x}
\tag{2.4.1}
\]
In (2.4.1) we see that if $-1 < \rho_{xy} < 1$, then $\left|\frac{\hat{y} - \bar{y}}{\sigma_y}\right| < \left|\frac{x - \bar{x}}{\sigma_x}\right|$. The predicted (fitted) standardized value of $y$ is closer to its mean than the standardized value of $x$ is to its mean. In terms of probability, $P(Y < y \mid X = x)$ gets bigger as $x$ heads to very large values; similarly, $P(Y > y \mid X = x)$ gets bigger as $x$ heads to very small values.
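A small simulated illustration in R (the setup is made up, not taken from the notes): with correlation around 0.5, the fitted standardized response sits about half as far from its mean as the standardized predictor does.

set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 0.5 * x + rnorm(n, sd = sqrt(1 - 0.5^2))   # standardized pair with correlation near 0.5
fit <- lm(y ~ x)
coef(fit)[2]                                    # slope is approximately cor(x, y), about 0.5
predict(fit, newdata = data.frame(x = 2))       # for x two sd above its mean, the prediction
                                                # is only about one sd above the mean of y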
2.5 Regression Model with Additive Gaussian Noise
Given two random variates $X, Y$, linear regression builds a model of the form $y = \beta_0 + \beta_1 x$. The model above computes the coefficients of the linear model by formulating a mathematical problem and minimizing an error function. A statistical formulation of linear regression would include an error term that is random, and the estimates of the coefficients are then maximum likelihood estimates. Our estimated model would be:
\[
y = \hat{\beta}_0 + \hat{\beta}_1 x + \epsilon
\]
Here,

1. the $\epsilon$ are assumed to be iid $N(0, \sigma^2)$;
2. $E[Y = y_i \mid X = x_i] = \mu_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$;
3. $\mathrm{Var}(Y = y_i \mid X = x_i) = \sigma^2$.

It would be interesting to dwell on this for a moment. We are given two random variates $\{X, Y\}$. It is our belief that $Y$ and $X$ are related and that there exists a linear function $f: X \to Y$. For a given value of $X = x_i$, the values $y_i$ are random and have an expected value of $\hat{\beta}_0 + \hat{\beta}_1 x_i$ and a variance of $\sigma^2$. Graphically it would look as below.

Figure 1: Plot of $y_i$ for $x = 0.17$, i.e. prices of 0.17-carat diamonds from data(diamond) in the UsingR package.

Our objective is to estimate the parameter set $\theta = \{\hat{\beta}_0, \hat{\beta}_1\}$, given that the values $y_i$ carry an additive noise $\epsilon \sim N(0, \sigma^2)$. From the relation $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \epsilon_i$, we see that the $y_i$ are distributed with the density function $N(\mu_i, \sigma^2)$, where $\mu_i = E[y_i] = \hat{\beta}_0 + \hat{\beta}_1 x_i$. Since we assume the $\epsilon_i$ are iid, the $y_i$ are independent, though not identically distributed (their means $\mu_i$ differ). The joint density of the $y_i$ is given by,
\[
L(\hat{\beta}_0, \hat{\beta}_1) = \prod_{i=1}^{n} N(\mu_i, \sigma^2)
 = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i - \mu_i)^2}
\]
Simplifying the log-likelihood function we have,
\[
\log L(\hat{\beta}_0, \hat{\beta}_1)
 = \sum_{i=1}^{n}\log\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i - \mu_i)^2}\right)
 = \sum_{i=1}^{n}\left[-\tfrac{1}{2}\bigl(\log 2 + \log\pi + 2\log\sigma\bigr) - \frac{1}{2\sigma^2}(y_i - \mu_i)^2\right]
 = c - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu_i)^2
\]
where $c = -\frac{n}{2}\log(2\pi)$ collects the constant terms. The parameters are arrived at by maximizing the likelihood function $L(\hat{\beta}_0, \hat{\beta}_1)$. For fixed $\sigma$, the function we are maximizing is the same as the negative of the function we minimized for least squares regression, i.e. $\sum_{i=1}^{n}(y_i - \mu_i)^2$. The estimate under the assumption of Gaussian errors is therefore the same as the estimate for linear least squares.
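A minimal R sketch (made-up data; the optim() setup is an assumption, not taken from the notes) showing that maximizing the Gaussian log-likelihood recovers the least squares coefficients:

set.seed(7)
x <- runif(30, 0, 1)
y <- 1 + 2 * x + rnorm(30, sd = 0.3)        # illustrative data
negloglik <- function(b) {
  mu <- b[1] + b[2] * x                     # b = (beta0, beta1, log sigma)
  -sum(dnorm(y, mean = mu, sd = exp(b[3]), log = TRUE))
}
mle <- optim(c(0, 0, 0), negloglik)$par     # minimize the negative log-likelihood
mle[1:2]                                    # MLE of beta0, beta1 (up to numerical tolerance)
coef(lm(y ~ x))                             # least squares gives the same coefficients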
2.6 Interpretation of Regression Coefficients
The coefficients of linear regression can be given an intuitive interpretation.
2.6.1 Interpreting the Intercept
$\hat{\beta}_0$ is the expected value of the response when the predictor has the value 0. This follows from the fact that,
\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \epsilon
\;\Rightarrow\;
E[\hat{y} \mid x = 0] = \hat{\beta}_0
\]
The above interpretation may not be meaningful in cases where $x = 0$ is not a valid value for the predictor. In such cases we would shift or center the predictor to get a proper interpretation:
\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 a + \hat{\beta}_1(x - a) + \epsilon
        = \hat{\beta}_s + \hat{\beta}_1(x - a) + \epsilon
\]
Shifting the data points $x_i$ only shifts the intercept and does not affect the slope. Setting $a = \bar{x}$, the intercept becomes the predicted response for the value $x = \bar{x}$.
2.6.2 Interpreting the Slope
The slope can be interpreted in a couple of different ways:

1. The slope is the expected change in the response for a unit change in the predictor:
\[
E[\hat{y} \mid x = x_i + 1] - E[\hat{y} \mid x = x_i]
 = \hat{\beta}_0 + \hat{\beta}_1(x_i + 1) - \hat{\beta}_0 - \hat{\beta}_1 x_i
 = \hat{\beta}_1
\]
2. Scaling the predictor by a factor $a$ divides the slope by the same factor:
\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \epsilon
        = \hat{\beta}_0 + \frac{\hat{\beta}_1}{a}\,(a x) + \epsilon
\]
so if $a x$ is used as the predictor, the estimated slope is $\hat{\beta}_1 / a$.

The interpretation of the slope is intuitive in the sense that the units of the slope $\hat{\beta}_1$ are $\frac{\text{units of } y}{\text{units of } x}$. If we scale the units of $x$ and use the scaled value as the predictor, the slope of the resulting regression model will be rescaled accordingly when estimated.
2.7 Example: Working with Diamond Prices

This example uses data(diamond) from the UsingR package.
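The helper functions lrwithintercept() and predictlm() used below are the author's own and their definitions are not included in these notes. The following is only a sketch of what they might look like, consistent with the fields printed in the output ($x, $y, $intercept, $slope and, for the residual plots, $e):

lrwithintercept <- function(x, y) {
  slope     <- cor(x, y) * sd(y) / sd(x)      # beta1 = rho_xy * sigma_y / sigma_x
  intercept <- mean(y) - slope * mean(x)      # beta0 = ybar - beta1 * xbar
  e         <- y - (intercept + slope * x)    # residuals
  list(x = x, y = y, intercept = intercept, slope = slope, e = e)
}
predictlm <- function(fit, newx) fit$intercept + fit$slope * newx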
The regression results are given below,
> lrwithintercept(diamond$carat, diamond$price)
$x
[1] 0.17 0.16 0.17 0.18 0.25 0.16 0.15
[8] 0.19 0.21 0.15 0.18 0.28 0.16 0.20
[15] 0.23 0.29 0.12 0.26 0.25 0.27 0.18
[22] 0.16 0.17 0.16 0.17 0.18 0.17 0.18
[29] 0.17 0.15 0.17 0.32 0.32 0.15 0.16
[36] 0.16 0.23 0.23 0.17 0.33 0.25 0.35
[43] 0.18 0.25 0.25 0.15 0.26 0.15
$y
[1] 355 328 350 325 642 342 322
[8] 485 483 323 462 823 336 498
[15] 595 860 223 663 750 720 468
[22] 345 352 332 353 438 318 419
[29] 346 315 350 918 919 298 339
[36] 338 595 553 345 945 655 1086
[43] 443 678 675 287 693 316
$intercept
[1] -259.6259
$slope
[1] 3721.025
Figure 2: Regression of diamond weight vs price (diamond$carat on the x-axis, diamond$price on the y-axis).
The intercept says that a zero-carat diamond is worth -259 dollars, which does not make sense, since there is no such diamond. Centering the diamond weights at the mean we have the following,
> res <- lrwithintercept(diamond$carat
- mean(diamond$carat), diamond$price)
> res
$x
[1] -0.034166667 -0.044166667
[3] -0.034166667 -0.024166667
[5] 0.045833333 -0.044166667
[7] -0.054166667 -0.014166667
[9] 0.005833333 -0.054166667
[11] -0.024166667 0.075833333
[13] -0.044166667 -0.004166667
[15] 0.025833333 0.085833333
[17] -0.084166667 0.055833333
[19] 0.045833333 0.065833333
[21] -0.024166667 -0.044166667
[23] -0.034166667 -0.044166667
[25] -0.034166667 -0.024166667
[27] -0.034166667 -0.024166667
[29] -0.034166667 -0.054166667
[31] -0.034166667 0.115833333
[33] 0.115833333 -0.054166667
[35] -0.044166667 -0.044166667
[37] 0.025833333 0.025833333
[39] -0.034166667 0.125833333
[41] 0.045833333 0.145833333
[43] -0.024166667 0.045833333
[45] 0.045833333 -0.054166667
[47] 0.055833333 -0.054166667
$y
[1] 355 328 350 325 642 342 322
[8] 485 483 323 462 823 336 498
[15] 595 860 223 663 750 720 468
[22] 345 352 332 353 438 318 419
[29] 346 315 350 918 919 298 339
[36] 338 595 553 345 945 655 1086
[43] 443 678 675 287 693 316
$intercept
[1] 500.0833
$slope
[1] 3721.025
> mean(diamond$carat)
[1] 0.2041667
The intercept of the centered data is 500.0833; according to our interpretation it is the price of a diamond of weight 0.2041667 carat, the mean weight. The intercept is indicated by the red point in Figure 3. This is as expected: translating the line changes the intercept but not the slope, since we have just moved the line along the x-axis so that the origin corresponds to the mean size of the diamonds.
Figure 3: Centered regression and intercept interpretation (res$x on the x-axis, res$y on the y-axis).
The slope of the regression line represents the change in price for a unit change in the size of the diamond. Our slope indicates that a 1-carat increase in the size of the diamond raises the price by 3721 dollars. Scaling the size of the diamonds to tenths of a carat, we have the following result,
> res <- lrwithintercept(diamond$carat * 10,
diamond$price)
> res
$x
[1] 1.7 1.6 1.7 1.8 2.5 1.6 1.5 1.9 2.1
[10] 1.5 1.8 2.8 1.6 2.0 2.3 2.9 1.2 2.6
[19] 2.5 2.7 1.8 1.6 1.7 1.6 1.7 1.8 1.7
[28] 1.8 1.7 1.5 1.7 3.2 3.2 1.5 1.6 1.6
[37] 2.3 2.3 1.7 3.3 2.5 3.5 1.8 2.5 2.5
[46] 1.5 2.6 1.5
$y
[1] 355 328 350 325 642 342 322
[8] 485 483 323 462 823 336 498
[15] 595 860 223 663 750 720 468
[22] 345 352 332 353 438 318 419
[29] 346 315 350 918 919 298 339
[36] 338 595 553 345 945 655 1086
[43] 443 678 675 287 693 316
$intercept
[1] -259.6259
$slope
[1] 372.1025
As expected, scaling the size of the diamonds by a factor of 10 (the same as changing the unit of measurement to tenths of a carat) reduces the slope to 372, now in dollars per tenth of a carat. To predict the value of a diamond of a given size, we just apply the slope and the intercept from the regression:
> predictlm(res, 10 * c(.16,.27,.34))
[1] 335.7381 745.0508 1005.5225
Figure 4: Scaled regression with predictions (res$x on the x-axis, res$y on the y-axis).
2.8 Residuals and Residual Variation
Our model $y = \beta_0 + \beta_1 x + \epsilon$ is expected to represent the observed values $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$. However, we do not expect to predict the values exactly. What we hope to predict is some kind of average value of the observed variable at a given value of the predictor variable. Our regression estimate $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ provides an estimate of the observed values $y_i$. We obtain the error $e_i = y_i - \hat{y}_i$. Note that the $e_i$ are not the same as the $\epsilon_i$. The $e_i$ are called the residuals. Least squares regression minimizes $\sum_{i=1}^{n} e_i^2$.
2.8.1 Properties of Residuals
1. The expected value of the residuals is zero, i.e. $E[e_i] = 0$.
2. If an intercept is included then $\sum_{i=1}^{n} e_i = 0$ (verified numerically in the sketch after this list). This indicates that the residuals are not independent. This can be generalized: if we fit $p$ parameters in the linear model, then knowledge of $n - p$ residuals is sufficient to compute the remaining $p$ residuals.
3. If a regressor variable $x_i$ is included in the model, then $\sum_{i=1}^{n} x_i e_i = 0$.
4. Residuals are useful for investigating poor model fit.
5. Positive residuals are above the regression line and negative residuals are below it.
6. Residuals can be thought of as the outcome with the linear association of the predictor removed.
7. One differentiates residual variation (variation after the predictor is removed) from the systematic variation (variation explained by the regression model).
8. Residual plots highlight poor model fit.
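A minimal numerical check in R (made-up data) of properties 1-3:

x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.8, 3.1, 3.9, 5.2, 5.8, 7.1)      # illustrative data
fit <- lm(y ~ x)
e <- resid(fit)
mean(e)                                   # ~0 (property 1)
sum(e)                                    # ~0 when an intercept is included (property 2)
sum(x * e)                                # ~0 when x is included as a regressor (property 3)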
Figure 5: Plot of residuals (res$x on the x-axis, res$e on the y-axis).
2.9 Nonlinear Data and Linear Regression
We will generate random non-linear data as,
x <- runif(100, -3, 3);
y <- x + sin(x) + rnorm(100, sd = .2)
The plots of the regression and of the residuals are shown below. One can clearly see a pattern in the residuals. In such situations, the data can be transformed to obtain a linear relationship; a brief illustration follows the residual plot.
Figure 6: Plot of nonlinear data (x on the x-axis, y on the y-axis).
Figure 7: Plot of residuals for the nonlinear data (x on the x-axis, res_sinsim$e on the y-axis).
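Because the generating function is known here, one simple way to remove the pattern (shown only as an illustration, not as a general prescription) is to include the nonlinear term as a regressor, reusing the x and y simulated above:

fit1 <- lm(y ~ x)                               # straight-line fit: residuals show the sine pattern
fit2 <- lm(y ~ x + sin(x))                      # include the known nonlinearity
c(summary(fit1)$sigma, summary(fit2)$sigma)     # residual standard error drops sharply
plot(x, resid(fit2))                            # no remaining pattern in the residuals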
2.10 Data with changing variance
We will generate random data that is heteroskedastic using the following code,
> x <- c(seq(-3, 3, 1), seq(-3, 3, 1), seq(-3, 3, 1), seq(-3, 3, 1), seq(-3, 3, 1))
> y <- x + rnorm(length(x), sd = abs(.2 + x))   # noise sd varies with x
Below are the plots of the regression as well as the residuals. As seen from the residual plot, the residuals do not have a constant variance.
Figure 8: Plot of heteroscedastic data (x on the x-axis, y on the y-axis).
Figure 9: Plot of heteroscedastic residuals (x on the x-axis, res_sinsim$e on the y-axis).
A better illustration is to plot a sample of the $y_i$ against $x_i$ at a couple of different points, together with the corresponding mean and standard deviation. The plot is shown below; the points in green and red mark the mean and the standard deviation. As seen, the $y_i$ do not have a constant variance across the $x_i$.
> sdy
[1] 0.2887582 3.4365632
> meany
[1] -0.002637026 2.713918444
Figure 10: Illustration of heteroskedasticity (xl on the x-axis, yl on the y-axis).