
Linear Statistical Analysis I

Fall 2013
Outline
Simple Linear Regression
Ordinary Least Squares Estimation
Simple Linear Regression
There is only one predictor X and one response Y.
Example (Heights data): There are 1375 mother-daughter pairs.
The heights of each pair were measured.
> heights=read.table("heights.txt",header=T)
> heights
Mheight Dheight
1 59.7 55.1
2 58.2 56.5
3 60.6 56.0
...................
...................
1374 70.8 71.0
1375 63.0 73.1
Simple Linear Regression
A natural question is whether a mother's height affects her
daughter's height. If so, can we predict a daughter's height from
her mother's height? Hence,
X = mother's height, Y = daughter's height
We examine the relationship between the two variables visually.
> X=heights$Mheight
> Y=heights$Dheight
> plot(X,Y, xlab="Mother's heights", ylab="Daughter's heights")
[Scatter plot of Daughter's heights against Mother's heights, both ranging from about 55 to 70.]
There is a linear trend between X and Y; that is, the points are
distributed around the blue line.
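A line like the blue one can be added with abline; a minimal sketch, assuming the least squares line (the command used for the original figure is not shown):

> plot(X,Y, xlab="Mother's heights", ylab="Daughter's heights")
> abline(lm(Y~X), col="blue", lwd=3)   # least squares line through the scatter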
[The same scatter plot with a blue line summarizing the linear trend.]
Simple Linear Regression
It is reasonable to assume that, at least in the region $X \in [55, 70]$,
there is a linear relationship between X and Y, but due to other
factors or random effects, the linear relationship is not exact.
The linear regression model is
$$Y = \beta_0 + \beta_1 X + \varepsilon,$$
where $\beta_0$ and $\beta_1$ are two unknown parameters, the intercept and the
slope, and $\varepsilon$ has mean equal to 0 and variance equal to $\sigma^2$.
In total, we have three parameters: $\beta_0$, $\beta_1$, and $\sigma^2$.
In order to completely determine the model, we have to specify
their values.
Simple Linear Regression
We do not know the true values of the parameters. All the
information we have about the two variables is the data (the
observations).
We can only find values for the parameters based on the data.
This procedure is called estimation of the parameters. The values we
find are called estimates and are denoted by $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\sigma}^2$.
Of course, we hope the estimates are as close to the true values
as possible.
Before I introduce the method of estimation, I show you the effect
of $\sigma^2$ on the patterns in the data.
Simple Linear Regression
We will use simulation to study the effect.
Simulation means that, given a statistical model, we generate
data from the model by using random number generating methods
in statistical software.
Suppose we have a linear regression model:
$$Y = 1 + 2X + \varepsilon,$$
where $X \sim N(5, 5^2)$ and $\varepsilon \sim N(0, \sigma^2)$.
We will simulate data from this model for $\sigma = 0, 1, 5, 10, 30$,
separately.
> X=rnorm(1000,5,5)
> e=rep(0,1000)
> Y=1+2*X+e
> Z=1+2*X
> plot(X,Y)
> lines(X,Z,col="blue",lwd=3)

[Scatter plot for σ = 0: every point lies exactly on the blue line Y = 1 + 2X.]
> X=rnorm(1000,5,5)
> e=rnorm(1000,0,1)
> Y=1+2*X+e
> Z=1+2*X
> plot(X,Y)
> lines(X,Z,col="blue",lwd=3)

[Scatter plot for σ = 1: the points lie in a tight band around the blue line.]
> X=rnorm(1000,5,5)
> e=rnorm(1000,0,5)
> Y=1+2*X+e
> Z=1+2*X
> plot(X,Y)
> lines(X,Z,col="blue",lwd=3)

[Scatter plot for σ = 5: the points scatter more widely around the blue line, but the linear trend is clear.]
> X=rnorm(1000,5,5)
> e=rnorm(1000,0,10)
> Y=1+2*X+e
> Z=1+2*X
> plot(X,Y)
> lines(X,Z,col="blue",lwd=3)

[Scatter plot for σ = 10: the scatter is wide; the linear trend is still visible.]
> X=rnorm(1000,5,5)
> e=rnorm(1000,0,30)
> Y=1+2*X+e
> Z=1+2*X
> plot(X,Y)
> lines(X,Z,col="blue",lwd=3)

[Scatter plot for σ = 30: the scatter is so wide that the linear pattern is hard to see.]
Simple Linear Regression
As $\sigma$ increases, the linear pattern becomes weaker and
eventually disappears.
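All five simulations differ only in the value of $\sigma$, so they can be reproduced with a single loop; a minimal sketch (note that rnorm with sd = 0 returns the mean exactly, which matches e=rep(0,1000) above):

X = rnorm(1000, 5, 5)
Z = 1 + 2*X                       # the true line
for (sigma in c(0, 1, 5, 10, 30)) {
  e = rnorm(1000, 0, sigma)       # sigma = 0 gives e identically 0
  Y = 1 + 2*X + e
  plot(X, Y, main = paste("sigma =", sigma))
  lines(X, Z, col = "blue", lwd = 3)
}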
(ordinary) least squares estimation of parameters
The least squares method goes back to the early 19th century,
when people tried to calculate the orbits of heavenly bodies. The
famous mathematician Carl Friedrich Gauss published this method.
An early demonstration of the strength of the method came when
it was used to predict the future location of the newly discovered
asteroid Ceres.
Roughly speaking, this method estimates the parameters by
minimizing the sum of squared residuals, where a residual is the
difference between the value calculated from the equation and
the observed value.
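For simple linear regression, the criterion can be written explicitly. Given observations $(x_1, y_1), \ldots, (x_n, y_n)$, the least squares estimates minimize
$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.$$
Setting the partial derivatives with respect to $\beta_0$ and $\beta_1$ to zero gives the closed form
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
which is the formula we use together with cov and var in R later.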
least squares estimation of parameters
Why does the least squares method provide good estimates, that
is, estimates that are close to the true values of the parameters?
Although theoretical analysis can demonstrate that this method
finds good estimates, we will not perform the analysis in this
class.
Instead, we will use an example to illustrate the effectiveness of this
method.
Consider the linear regression model:
$$Y = 1 + 2X + \varepsilon,$$
where $X \sim N(5, 5^2)$ and $\varepsilon \sim N(0, 5^2)$.
Hence, the true value of $(\beta_0, \beta_1)$ is $(1, 2)$. We will consider several
different values of $(\beta_0, \beta_1)$, calculate the sum of squares S for each, and
compare them with the sum of squares for the true value.
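The comparisons on the following slides can be organized with a small helper function; a minimal sketch (the name SS is ours, introduced only for illustration):

# Sum of squared residuals for a candidate line b0 + b1*X
SS = function(b0, b1, X, Y) sum((Y - (b0 + b1*X))^2)

X = rnorm(1000, 5, 5)
Y = 1 + 2*X + rnorm(1000, 0, 5)
SS(1, 2, X, Y)     # true values: smallest S among these candidates
SS(0, 2, X, Y)     # wrong intercept
SS(-10, 2, X, Y)   # intercept far off
SS(1, 1, X, Y)     # wrong slope
SS(1, -1, X, Y)    # slope of the wrong sign

Larger departures from the true (1, 2) should give larger sums of squares, as the following slides confirm.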
> X=rnorm(1000,5,5)
> e=rnorm(1000,0,5)
> Y=1+2*X+e
> b0=1
> b1=2
> Z=b0+b1*X
> plot(X,Y)
> lines(X,Z,col="blue",lwd=3)
> S=sum((Y-Z)^2)
> S
[1] 23587.24

[Scatter plot with the true line (b0, b1) = (1, 2) in blue.]
> b0=0
> b1=2
> Z=b0+b1*X
> plot(X,Y)
> lines(X,Z,col="red",lwd=3)
> S=sum((Y-Z)^2)
> S
[1] 24748.12

[Scatter plot with the candidate line (b0, b1) = (0, 2) in red.]
> b0=-10
> b1=2
> Z=b0+b1*X
> plot(X,Y)
> lines(X,Z,col="red",lwd=3)
> S=sum((Y-Z)^2)
> S
[1] 146356.9

[Scatter plot with the candidate line (b0, b1) = (-10, 2) in red.]
> b0=1
> b1=1
> Z=b0+b1*X
> plot(X,Y)
> lines(X,Z,col="red",lwd=3)
> S=sum((Y-Z)^2)
> S
[1] 76990.82

[Scatter plot with the candidate line (b0, b1) = (1, 1) in red.]
> b0=1
> b1=-1
> Z=b0+b1*X
> plot(X,Y)
> lines(X,Z,col="red",lwd=3)
> S=sum((Y-Z)^2)
> S
[1] 501859.2

[Scatter plot with the candidate line (b0, b1) = (1, -1) in red.]
least squares estimation of parameters
We will calculate the least squares estimates by using the closed-form
formula and the R command lm, separately, and compare the two results.
> sxx=var(X)     # sample variance of X
> sxy=cov(X,Y)   # sample covariance of X and Y; b1 = sxy/sxx
> b1=sxy/sxx
> b0=mean(Y)-b1*mean(X)
> b0
[1] 1.127409
> b1
[1] 1.991069
> fit=lm(Y~X)
> fit$coefficients
(Intercept) X
1.127409 1.991069
The estimates are close to the true values $(\beta_0, \beta_1) = (1, 2)$.
least squares estimation of parameters
The lm function is the main command to fit a linear regression model.
The basic arguments are
lm(formula, data, ...)
There are many other arguments, which will be introduced later.
The result of the command is a list that contains at least the
following components:
coefficients a named vector of coefficients
residuals the residuals, that is response minus fitted values.
fitted.values the fitted mean values.
.................................
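For example, the components can be extracted with the $ operator or with the matching extractor functions coef, residuals, and fitted:

> coef(fit)             # same as fit$coefficients
> head(residuals(fit))  # same as head(fit$residuals)
> head(fitted(fit))     # same as head(fit$fitted.values)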
> plot(X,Y)
> lines(X,Z,col="blue",lwd=6)
> lines(X,fit$fitted.values,col="red",lwd=3)
[Scatter plot with the true line in blue (thick) and the fitted line in red (thin); the two lines nearly coincide.]
least squares estimation of parameters
We now calculate the sum of squared residuals for the fitted line:
> sum(fit$residuals^2)
[1] 23578.75
This is slightly smaller than the sum of squares, 23587.24, for the true
values. This is expected: the least squares estimates minimize the sum
of squares, so even the true line cannot give a smaller value.
