
Linear Statistical Analysis I

Fall 2013
Outline
Simple Linear Regression
Ordinary Least Squares Estimation
Simple Linear Regression
There is only one predictor X and one response Y.
Example (Heights data): There are 1375 mother-daughter pairs.
The heights of each pair were measured.
> heights=read.table("heights.txt",header=T)
> heights
Mheight Dheight
1 59.7 55.1
2 58.2 56.5
3 60.6 56.0
...................
...................
1374 70.8 71.0
1375 63.0 73.1
Simple Linear Regression
A natural question is whether a mother's height affects her
daughter's height. If so, can we predict a daughter's height from
her mother's height? Hence,
X = mother's height, Y = daughter's height
We examine the relationship between the two variables visually.
> X=heights$Mheight
> Y=heights$Dheight
> plot(X,Y, xlab="Mother's heights", ylab="Daughter's heights")
[Scatter plot of Daughter's heights against Mother's heights, both ranging from about 55 to 70.]
There is a linear trend between X and Y; that is, the points are
distributed around the blue line.
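A line like the blue one can be added with abline; a minimal sketch, assuming the least squares line (the command used for the original figure is not shown):

> plot(X,Y, xlab="Mother's heights", ylab="Daughter's heights")
> abline(lm(Y~X), col="blue", lwd=3)   # least squares line through the scatter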
[The same scatter plot with a blue line summarizing the linear trend.]
Simple Linear Regression
It is reasonable to assume that, at least in the region $X \in [55, 70]$,
there is a linear relationship between X and Y, but due to other
factors or random effects, the linear relationship is not exact.
The linear regression model is
$$Y = \beta_0 + \beta_1 X + \varepsilon,$$
where $\beta_0$ and $\beta_1$ are two unknown parameters, the intercept and the
slope, and $\varepsilon$ has mean equal to 0 and variance equal to $\sigma^2$.
In total, we have three parameters: $\beta_0$, $\beta_1$, and $\sigma^2$.
In order to completely determine the model, we have to specify
their values.
Simple Linear Regression
We do not know the true values of the parameters. All the
information we have about the two variables is the data (the
observations).
We can only find values for the parameters based on the data.
This procedure is called estimation of the parameters. The values we
find are called estimates and are denoted by $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\sigma}^2$.
Of course, we hope the estimates are as close to the true values
as possible.
Before I introduce the method of estimation, I show you the effect
of $\sigma^2$ on the patterns in the data.
Simple Linear Regression
We will use simulation to study the effect.
Simulation means that, given a statistical model, we generate
data from the model by using random number generating methods
in statistical software.
Suppose we have a linear regression model:
$$Y = 1 + 2X + \varepsilon,$$
where $X \sim N(5, 5^2)$ and $\varepsilon \sim N(0, \sigma^2)$.
We will simulate data from this model for $\sigma = 0, 1, 5, 10, 30$,
separately.
> X=rnorm(1000,5,5)
> e=rep(0,1000)
> Y=1+2*X+e
> Z=1+2*X
> plot(X,Y)
> lines(X,Z,col="blue",lwd=3)

[Scatter plot for σ = 0: every point lies exactly on the blue line Y = 1 + 2X.]
> X=rnorm(1000,5,5)
> e=rnorm(1000,0,1)
> Y=1+2*X+e
> Z=1+2*X
> plot(X,Y)
> lines(X,Z,col="blue",lwd=3)

[Scatter plot for σ = 1: the points lie in a tight band around the blue line.]
> X=rnorm(1000,5,5)
> e=rnorm(1000,0,5)
> Y=1+2*X+e
> Z=1+2*X
> plot(X,Y)
> lines(X,Z,col="blue",lwd=3)

[Scatter plot for σ = 5: the points scatter more widely around the blue line, but the linear trend is clear.]
> X=rnorm(1000,5,5)
> e=rnorm(1000,0,10)
> Y=1+2*X+e
> Z=1+2*X
> plot(X,Y)
> lines(X,Z,col="blue",lwd=3)

[Scatter plot for σ = 10: the scatter is wide; the linear trend is still visible.]
> X=rnorm(1000,5,5)
> e=rnorm(1000,0,30)
> Y=1+2*X+e
> Z=1+2*X
> plot(X,Y)
> lines(X,Z,col="blue",lwd=3)

[Scatter plot for σ = 30: the scatter is so wide that the linear pattern is hard to see.]
Simple Linear Regression
As $\sigma$ increases, the linear pattern becomes weaker and
eventually disappears.
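All five simulations differ only in the value of $\sigma$, so they can be reproduced with a single loop; a minimal sketch (note that rnorm with sd = 0 returns the mean exactly, which matches e=rep(0,1000) above):

X = rnorm(1000, 5, 5)
Z = 1 + 2*X                       # the true line
for (sigma in c(0, 1, 5, 10, 30)) {
  e = rnorm(1000, 0, sigma)       # sigma = 0 gives e identically 0
  Y = 1 + 2*X + e
  plot(X, Y, main = paste("sigma =", sigma))
  lines(X, Z, col = "blue", lwd = 3)
}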
(ordinary) least squares estimation of parameters
The least squares method goes back to the early 19th century,
when people tried to calculate the orbits of heavenly bodies. The
famous mathematician Carl Friedrich Gauss published this method.
An early demonstration of the strength of the method came when
it was used to predict the future location of the newly discovered
asteroid Ceres.
Roughly speaking, this method estimates the parameters by
minimizing the sum of squared residuals, where a residual is the
difference between the value calculated from the equation and
the observed value.
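For simple linear regression, the criterion can be written explicitly. Given observations $(x_1, y_1), \ldots, (x_n, y_n)$, the least squares estimates minimize
$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.$$
Setting the partial derivatives with respect to $\beta_0$ and $\beta_1$ to zero gives the closed form
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
which is the formula we use together with cov and var in R later.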
least squares estimation of parameters
Why does the least squares method provide good estimates, that
is, estimates that are close to the true values of the parameters?
Although theoretical analysis can demonstrate that this method
finds good estimates, we will not perform the analysis in this
class.
Instead, we will use an example to illustrate the effectiveness of this
method.
Consider the linear regression model:
$$Y = 1 + 2X + \varepsilon,$$
where $X \sim N(5, 5^2)$ and $\varepsilon \sim N(0, 5^2)$.
Hence, the true value of $(\beta_0, \beta_1)$ is $(1, 2)$. We will consider several
different values of $(\beta_0, \beta_1)$, calculate the sum of squares S for each, and
compare them with the sum of squares for the true value.
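The comparisons on the following slides can be organized with a small helper function; a minimal sketch (the name SS is ours, introduced only for illustration):

# Sum of squared residuals for a candidate line b0 + b1*X
SS = function(b0, b1, X, Y) sum((Y - (b0 + b1*X))^2)

X = rnorm(1000, 5, 5)
Y = 1 + 2*X + rnorm(1000, 0, 5)
SS(1, 2, X, Y)     # true values: smallest S among these candidates
SS(0, 2, X, Y)     # wrong intercept
SS(-10, 2, X, Y)   # intercept far off
SS(1, 1, X, Y)     # wrong slope
SS(1, -1, X, Y)    # slope of the wrong sign

Larger departures from the true (1, 2) should give larger sums of squares, as the following slides confirm.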
> X=rnorm(1000,5,5)
> e=rnorm(1000,0,5)
> Y=1+2*X+e
> b0=1
> b1=2
> Z=b0+b1*X
> plot(X,Y)
> lines(X,Z,col="blue",lwd=3)
> S=sum((Y-Z)^2)
> S
[1] 23587.24

[Scatter plot with the true line (b0, b1) = (1, 2) in blue.]
> b0=0
> b1=2
> Z=b0+b1*X
> plot(X,Y)
> lines(X,Z,col="red",lwd=3)
> S=sum((Y-Z)^2)
> S
[1] 24748.12

[Scatter plot with the candidate line (b0, b1) = (0, 2) in red.]
> b0=-10
> b1=2
> Z=b0+b1*X
> plot(X,Y)
> lines(X,Z,col="red",lwd=3)
> S=sum((Y-Z)^2)
> S
[1] 146356.9

[Scatter plot with the candidate line (b0, b1) = (-10, 2) in red.]
> b0=1
> b1=1
> Z=b0+b1*X
> plot(X,Y)
> lines(X,Z,col="red",lwd=3)
> S=sum((Y-Z)^2)
> S
[1] 76990.82

[Scatter plot with the candidate line (b0, b1) = (1, 1) in red.]
> b0=1
> b1=-1
> Z=b0+b1*X
> plot(X,Y)
> lines(X,Z,col="red",lwd=3)
> S=sum((Y-Z)^2)
> S
[1] 501859.2

[Scatter plot with the candidate line (b0, b1) = (1, -1) in red.]
least squares estimation of parameters
We will calculate the least squares estimates by using the closed-form
formula and the R command lm, separately, and compare the two results.
> sxx=var(X)     # sample variance of X
> sxy=cov(X,Y)   # sample covariance of X and Y; b1 = sxy/sxx
> b1=sxy/sxx
> b0=mean(Y)-b1*mean(X)
> b0
[1] 1.127409
> b1
[1] 1.991069
> fit=lm(Y~X)
> fit$coefficients
(Intercept) X
1.127409 1.991069
The estimates are close to the true values $(\beta_0, \beta_1) = (1, 2)$.
least squares estimation of parameters
The lm function is the main command to fit a linear regression model.
The basic arguments are
lm(formula, data, ...)
There are many other arguments, which will be introduced later.
The result of the command is a list that contains at least the
following components:
coefficients a named vector of coefficients
residuals the residuals, that is response minus fitted values.
fitted.values the fitted mean values.
.................................
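For example, the components can be extracted with the $ operator or with the matching extractor functions coef, residuals, and fitted:

> coef(fit)             # same as fit$coefficients
> head(residuals(fit))  # same as head(fit$residuals)
> head(fitted(fit))     # same as head(fit$fitted.values)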
> plot(X,Y)
> lines(X,Z,col="blue",lwd=6)
> lines(X,fit$fitted.values,col="red",lwd=3)
[Scatter plot with the true line in blue (thick) and the fitted line in red (thin); the two lines nearly coincide.]
least squares estimation of parameters
We now calculate the sum of squared residuals for the fitted line:
> sum(fit$residuals^2)
[1] 23578.75
This is slightly smaller than the sum of squares, 23587.24, for the true
values. This is expected: the least squares estimates minimize the sum
of squares, so even the true line cannot give a smaller value.
