
Chapter 2

Linear Regression Models, OLS, Assumptions and


Properties

2.1 The Linear Regression Model

The linear regression model is the single most useful tool in the econometrician’s kit.
The multiple regression model is the study of the relationship between a dependent
variable and one or more independent variables. In general it can be written as:

y = f (x1 , x2 , . . . , xK ) + ε (2.1)
= x1 β1 + x2 β2 + · · · + xK βK + ε.

The random disturbance ε arises because we cannot capture every influence on


an economic variable in a model. The net effect of the omitted variables is captured
by the disturbance. Notice that y is the summation of two parts, the deterministic
part and the random part, ε. The objective is to estimate the unknown parameters
(β1 , β2 , . . . , βK ) in this model.

2.2 Assumptions

The classical linear regression model consists of a set of assumptions about how a data set
is produced by the underlying ‘data-generating process.’ The assumptions are:
A1. Linearity
A2. Full rank
A3. Exogeneity of the independent variables
A4. Homoscedasticity and nonautocorrelation
A5. Data generation
A6. Normal distribution


2.2.1 Linearity

The model specifies a linear relationship between y and x1 , x2 , . . . , xK . The column
vector xk contains the n observations on the variable xk , k = 1, 2, . . . , K, and these
columns can be collected in a single n × K data matrix X. When the model is estimated
with a constant term, the first column of X is assumed to be a column of ones, making
β1 the coefficient associated with the constant of the model. The vector y contains
the n observations y1 , y2 , . . . , yn , and the vector ε contains all n disturbances.
The model can be represented as:

$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}_{n \times 1}
=
\begin{bmatrix}
1 & x_{11} & x_{21} & \dots & x_{K1} \\
1 & x_{12} & x_{22} & \dots & x_{K2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{1n} & x_{2n} & \dots & x_{Kn}
\end{bmatrix}_{n \times K}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{bmatrix}_{K \times 1}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}_{n \times 1}
$$

In matrix form the model can be written as:

ASSUMPTION 1: y = Xβ + ε (2.2)
Notice that the assumption means that Equation 2.2 may hold either in terms of the
original variables or after some transformation of them. For example, consider the following
two equations:

y = Ax^β e^ε
y = Ax^β + ε.

While the first equation can be made linear by taking logs, the second equation is
not linear. Typical examples include the constant elasticity model:

ln y = β1 + β2 ln x2 + β3 ln x3 + · · · + βK ln xK + ε

where the elasticity of y with respect to xk is given by ∂ ln y/∂ ln xk = βk . Another
common model is the semilog, ln y = β1 + β2 x2 + · · · + βK xK + ε, in which βk measures
the proportional change in y associated with a one-unit change in xk .
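As an illustration of the first of these specifications, the following Python sketch (simulated data; the values of A, β, and the noise scale are purely illustrative assumptions, not taken from the text) estimates a constant elasticity model by applying OLS to the logged variables.

```python
import numpy as np

# Sketch: simulate y = A * x^beta * e^eps, then estimate by OLS on logs.
rng = np.random.default_rng(0)           # assumed seed, for reproducibility
n = 500
x = rng.uniform(1.0, 10.0, size=n)       # hypothetical regressor
eps = rng.normal(0.0, 0.1, size=n)       # disturbance
A, beta = 2.0, 0.7                       # assumed "true" parameters
y = A * x**beta * np.exp(eps)

# ln y = ln A + beta * ln x + eps is linear in the parameters,
# so OLS on the logged variables recovers ln A and beta.
X = np.column_stack([np.ones(n), np.log(x)])
b = np.linalg.solve(X.T @ X, X.T @ np.log(y))
print(b)                                 # approximately (ln 2, 0.7)
```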

2.2.2 Full rank

The full rank, or identification, condition means that there are no exact linear relation-
ships among the variables.

ASSUMPTION 2: X is an n × K matrix with rank K. (2.3)

That X has full column rank means that the columns of X are linearly independent and
that there are at least K observations.

2.2.3 Exogeneity of the independent variables

Exogeneity of the independent variables means that each of the disturbance terms is
assumed to have zero expected value conditional on X. This can be written as:

ASSUMPTION 3: E[ε|X] = 0. (2.4)

This assumption means that the disturbances are purely random draws from some
population and that no observation on x conveys information about the expected
value of the disturbance (ε and X are uncorrelated). Given Assumption 3, Xβ is
the conditional mean function because Assumption 3 implies that:

E[y|X] = Xβ . (2.5)

2.2.4 Homoscedasticity and nonautocorrelation

The combination of homoscedasticity and nonautocorrelation is also known as


spherical disturbances and it refers to the variances and covariances of the distur-
bances. Homoscedasticity means constant variance and can be written as:

Var[εi |X] = σ² for all i = 1, 2, . . . , n,
Cov[εi , εj |X] = 0 for all i ≠ j.

Typical examples of regression models with heteroscedastic errors are models of household
expenditures and of firms' profits. Autocorrelation (or serial correlation), on the other
hand, means correlation between the errors of different observations, typically in different
time periods. Hence, autocorrelation is usually a problem with time series or panel data.
Under spherical disturbances the covariance matrix of the disturbance vector is:

$$
E[\varepsilon\varepsilon' \mid X] =
\begin{bmatrix}
\sigma^2 & 0 & \dots & 0 \\
0 & \sigma^2 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \sigma^2
\end{bmatrix}
$$

which can be written as:

ASSUMPTION 4: E[εε′ |X] = σ²I. (2.6)



2.2.5 Data generation

It is mathematically convenient to assume that xi is nonstochastic, as in an agricultural
experiment where yi is the yield and xi is the amount of fertilizer and water applied.
However, social scientists are very likely to work with stochastic xi . The assumption we
will use is that X can be a mixture of constants and random variables, and that the mean
and variance of εi are both independent of all elements of X.

ASSUMPTION 5: X may be fixed or random. (2.7)

2.2.6 Normal distribution

The last assumption is convenient, but not necessary, to obtain many of the results of
the linear regression model: the disturbances follow a normal distribution with zero mean
and constant variance. That is, it adds normality to Assumptions 3 and 4.

ASSUMPTION 6: ε|X ∼ N[0, σ²I]. (2.8)


The assumptions of the linear regression model are summarized in Figure 2.1.

Fig. 2.1 Classical Regression Model, from [Greene (2008)].



2.3 Ordinary Least Squares Regression

The first distinction needed at this point is between population parameters and sam-
ple estimates. From the previous discussion we have β and εi as population parameters,
and we use b and ei as their sample estimates. For the population regression
we have E[yi |xi ] = x′i β ; however, β is unknown, so we use its estimate b. Therefore,
the sample estimate of E[yi |xi ] is:
ŷi = x′i b. (2.9)
For observation i, the (population) disturbance term is given by:

εi = yi − x′i β . (2.10)

Once b has been estimated, the estimate of the disturbance term εi is its sample counter-
part, the residual:¹
ei = yi − x′i b. (2.11)
It follows that
yi = x′i β + εi = x′i b + ei . (2.12)
A graphical summary of this discussion is presented in Figure 2.2. This figure shows
the simple example of a single regressor.

2.3.1 Least Squares Coefficients

The problem at hand is to obtain an estimate of the unknown population vector β
based on the sample data (yi , xi ) for i = 1, 2, . . . , n. In this section we derive the
least squares estimator of β , denoted by b. By definition, the least squares
coefficient vector minimizes the sum of squared residuals:
∑ni=1 e2i0 = ∑ni=1 (yi − x′i b0 )2 . (2.13)

The idea is to pick the vector b0 that makes the summation in Equation 2.13 the
smallest. In matrix notation:

min_{b0} S(b0 ) = e′0 e0 = (y − Xb0 )′ (y − Xb0 ) (2.14)
               = y′ y − b′0 X′ y − y′ Xb0 + b′0 X′ Xb0
               = y′ y − 2y′ Xb0 + b′0 X′ Xb0 .

1 [Dougherty (2007)] follows a similar notation, but most textbooks, e.g. [Wooldridge (2009)], use
β̂ as the sample estimate of β .
10 2 Linear Regression Models, OLS, Assumptions and Properties

Fig. 2.2 Population and Sample Regression, from [Greene (2008)].

The first order necessary condition is:

∂ S(b0 )/∂ b0 = −2X′ y + 2X′ Xb0 = 0. (2.15)

Let b be the solution. Then, given that X has full rank, (X′ X)−1 exists and the solution
is:
b = (X′ X)−1 X′ y. (2.16)
The second order condition is:
∂ ²S(b0 )/∂ b0 ∂ b′0 = 2X′ X, (2.17)

which is satisfied if 2X′ X is a positive definite matrix. This is the case when X has full
rank, so the least squares solution b is unique and it minimizes the sum of squared
residuals.
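As a quick numerical check of Equation 2.16, the sketch below computes b on simulated data (the sample size, coefficients, and noise level are illustrative assumptions).

```python
import numpy as np

# Sketch: compute b = (X'X)^{-1} X'y on simulated data.
rng = np.random.default_rng(1)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # constant + 2 regressors
beta = np.array([1.0, 0.5, -2.0])        # assumed "true" coefficients
y = X @ beta + rng.normal(scale=0.5, size=n)

# Solving the normal equations (X'X) b = X'y directly is numerically
# preferable to forming the inverse explicitly.
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)                                  # close to beta
```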

Example 1 Derivation of the least squares coefficient estimators for the simple
case of a single regressor and a constant.

yi = b0 + b1 xi + ei (2.18)

ŷi = b0 + b1 xi

For observation i we obtain the residual, then square it and finally sum across
all observations to obtain the sum of squared residuals:

ei = yi − ŷi (2.19)
e2i = (yi − ŷi )2
∑ni=1 e2i = ∑ni=1 (yi − ŷi )2

Again, the coefficients b0 and b1 are chosen to minimize the sum of squared
residuals:

min_{b0 ,b1} ∑ni=1 (yi − ŷi )2 (2.20)
min_{b0 ,b1} ∑ni=1 (yi − b0 − b1 xi )2

The first order necessary conditions are:


−2 ∑ni=1 (yi − b0 − b1 xi ) = 0 w.r.t. b0 (2.21)
−2 ∑ni=1 xi (yi − b0 − b1 xi ) = 0 w.r.t. b1 (2.22)

Dividing Equation 2.21 by n and rearranging, we obtain the OLS estimator of the
constant:

b0 = ȳ − b1 x̄. (2.23)

Plugging this result into Equation 2.22 we obtain:

b1 = ∑ni=1 (xi − x̄)(yi − ȳ) / ∑ni=1 (xi − x̄)2 . (2.24)
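A small sketch (with made-up data) confirming that the single-regressor formulas in Equations 2.23 and 2.24 coincide with the matrix solution b = (X′ X)−1 X′ y:

```python
import numpy as np

# Sketch: simple-regression formulas vs. the matrix OLS solution.
rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=100)   # assumed intercept and slope

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

X = np.column_stack([np.ones_like(x), x])
b_matrix = np.linalg.solve(X.T @ X, X.T @ y)
print(b0, b1)            # equal (up to rounding) to b_matrix[0], b_matrix[1]
```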

2.3.2 Normal equations

From the first order conditions in Equation 2.15 we can obtain the normal equations:

X′ Xb − X′ y = −X′ (y − Xb) = −X′ e = 0. (2.25)



Therefore, from X′ e = 0 we can derive a number of properties (verified numerically in the sketch below):

1. The observed values of X are uncorrelated with the residuals. For every column
of X, x′k e = 0.
In addition, if the regression includes a constant:
2. The sum of the residuals is zero: x′1 e = i′ e = ∑i ei = 0.
3. The sample mean of the residuals is zero: ē = (1/n) ∑i ei = 0.
4. The regression hyperplane passes through the means of the data. This follows
from ē = 0. Recall that e = y − Xb. Averaging over observations gives ē = ȳ − x̄′ b,
and since ē = 0 this implies that ȳ = x̄′ b.
5. The predicted values of y are uncorrelated with the residuals: ŷ′ e = (Xb)′ e =
b′ X′ e = 0.
6. The mean of the fitted values is equal to the mean of the actual values. Because
y = ŷ + e and ∑i ei = 0, the mean of ŷ equals ȳ.
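The following sketch (simulated data, coefficients chosen arbitrarily) verifies these properties numerically for a regression that includes a constant.

```python
import numpy as np

# Sketch: numerical check of the properties implied by X'e = 0.
rng = np.random.default_rng(3)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
yhat = X @ b

print(np.allclose(X.T @ e, 0))              # regressors orthogonal to residuals
print(np.isclose(e.mean(), 0))              # residuals sum (and average) to zero
print(np.isclose(yhat @ e, 0))              # fitted values orthogonal to residuals
print(np.isclose(yhat.mean(), y.mean()))    # mean of fitted values equals mean of y
```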

2.3.3 Projection matrix

The matrix M (residual maker) is fundamental in regression analysis. It is given by:

M = I − X(X′ X)−1 X′ . (2.26)

When it premultiplies y, it generates the vector of least squares residuals from a
regression of y on X. It can be derived directly from the least squares residuals:

e = y − Xb (2.27)
′ −1 ′
= y − X(X X) X y
= (I − X(X′ X)−1 X′ )y
= My.

M is a symmetric (M = M′ ) and idempotent (M = M2 ) n × n matrix. For example,


it is useful to see that if we regress X on X we have a perfect fit and the residuals
should be zero:
MX = 0. (2.28)
The projection matrix P is also a symmetric and idempotent matrix formed from X.
When y is premultiplied by P, the result is the vector of fitted values ŷ in the regression
of y on X. It is given by:
P = X(X′ X)−1 X′ . (2.29)
It can be obtained by starting from the equation y = Xb + e. We know that ŷ = Xb,
so y = ŷ + e, which gives:

ŷ = y − e (2.30)
= y − My

= (I − M)y
= X(X′ X)−1 X′ y
= Py.

Notice that M = I − P. In addition, M and P are orthogonal (PM = MP = 0), and in a
regression of X on X the fitted values are X itself, that is, PX = X. From these results,
we can see that the least squares regression partitions the
vector y into two orthogonal parts, the projection and the residual,

y = Py + My. (2.31)
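A brief sketch (arbitrary simulated data) that constructs M and P and checks the properties discussed above:

```python
import numpy as np

# Sketch: residual maker M and projection matrix P.
rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection matrix
M = np.eye(n) - P                        # residual maker

print(np.allclose(M, M.T), np.allclose(M, M @ M))    # symmetric and idempotent
print(np.allclose(M @ X, 0), np.allclose(P @ X, X))  # MX = 0 and PX = X
print(np.allclose(P @ M, 0))                         # P and M are orthogonal
print(np.allclose(y, P @ y + M @ y))                 # y = Py + My
```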

2.3.4 Goodness of fit and analysis of variance

The variation of the dependent variable is captured in terms of deviations from its
mean, yi − ȳ. The total variation in y is then the sum of squared deviations:

SST = ∑ni=1 (yi − ȳ)2 . (2.32)

To decompose this sum of squared deviations into the part the regression model explains
and the part it does not explain, we first look at a single observation
to get some intuition. For observation i we have:

yi = ŷi + ei = x′i b + ei . (2.33)

Subtracting ȳ we obtain:

yi − ȳ = ŷi − ȳ + ei = (xi − x̄)′ b + ei . (2.34)

Figure 2.3 illustrates the intuition for the case of a single regressor.
Let the symmetric and idempotent matrix M0 have (1 − 1/n) in all its diagonal
elements and −1/n in all its off-diagonal elements:

$$
\mathbf{M}^0 = \mathbf{I} - \frac{1}{n}\,\mathbf{i}\mathbf{i}' =
\begin{bmatrix}
1 - \tfrac{1}{n} & -\tfrac{1}{n} & \dots & -\tfrac{1}{n} \\
-\tfrac{1}{n} & 1 - \tfrac{1}{n} & \dots & -\tfrac{1}{n} \\
\vdots & \vdots & \ddots & \vdots \\
-\tfrac{1}{n} & -\tfrac{1}{n} & \dots & 1 - \tfrac{1}{n}
\end{bmatrix}
$$

M0 transforms observations into deviations from sample means, and it is therefore useful
in computing sums of squared deviations. For example, the sum of squared deviations
about the mean for the variable x is:

∑ni=1 (xi − x̄)2 = x′ M0 x. (2.35)

Fig. 2.3 Decomposition of yi , from [Greene (2008)].

Now, if we start with y = Xb + e and premultiply it by M0 we obtain:

M0 y = M0 Xb + M0 e (2.36)

Then, we transpose this equation to obtain:

y′ M0 = b′ X′ M0 + e′ M0 (2.37)

Premultiply Equation 2.36 by Equation 2.37:

y′ M0 y = (b′ X′ M0 + e′ M0 )(M0 Xb + M0 e) (2.38)


= b′ X′ M0 Xb + b′ X′ M0 e + e′ M0 Xb + e′ M0 e
= b′ X′ M0 Xb + e′ e.

The second term on the right-hand side of the expansion is zero because M0 e = e (the
residuals have zero mean when the regression includes a constant) and X′ e = 0, while the
third term is zero because e′ M0 X = e′ X = 0 (the regressors are orthogonal to the
residuals). Equation 2.38 therefore shows the decomposition of the total sum
of squares into the regression sum of squares plus the error sum of squares:

SST = SSR + SSE (2.39)



If we calculate the fraction of the total variation in y that is explained by the model,
we are talking about the coefficient of determination, R2 :

R2 = SSR/SST = (b′ X′ M0 Xb)/(y′ M0 y) = 1 − (e′ e)/(y′ M0 y) (2.40)

As we include additional variables in the model, the R2 never decreases. Hence, especially
in small samples, a better measure of fit is the adjusted R2 , denoted R̄2 :

R̄2 = 1 − [e′ e/(n − K)] / [y′ M0 y/(n − 1)] (2.41)
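A sketch (simulated data, arbitrary coefficients) computing R² and the adjusted R² from the sums of squares:

```python
import numpy as np

# Sketch: R^2 and adjusted R^2 via the decomposition SST = SSR + SSE.
rng = np.random.default_rng(5)
n, K = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 0.8, -0.3, 0.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
sst = np.sum((y - y.mean()) ** 2)       # y'M0y, total sum of squares
sse = e @ e                             # e'e, error sum of squares
r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - K)) / (sst / (n - 1))
print(r2, r2_adj)
```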

2.4 Properties of OLS

2.4.1 Unbiasedness

The least squares estimator b is unbiased for every sample size n.


b = (X′ X)−1 X′ y (2.42)
  = (X′ X)−1 X′ (Xβ + ε)
  = β + (X′ X)−1 X′ ε
E[b|X] = β + (X′ X)−1 X′ E[ε|X]
       = β .

The second term is zero after taking expectations because, by Assumption 3, E[ε|X] = 0.
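A small Monte Carlo sketch illustrating unbiasedness (X is held fixed across replications and only the disturbances are redrawn; the parameter values and number of replications are arbitrary assumptions):

```python
import numpy as np

# Sketch: the average of b over many replications is close to beta.
rng = np.random.default_rng(6)
n = 80
beta = np.array([1.0, 2.0, -1.5])                 # assumed true parameters
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

draws = []
for _ in range(5000):
    eps = rng.normal(size=n)
    y = X @ beta + eps
    draws.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.mean(draws, axis=0))                     # approximately beta
```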

2.4.2 Variance and the Gauss-Markov Theorem

It is relatively simple to derive the sampling variance of the OLS estimators. How-
ever, the key assumption in the derivation is that the matrix X is constant. If X is
not constant, then the expectations should be taken conditional on the observed X.
From the derivation of the unbiasedness of OLS we have that b − β = (X′ X)−1 X′ ε.
Using this in the variance-covariance matrix of the OLS estimator, we have:

Var[b|X] = E[(b − β )(b − β )′ |X]
         = E[((X′ X)−1 X′ ε)((X′ X)−1 X′ ε)′ |X]
         = E[((X′ X)−1 X′ ε)(ε ′ X(X′ X)−1 )|X]
         = (X′ X)−1 X′ E[εε ′ |X]X(X′ X)−1
         = (X′ X)−1 X′ (σ²I)X(X′ X)−1
         = σ² (X′ X)−1 (2.43)

Gauss-Markov Theorem. In a linear regression model (with spherical distur-


bances), the Best Linear Unbiased Estimator (BLUE) of the coefficients is the ordi-
nary least squares estimator.
In the Gauss-Markov Theorem, best refers to minimum variance: among all linear unbiased
estimators, the OLS estimator has the smallest variance. In addition, the errors do not need
to have a normal distribution, and X can be either stochastic or nonstochastic.

2.4.3 Estimating the Variance

In order to obtain a sample estimate of the variance-covariance matrix presented in
Equation 2.43, we need an estimate of the population parameter σ². We can use:

σ̂² = (1/n) ∑ni=1 e2i , (2.44)

which makes sense because ei is the sample estimate of εi , and σ² is the expected
value of εi2 . However, this estimator is biased because the residuals are computed from
the estimate b rather than the unknown β . To obtain an unbiased estimator of σ² we can
start with the expected value of the sum of squared residuals. Recall that
e = My = M[Xβ + ε] = Mε. Then, the sum of squared residuals is:

e′ e = ε ′ Mε (2.45)
E[e′ e|X] = E[ε ′ Mε|X]

= E[tr(ε ′ Mε)|X]
= E[tr(Mεε ′ )|X]
= tr(ME[εε ′ |X])
= tr(Mσ 2 I)
= σ 2 tr(M)
= σ 2 tr(In − X(X′ X)−1 X′ )
= σ 2 [tr(In ) − tr(X(X′ X)−1 X′ )]
= σ 2 [tr(In ) − tr(IK )]
= σ 2 (n − K),

where ε ′ Mε is a scalar (a 1 × 1 matrix), so it is equal to its trace, and the step
from the third to the fourth line follows from the cyclic permutation property of the trace.
From Equation 2.45 we obtain the unbiased estimator of σ 2 :

s2 = e′ e/(n − K). (2.46)
Hence, the standard errors of the estimators b can be obtained by first obtaining an
estimate of σ 2 using Equation 2.46 and then plugging s2 into Equation 2.43.
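A sketch (simulated data) that computes s² and the implied standard errors of b, following Equations 2.43 and 2.46:

```python
import numpy as np

# Sketch: unbiased variance estimate s^2 and standard errors from s^2 (X'X)^{-1}.
rng = np.random.default_rng(7)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n - K)                       # unbiased estimator of sigma^2
var_b = s2 * np.linalg.inv(X.T @ X)          # estimated Var[b|X]
se_b = np.sqrt(np.diag(var_b))               # standard errors of the coefficients
print(b, se_b)
```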

2.4.4 Statistical Inference

Given that b is a linear function of ε, if ε has a multivariate normal distribution we


have that:
b|X ∼ N[β , σ 2 (X′ X)−1 ]. (2.47)

2.4.4.1 Hypothesis Testing

Assuming normality conditional on X and with Skk being the kth diagonal element
of (X′ X)−1 , we have that:

zk = (bk − βk )/√(σ² Skk ) (2.48)

has a standard normal distribution. However, σ² is an unknown population parameter.
Hence, we use:

tk = (bk − βk )/√(s² Skk ) (2.49)

which has a t distribution with (n − K) degrees of freedom. We use Equation 2.49 for
hypothesis testing about the elements of β .
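A sketch of the t test of Equation 2.49 for the null hypothesis βk = 0 (simulated data; scipy is assumed to be available for the t distribution):

```python
import numpy as np
from scipy import stats

# Sketch: t statistics and two-sided p-values for H0: beta_k = 0.
rng = np.random.default_rng(8)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([0.5, 1.0, 0.0]) + rng.normal(size=n)   # last coefficient truly zero

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n - K)
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

t_stats = b / se                                  # (b_k - 0) / sqrt(s^2 S_kk)
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - K)
print(t_stats, p_values)
```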

2.4.4.2 Confidence Intervals

Based on Equation 2.49 we can obtain the (1 − α) confidence interval for the popu-
lation parameter βk using:

P(bk − tα/2,n−K sbk ≤ βk ≤ bk + tα/2,n−K sbk ) = 1 − α. (2.50)

This equation says that intervals constructed as bk ± tα/2,n−K sbk will contain the true
population parameter βk in (1 − α) × 100% of repeated samples, where sbk is the standard
error of bk and tα/2,n−K is the critical value from the t distribution with (n − K) degrees
of freedom. This is illustrated in Figure 2.4.
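A sketch of the (1 − α) confidence interval of Equation 2.50, again on simulated data with α = 0.05:

```python
import numpy as np
from scipy import stats

# Sketch: 95% confidence intervals for each coefficient.
rng = np.random.default_rng(9)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n - K)
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - K)    # critical value t_{alpha/2, n-K}
lower, upper = b - t_crit * se, b + t_crit * se
print(np.column_stack([lower, upper]))
```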

Fig. 2.4 Confidence Intervals.

References

[Dougherty (2007)] Dougherty, C., 2007. Introduction to Econometrics. 3rd ed. New York: Oxford University Press.
[Greene (2008)] Greene, W.H., 2008. Econometric Analysis. 6th ed. New Jersey: Pearson Prentice Hall.
[Wooldridge (2009)] Wooldridge, J.M., 2009. Introductory Econometrics: A Modern Approach. 4th ed. New York: South-Western Publishers.
