
Chapter 2: Maximum Likelihood Estimation

Advanced Econometrics - HEC Lausanne

Christophe Hurlin

University of Orléans

December 9, 2013

Section 1

Introduction

1. Introduction

Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a model. It is one of the most widely used estimation methods.

The method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the "agreement" of the selected model with the observed data.

Maximum likelihood estimation gives a unified approach to estimation.
1. Introduction

What are the main properties of the maximum likelihood estimator?

- Is it asymptotically unbiased?
- Is it asymptotically efficient? Under which condition(s)?
- Is it consistent?
- What is its asymptotic distribution?

How do we apply the maximum likelihood principle to the multiple linear regression model, to the Probit/Logit models, etc.?

All of these questions are answered in this lecture.
1. Introduction

The outline of this chapter is the following:

Section 2: The principle of maximum likelihood estimation
Section 3: The likelihood function
Section 4: Maximum likelihood estimator
Section 5: Score, Hessian and Fisher information
Section 6: Properties of maximum likelihood estimators
1. Introduction

References

Amemiya, T. (1985), Advanced Econometrics, Harvard University Press.

Greene, W. (2007), Econometric Analysis, sixth edition, Pearson - Prentice Hall.

Pelgrin, F. (2010), Lecture Notes, Advanced Econometrics, HEC Lausanne (a special thanks).

Ruud, P. (2000), An Introduction to Classical Econometric Theory, Oxford University Press.

Zivot, E. (2001), Maximum Likelihood Estimation, Lecture Notes.
Section 2

The Principle of Maximum Likelihood

2. The Principle of Maximum Likelihood

Objectives
In this section, we present a simple example in order:

1. To introduce the notations.
2. To introduce the notion of likelihood and log-likelihood.
3. To introduce the concept of maximum likelihood estimator.
4. To introduce the concept of maximum likelihood estimate.
2. The Principle of Maximum Likelihood

Example
Suppose that $X_1, X_2, \ldots, X_N$ are i.i.d. discrete random variables, such that $X_i \sim \mathcal{P}(\lambda)$ with a pmf (probability mass function) defined as:
$$\Pr(X_i = x_i) = \frac{\exp(-\lambda)\,\lambda^{x_i}}{x_i!}$$
where $\lambda$ is an unknown parameter to estimate.
2. The Principle of Maximum Likelihood

Question: What is the probability of observing the particular sample $\{x_1, x_2, \ldots, x_N\}$, assuming that a Poisson distribution with as yet unknown parameter $\lambda$ generated the data?

This probability is equal to:
$$\Pr\left((X_1 = x_1) \cap \ldots \cap (X_N = x_N)\right)$$
2. The Principle of Maximum Likelihood

Since the variables $X_i$ are i.i.d., this joint probability is equal to the product of the marginal probabilities:
$$\Pr\left((X_1 = x_1) \cap \ldots \cap (X_N = x_N)\right) = \prod_{i=1}^N \Pr(X_i = x_i)$$

Given the pmf of the Poisson distribution, we have:
$$\Pr\left((X_1 = x_1) \cap \ldots \cap (X_N = x_N)\right) = \prod_{i=1}^N \frac{\exp(-\lambda)\,\lambda^{x_i}}{x_i!} = \exp(-N\lambda)\,\frac{\lambda^{\sum_{i=1}^N x_i}}{\prod_{i=1}^N x_i!}$$
2. The Principle of Maximum Likelihood

Definition
This joint probability is a function of $\lambda$ (the unknown parameter) and corresponds to the likelihood of the sample $\{x_1, \ldots, x_N\}$, denoted by:
$$L_N(\lambda; x_1, \ldots, x_N) = \Pr\left((X_1 = x_1) \cap \ldots \cap (X_N = x_N)\right)$$
with
$$L_N(\lambda; x_1, \ldots, x_N) = \exp(-N\lambda)\,\lambda^{\sum_{i=1}^N x_i} \prod_{i=1}^N \frac{1}{x_i!}$$
2. The Principle of Maximum Likelihood

Example
Let us assume that for $N = 10$ we have a realization of the sample equal to $\{5, 0, 1, 1, 0, 3, 2, 3, 4, 1\}$; then:
$$L_N(\lambda; x_1, \ldots, x_N) = \Pr\left((X_1 = x_1) \cap \ldots \cap (X_N = x_N)\right)$$
$$L_N(\lambda; x_1, \ldots, x_N) = \frac{e^{-10\lambda}\,\lambda^{20}}{207{,}360}$$
2. The Principle of Maximum Likelihood

Question: What value of $\lambda$ would make this sample most probable?
2. The Principle of Maximum Likelihood
This figure plots the function $L_N(\lambda; x)$ for various values of $\lambda$. It has a single mode at $\lambda = 2$, which would be the maximum likelihood estimate, or MLE, of $\lambda$.

[Figure: $L_N(\lambda; x)$ plotted for $\lambda \in [0, 4]$ (vertical axis scaled by $10^{-8}$), with a single peak at $\lambda = 2$.]
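The figure can be reproduced with a short grid search; the sketch below is my own illustration (not from the original slides) and previews the result derived next, namely that the likelihood peaks at the sample mean.

```python
import numpy as np
from math import factorial

x = np.array([5, 0, 1, 1, 0, 3, 2, 3, 4, 1])      # sample from the example
denom = np.prod([factorial(xi) for xi in x])      # prod x_i! = 207,360

def likelihood(lam):
    # L_N(lambda; x) = exp(-N*lambda) * lambda^(sum x_i) / prod(x_i!)
    return np.exp(-len(x) * lam) * lam ** x.sum() / denom

grid = np.linspace(0.01, 4.0, 400)
print(grid[np.argmax(likelihood(grid))])          # approx. 2.0, the sample mean
```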
2. The Principle of Maximum Likelihood

Consider maximizing the likelihood function $L_N(\lambda; x_1, \ldots, x_N)$ with respect to $\lambda$. Since the log function is monotonically increasing, we usually maximize $\ln L_N(\lambda; x_1, \ldots, x_N)$ instead. In this case:
$$\ln L_N(\lambda; x_1, \ldots, x_N) = -N\lambda + \ln(\lambda) \sum_{i=1}^N x_i - \sum_{i=1}^N \ln(x_i!)$$
$$\frac{\partial \ln L_N(\lambda; x_1, \ldots, x_N)}{\partial \lambda} = -N + \frac{1}{\lambda} \sum_{i=1}^N x_i$$
$$\frac{\partial^2 \ln L_N(\lambda; x_1, \ldots, x_N)}{\partial \lambda^2} = -\frac{1}{\lambda^2} \sum_{i=1}^N x_i < 0$$
2. The Principle of Maximum Likelihood

Under suitable regularity conditions, the maximum likelihood estimate (estimator) is defined as:
$$\widehat{\lambda} = \arg\max_{\lambda \in \mathbb{R}^+} \ln L_N(\lambda; x_1, \ldots, x_N)$$

FOC:
$$\left.\frac{\partial \ln L_N(\lambda; x_1, \ldots, x_N)}{\partial \lambda}\right|_{\widehat{\lambda}} = -N + \frac{1}{\widehat{\lambda}} \sum_{i=1}^N x_i = 0 \iff \widehat{\lambda} = \frac{1}{N} \sum_{i=1}^N x_i$$

SOC:
$$\left.\frac{\partial^2 \ln L_N(\lambda; x_1, \ldots, x_N)}{\partial \lambda^2}\right|_{\widehat{\lambda}} = -\frac{1}{\widehat{\lambda}^2} \sum_{i=1}^N x_i < 0$$

$\widehat{\lambda}$ is a maximum.
2. The Principle of Maximum Likelihood

The maximum likelihood estimate (realization) is:
$$\widehat{\lambda}(x) = \frac{1}{N} \sum_{i=1}^N x_i$$

Given the sample $\{5, 0, 1, 1, 0, 3, 2, 3, 4, 1\}$, we have $\widehat{\lambda}(x) = 2$.

The maximum likelihood estimator (random variable) is:
$$\widehat{\lambda} = \frac{1}{N} \sum_{i=1}^N X_i$$
2. The Principle of Maximum Likelihood

Continuous variables
The reference to the probability of observing the given sample is not exact for a continuous distribution, since a particular sample has probability zero. Nonetheless, the principle is the same.

The likelihood function then corresponds to the pdf associated with the joint distribution of $(X_1, X_2, \ldots, X_N)$ evaluated at the point $(x_1, x_2, \ldots, x_N)$:
$$L_N(\theta; x_1, \ldots, x_N) = f_{X_1, \ldots, X_N}(x_1, x_2, \ldots, x_N; \theta)$$
2. The Principle of Maximum Likelihood

Continuous variables
If the random variables $\{X_1, X_2, \ldots, X_N\}$ are i.i.d., then we have:
$$L_N(\theta; x_1, \ldots, x_N) = \prod_{i=1}^N f_X(x_i; \theta)$$
where $f_X(x_i; \theta)$ denotes the pdf of the marginal distribution of $X$ (or $X_i$, since all the variables have the same distribution).

The values of the parameters that maximize $L_N(\theta; x_1, \ldots, x_N)$ or its log are the maximum likelihood estimates, denoted $\widehat{\theta}(x)$.
Section 3

The Likelihood function

Definitions and Notations
3. The Likelihood Function

Objectives

1. Introduce the notations for an estimation problem that deals with a marginal distribution or a conditional distribution (model).
2. Define the likelihood and the log-likelihood functions.
3. Introduce the concept of conditional log-likelihood.
4. Propose various applications.
3. The Likelihood Function

Notations

Let us consider a continuous random variable $X$, with a pdf denoted $f_X(x; \theta)$, for $x \in \mathbb{R}$.

$\theta = (\theta_1 \, \ldots \, \theta_K)^\top$ is a $K \times 1$ vector of unknown parameters. We assume that $\theta \in \mathbb{R}^K$.

Let us consider a sample $\{X_1, \ldots, X_N\}$ of i.i.d. random variables with the same arbitrary distribution as $X$.

The realisation of $\{X_1, \ldots, X_N\}$ (the data set) is denoted $\{x_1, \ldots, x_N\}$, or $x$ for simplicity.
3. The Likelihood Function

Example (Normal distribution)
If $X \sim \mathcal{N}(m, \sigma^2)$, then:
$$f_X(z; \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(z-m)^2}{2\sigma^2}\right) \quad \forall z \in \mathbb{R}$$
with $K = 2$ and
$$\theta = \begin{pmatrix} m \\ \sigma^2 \end{pmatrix}$$
3. The Likelihood Function

Definition (Likelihood Function)
The likelihood function is defined to be:
$$L_N : \Theta \times \mathbb{R}^N \to \mathbb{R}^+$$
$$(\theta; x_1, \ldots, x_N) \mapsto L_N(\theta; x_1, \ldots, x_N) = \prod_{i=1}^N f_X(x_i; \theta)$$
3. The Likelihood Function

Definition (Log-Likelihood Function)
The log-likelihood function is defined to be:
$$\ell_N : \Theta \times \mathbb{R}^N \to \mathbb{R}$$
$$(\theta; x_1, \ldots, x_N) \mapsto \ell_N(\theta; x_1, \ldots, x_N) = \sum_{i=1}^N \ln f_X(x_i; \theta)$$
3. The Likelihood Function

Remark: the (log-)likelihood function depends on two types of arguments: the vector of parameters $\theta$ and the sample realisation $x_1, \ldots, x_N$.
3. The Likelihood Function

Notations: In the rest of the chapter, I will use the following alternative notations:
$$L_N(\theta; x) \equiv L(\theta; x_1, \ldots, x_N) \equiv L_N(\theta)$$
$$\ell_N(\theta; x) \equiv \ln L_N(\theta; x) \equiv \ln L(\theta; x_1, \ldots, x_N) \equiv \ln L_N(\theta)$$
3. The Likelihood Function

Example (Sample of Normal Variables)
We consider a sample $\{Y_1, \ldots, Y_N\}$ $\mathcal{N}$.i.d. $(m, \sigma^2)$ and denote the realisation by $\{y_1, \ldots, y_N\}$ or $y$. Let us define $\theta = (m \; \sigma^2)^\top$; then we have:
$$L_N(\theta; y) = \prod_{i=1}^N \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_i - m)^2}{2\sigma^2}\right) = \left(2\pi\sigma^2\right)^{-N/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - m)^2\right)$$
$$\ell_N(\theta; y) = -\frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - m)^2$$
3. The Likelihood Function

Definition (Likelihood of one observation)
We can also define the (log-)likelihood of one observation $x_i$:
$$L_i(\theta; x) = f_X(x_i; \theta) \quad \text{with} \quad L_N(\theta; x) = \prod_{i=1}^N L_i(\theta; x)$$
$$\ell_i(\theta; x) = \ln f_X(x_i; \theta) \quad \text{with} \quad \ell_N(\theta; x) = \sum_{i=1}^N \ell_i(\theta; x)$$
3. The Likelihood Function

Example (Exponential Distribution)
Suppose that $D_1, D_2, \ldots, D_N$ are i.i.d. positive random variables (durations, for instance), with $D_i \sim \text{Exp}(\theta)$, $\theta > 0$, and:
$$L_i(\theta; d_i) = f_D(d_i; \theta) = \frac{1}{\theta} \exp\left(-\frac{d_i}{\theta}\right)$$
$$\ell_i(\theta; d_i) = \ln(f_D(d_i; \theta)) = -\ln(\theta) - \frac{d_i}{\theta}$$
Then we have:
$$L_N(\theta; d) = \theta^{-N} \exp\left(-\frac{1}{\theta}\sum_{i=1}^N d_i\right)$$
$$\ell_N(\theta; d) = -N\ln(\theta) - \frac{1}{\theta}\sum_{i=1}^N d_i$$
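As a quick numerical check (my own sketch, not part of the original slides), one can simulate exponential durations and verify that the maximizer of $\ell_N(\theta; d)$ is the sample mean, the MLE derived later in this chapter. Python with scipy is assumed.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta0 = 2.5                                  # true parameter (illustrative)
d = rng.exponential(scale=theta0, size=500)   # simulated durations

def neg_loglik(theta):
    # l_N(theta; d) = -N ln(theta) - (1/theta) * sum(d_i), negated for minimization
    return len(d) * np.log(theta) + d.sum() / theta

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded")
print(res.x, d.mean())  # the two values coincide up to numerical tolerance
```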
3. The Likelihood Function

Remark: The (log-)likelihood and the maximum likelihood estimator are always based on an assumption (a bet?) about the distribution of $Y$:
$$Y_i \sim \text{Distribution with pdf } f_Y(y; \theta) \implies L_N(\theta; y) \text{ and } \ell_N(\theta; y)$$
In practice, we generally have no idea about the true distribution of $Y_i$...

A solution: the Quasi-Maximum Likelihood Estimator.
3. The Likelihood Function

Remark: We can also use the MLE to estimate the parameters of a model (with dependent and explanatory variables) such that:
$$y = g(x; \theta) + \varepsilon$$
where $\theta$ denotes the vector of parameters, $X$ a set of explanatory variables, $\varepsilon$ an error term, and $g(.)$ the link function.

In this case, we generally consider the conditional distribution of $Y$ given $X$, which is equivalent to the unconditional distribution of the error term $\varepsilon$:
$$Y \mid X \sim \mathcal{D}(\theta) \iff \varepsilon \sim \mathcal{D}$$
3. The Likelihood Function

Notations (model)

Let us consider two continuous random variables $Y$ and $X$.

We assume that $Y$ has a conditional distribution given $X = x$ with a pdf denoted $f_{Y|x}(y; \theta)$, for $y \in \mathbb{R}$.

$\theta = (\theta_1 \, \ldots \, \theta_K)^\top$ is a $K \times 1$ vector of unknown parameters. We assume that $\theta \in \mathbb{R}^K$.

Let us consider a sample $\{X_i, Y_i\}_{i=1}^N$ of i.i.d. random variables and a realisation $\{x_i, y_i\}_{i=1}^N$.
3. The Likelihood Function

Definition (Conditional likelihood function)
The (conditional) likelihood function is defined to be:
$$L_N(\theta; y \mid x) = \prod_{i=1}^N f_{Y|X}(y_i \mid x_i; \theta)$$
where $f_{Y|X}(y_i \mid x_i; \theta)$ denotes the conditional pdf of $Y_i$ given $X_i$.

Remark: The conditional likelihood function is the joint conditional density of the data, in which the unknown parameter is $\theta$.
3. The Likelihood Function

Definition (Conditional log-likelihood function)
The (conditional) log-likelihood function is defined to be:
$$\ell_N(\theta; y \mid x) = \sum_{i=1}^N \ln f_{Y|X}(y_i \mid x_i; \theta)$$
where $f_{Y|X}(y_i \mid x_i; \theta)$ denotes the conditional pdf of $Y_i$ given $X_i$.
3. The Likelihood Function

Remark: The conditional probability density function (pdf) can be denoted by:
$$f_{Y|X}(y \mid x; \theta) \equiv f_Y(y \mid X = x; \theta) \equiv f_Y(y \mid X = x)$$
3. The Likelihood Function
3. The Likelihood Function
Example (Linear Regression Model)
Consider the following linear regression model:
$$y_i = x_i^\top \beta + \varepsilon_i$$
where $X_i$ is a $K \times 1$ vector of random variables and $\beta = (\beta_1 \, \ldots \, \beta_K)^\top$ a $K \times 1$ vector of parameters. We assume that the $\varepsilon_i$ are i.i.d. with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Then, the conditional distribution of $Y_i$ given $X_i = x_i$ is:
$$Y_i \mid x_i \sim \mathcal{N}\left(x_i^\top \beta, \sigma^2\right)$$
$$L_i(\theta; y \mid x) = f_{Y|x}(y_i \mid x_i; \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\left(y_i - x_i^\top \beta\right)^2}{2\sigma^2}\right)$$
where $\theta = (\beta^\top \; \sigma^2)^\top$ is a $(K+1) \times 1$ vector.
3. The Likelihood Function

Example (Linear Regression Model, cont'd)
Then, if we consider an i.i.d. sample $\{y_i, x_i\}_{i=1}^N$, the corresponding conditional (log-)likelihood is defined to be:
$$L_N(\theta; y \mid x) = \prod_{i=1}^N f_{Y|X}(y_i \mid x_i; \theta) = \prod_{i=1}^N \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\left(y_i - x_i^\top \beta\right)^2}{2\sigma^2}\right)$$
$$= \left(2\pi\sigma^2\right)^{-N/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N \left(y_i - x_i^\top \beta\right)^2\right)$$
$$\ell_N(\theta; y \mid x) = -\frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^N \left(y_i - x_i^\top \beta\right)^2$$
3. The Likelihood Function

Remark: Given this principle, we can derive the (conditional) likelihood and the log-likelihood functions associated with a specific sample for any type of econometric model in which the conditional distribution of the dependent variable is known:

- Dichotomous models: probit, logit models, etc.
- Censored regression models: Tobit, etc.
- Time series models: AR, ARMA, VAR, etc.
- GARCH models
- ...
3. The Likelihood Function

Example (Probit/Logit Models)
Let us consider a dichotomous variable $Y_i$ such that $Y_i = 1$ if the firm $i$ is in default and $0$ otherwise. $X_i = (X_{i1} \ldots X_{iK})^\top$ denotes a $K \times 1$ vector of individual characteristics. We assume that the conditional probability of default is defined as:
$$\Pr(Y_i = 1 \mid X_i = x_i) = F\left(x_i^\top \beta\right)$$
where $\beta = (\beta_1 \, \ldots \, \beta_K)^\top$ is a vector of parameters and $F(.)$ is a cdf (cumulative distribution function):
$$Y_i = \begin{cases} 1 & \text{with probability } F\left(x_i^\top \beta\right) \\ 0 & \text{with probability } 1 - F\left(x_i^\top \beta\right) \end{cases}$$
3. The Likelihood Function

Remark: Given the choice of the link function F (.) we get a probit or a
logit model.

3. The Likelihood Function

Definition (Probit Model)
In a probit model, the conditional probability of the event $Y_i = 1$ is:
$$\Pr(Y_i = 1 \mid X_i = x_i) = \Phi\left(x_i^\top \beta\right) = \int_{-\infty}^{x_i^\top \beta} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right) du$$
where $\Phi(.)$ denotes the cdf of the standard normal distribution.
3. The Likelihood Function

Definition (Logit Model)
In a logit model, the conditional probability of the event $Y_i = 1$ is:
$$\Pr(Y_i = 1 \mid X_i = x_i) = \Lambda\left(x_i^\top \beta\right) = \frac{1}{1 + \exp\left(-x_i^\top \beta\right)}$$
where $\Lambda(.)$ denotes the cdf of the logistic distribution.
3. The Likelihood Function

Example (Probit/Logit Models, cont'd)
What is the (conditional) log-likelihood of the sample $\{y_i, x_i\}_{i=1}^N$? Whatever the choice of $F(.)$, the conditional distribution of $Y_i$ given $X_i = x_i$ is a Bernoulli distribution, since:
$$Y_i = \begin{cases} 1 & \text{with probability } F\left(x_i^\top \beta\right) \\ 0 & \text{with probability } 1 - F\left(x_i^\top \beta\right) \end{cases}$$
Then, for $\theta = \beta$, we have:
$$L_i(\theta; y \mid x) = f_{Y|x}(y_i \mid x_i; \theta) = \left[F\left(x_i^\top \beta\right)\right]^{y_i} \left[1 - F\left(x_i^\top \beta\right)\right]^{1 - y_i}$$
where $f_{Y|x}(y_i \mid x_i; \theta)$ denotes the conditional probability mass function (pmf) of $Y_i$.
3. The Likelihood Function

Example (Probit/Logit Models, cont'd)
The (conditional) likelihood and log-likelihood of the sample $\{y_i, x_i\}_{i=1}^N$ are defined to be:
$$L_N(\theta; y \mid x) = \prod_{i=1}^N f_{Y|x}(y_i \mid x_i; \theta) = \prod_{i=1}^N \left[F\left(x_i^\top \beta\right)\right]^{y_i} \left[1 - F\left(x_i^\top \beta\right)\right]^{1 - y_i}$$
$$\ell_N(\theta; y \mid x) = \sum_{i=1}^N y_i \ln\left[F\left(x_i^\top \beta\right)\right] + \sum_{i=1}^N (1 - y_i) \ln\left[1 - F\left(x_i^\top \beta\right)\right]$$
$$= \sum_{i: y_i = 1} \ln\left[F\left(x_i^\top \beta\right)\right] + \sum_{i: y_i = 0} \ln\left[1 - F\left(x_i^\top \beta\right)\right]$$
where $f_{Y|x}(y_i \mid x_i; \theta)$ denotes the conditional probability mass function (pmf) of $Y_i$.
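To make the formula concrete, here is a minimal Python sketch (my own illustration, not from the slides) that evaluates the probit log-likelihood at a candidate $\beta$ on simulated data; scipy's standard normal cdf plays the role of $F(.)$, and the data-generating process is an assumption for the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])       # x_i = (1, x_i2)'
beta_true = np.array([-0.5, 1.0])
y = (X @ beta_true + rng.normal(size=N) > 0).astype(float)  # latent-variable DGP

def probit_loglik(beta, X, y):
    # l_N = sum_i [ y_i ln F(x_i'b) + (1 - y_i) ln(1 - F(x_i'b)) ]
    p = norm.cdf(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(probit_loglik(beta_true, X, y))  # log-likelihood at the true parameters
```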
3. The Likelihood Function

Key Concepts

1. Likelihood (of a sample) function.
2. Log-likelihood (of a sample) function.
3. Conditional likelihood and log-likelihood function.
4. Likelihood and log-likelihood of one observation.
Section 4

Maximum Likelihood Estimator

4. Maximum Likelihood Estimator

Objectives

1. This section will be concerned with obtaining estimates of the parameters $\theta$.
2. We will define the maximum likelihood estimator (MLE).
3. Before we begin that study, we consider the question of whether estimation of the parameters is possible at all: the question of identification.
4. We will introduce the invariance principle.
4. Maximum Likelihood Estimator

Definition (Identification)
The parameter vector $\theta$ is identified (estimable) if, for any other parameter vector $\theta^* \neq \theta$ and for some data $y$, we have:
$$L_N(\theta; y) \neq L_N(\theta^*; y)$$
4. Maximum Likelihood Estimator

Example
Let us consider a latent (continuous and unobservable) variable $Y_i^*$ such that:
$$Y_i^* = X_i^\top \beta + \varepsilon_i$$
with $\beta = (\beta_1 \, \ldots \, \beta_K)^\top$, $X_i = (X_{i1} \ldots X_{iK})^\top$, and where the error term $\varepsilon_i$ is i.i.d. such that $\mathbb{E}(\varepsilon_i) = 0$ and $\mathbb{V}(\varepsilon_i) = \sigma^2$. The distribution of $\varepsilon_i$ is symmetric around $0$, and we denote by $G(.)$ the cdf of the standardized error term $\varepsilon_i / \sigma$. We assume that this cdf does not depend on $\beta$ or $\sigma$. Example: $\varepsilon_i / \sigma \sim \mathcal{N}(0, 1)$.
4. Maximum Likelihood Estimator

Example (cont'd)
We observe a dichotomous variable $Y_i$ such that:
$$Y_i = \begin{cases} 1 & \text{if } Y_i^* > 0 \\ 0 & \text{otherwise} \end{cases}$$
Problem: are the parameters $\theta = (\beta^\top \; \sigma^2)^\top$ identifiable?
4. Maximum Likelihood Estimator
Solution:
To answer this question, we have to compute the (log-)likelihood of the sample of observed data $\{y_i, x_i\}_{i=1}^N$. We have:
$$\Pr(Y_i = 1 \mid X_i = x_i) = \Pr(Y_i^* > 0 \mid X_i = x_i) = \Pr\left(\varepsilon_i > -x_i^\top \beta\right) = 1 - \Pr\left(\varepsilon_i \leq -x_i^\top \beta\right) = 1 - \Pr\left(\frac{\varepsilon_i}{\sigma} \leq -\frac{x_i^\top \beta}{\sigma}\right)$$
If we denote by $G(.)$ the cdf associated with the distribution of $\varepsilon_i / \sigma$, since this distribution is symmetric around $0$, then we have:
$$\Pr(Y_i = 1 \mid X_i = x_i) = G\left(\frac{x_i^\top \beta}{\sigma}\right)$$
4. Maximum Likelihood Estimator

Solution (cont'd):
For $\theta = (\beta^\top \; \sigma^2)^\top$, we have:
$$\ell_N(\theta; y \mid x) = \sum_{i=1}^N y_i \ln G\left(\frac{x_i^\top \beta}{\sigma}\right) + \sum_{i=1}^N (1 - y_i) \ln\left(1 - G\left(\frac{x_i^\top \beta}{\sigma}\right)\right)$$
This log-likelihood depends only on the ratio $\beta / \sigma$. So, for $\theta = (\beta^\top \; \sigma^2)^\top$ and $\theta^* = (k\beta^\top \; k^2\sigma^2)^\top$, with $k \neq 1$:
$$\ell_N(\theta; y \mid x) = \ell_N(\theta^*; y \mid x)$$
The parameters $\beta$ and $\sigma^2$ cannot be identified. We can only identify the ratio $\beta / \sigma$.
4. Maximum Likelihood Estimator

Remark:
In this latent variable model, only the ratio $\beta / \sigma$ can be identified, since:
$$\Pr(Y_i = 1 \mid X_i = x_i) = \Pr\left(\frac{\varepsilon_i}{\sigma} < \frac{x_i^\top \beta}{\sigma}\right) = G\left(\frac{x_i^\top \beta}{\sigma}\right)$$
The choice of a logit or probit model implies a normalisation on the variance of $\varepsilon_i / \sigma$ and then on $\sigma^2$:
$$\text{probit:} \quad \Pr(Y_i = 1 \mid X_i = x_i) = \Phi\left(x_i^\top \widetilde{\beta}\right) \quad \text{with } \widetilde{\beta} = \beta / \sigma, \;\; \mathbb{V}\left(\frac{\varepsilon_i}{\sigma}\right) = 1$$
4. Maximum Likelihood Estimator

Definition (Maximum Likelihood Estimator)
A maximum likelihood estimator $\widehat{\theta}$ of $\theta \in \Theta$ is a solution to the maximization problem:
$$\widehat{\theta} = \arg\max_{\theta \in \Theta} \ell_N(\theta; y \mid x)$$
or equivalently
$$\widehat{\theta} = \arg\max_{\theta \in \Theta} L_N(\theta; y \mid x)$$
4. Maximum Likelihood Estimator

Remarks
1. Do not confuse the maximum likelihood estimator $\widehat{\theta}$ (which is a random variable) and the maximum likelihood estimate $\widehat{\theta}(x)$, which corresponds to the realisation of $\widehat{\theta}$ on the sample $x$.
2. Generally, it is easier to maximise the log-likelihood than the likelihood (especially for the distributions that belong to the exponential family).
3. When we consider an unconditional likelihood, the MLE is defined by:
$$\widehat{\theta} = \arg\max_{\theta \in \Theta} \ell_N(\theta; x)$$
4. Maximum Likelihood Estimator

Definition (Likelihood equations)
Under suitable regularity conditions, a maximum likelihood estimator (MLE) of $\theta$ is defined to be the solution of the first-order conditions (FOC):
$$\left.\frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta}\right|_{\widehat{\theta}} = \underset{(K,1)}{0}$$
or
$$\left.\frac{\partial L_N(\theta; y \mid x)}{\partial \theta}\right|_{\widehat{\theta}} = \underset{(K,1)}{0}$$
These conditions are generally called the likelihood or log-likelihood equations.
4. Maximum Likelihood Estimator

Notations
The first derivative (gradient) of the (conditional) log-likelihood evaluated at the point $\widehat{\theta}$ satisfies:
$$\left.\frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta}\right|_{\widehat{\theta}} \equiv \frac{\partial \ell_N(\widehat{\theta}; y \mid x)}{\partial \theta} = g\left(\widehat{\theta}; y \mid x\right) = 0$$
4. Maximum Likelihood Estimator

Remark
The log-likelihood equations correspond to a linear/nonlinear system of $K$ equations with $K$ unknown parameters $\theta_1, \ldots, \theta_K$:
$$\left.\frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta}\right|_{\widehat{\theta}} = \begin{pmatrix} \left.\frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta_1}\right|_{\widehat{\theta}} \\ \vdots \\ \left.\frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta_K}\right|_{\widehat{\theta}} \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}$$
4. Maximum Likelihood Estimator

Definition (Second Order Conditions)
Second order condition (SOC) of the likelihood maximisation problem: the Hessian matrix evaluated at $\widehat{\theta}$ must be negative definite:
$$\left.\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta \, \partial \theta^\top}\right|_{\widehat{\theta}} \text{ is negative definite}$$
or
$$\left.\frac{\partial^2 L_N(\theta; y \mid x)}{\partial \theta \, \partial \theta^\top}\right|_{\widehat{\theta}} \text{ is negative definite}$$
4. Maximum Likelihood Estimator

Remark:
The Hessian matrix (realisation) is a $K \times K$ matrix:
$$\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta \, \partial \theta^\top} = \begin{pmatrix} \frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta_1^2} & \frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta_1 \partial \theta_K} \\ \frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta_2^2} & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta_K \partial \theta_1} & \cdots & \cdots & \frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta_K^2} \end{pmatrix}$$
4. Maximum Likelihood Estimator

Reminders

A negative definite matrix is a symmetric (Hermitian, if there are complex entries) matrix all of whose eigenvalues are negative.

The $n \times n$ Hermitian matrix $M$ is said to be negative definite if:
$$x^\top M x < 0$$
for all non-zero $x$ in $\mathbb{R}^n$.
4. Maximum Likelihood Estimator

Example (MLE problem with one parameter)
Let us consider a real-valued random variable $X$ with a pdf given by:
$$f_X\left(x; \sigma^2\right) = \frac{x}{\sigma^2} \exp\left(-\frac{x^2}{2\sigma^2}\right) \quad \forall x \in [0, +\infty)$$
where $\sigma^2$ is an unknown parameter. Let us consider a sample $\{X_1, \ldots, X_N\}$ of i.i.d. random variables with the same arbitrary distribution as $X$.

Problem: What is the maximum likelihood estimator (MLE) of $\sigma^2$?
4. Maximum Likelihood Estimator

Solution:
We have:
$$\ln f_X\left(x; \sigma^2\right) = -\frac{x^2}{2\sigma^2} + \ln(x) - \ln \sigma^2$$
So, the log-likelihood of the sample $\{x_1, \ldots, x_N\}$ is:
$$\ell_N\left(\sigma^2; x\right) = \sum_{i=1}^N \ln f_X\left(x_i; \sigma^2\right) = -\frac{1}{2\sigma^2}\sum_{i=1}^N x_i^2 + \sum_{i=1}^N \ln(x_i) - N \ln \sigma^2$$
4. Maximum Likelihood Estimator

Solution (cont'd):
The maximum likelihood estimator $\widehat{\sigma}^2$ of $\sigma^2 \in \mathbb{R}^+$ is a solution to the maximization problem:
$$\widehat{\sigma}^2 = \arg\max_{\sigma^2 \in \mathbb{R}^+} \ell_N\left(\sigma^2; x\right) = \arg\max_{\sigma^2 \in \mathbb{R}^+} \left(-\frac{1}{2\sigma^2}\sum_{i=1}^N x_i^2 + \sum_{i=1}^N \ln(x_i) - N \ln \sigma^2\right)$$
$$\frac{\partial \ell_N\left(\sigma^2; x\right)}{\partial \sigma^2} = \frac{1}{2\sigma^4}\sum_{i=1}^N x_i^2 - \frac{N}{\sigma^2}$$
FOC (log-likelihood equation):
$$\left.\frac{\partial \ell_N\left(\sigma^2; x\right)}{\partial \sigma^2}\right|_{\widehat{\sigma}^2} = \frac{1}{2\widehat{\sigma}^4}\sum_{i=1}^N x_i^2 - \frac{N}{\widehat{\sigma}^2} = 0 \iff \widehat{\sigma}^2 = \frac{1}{2N}\sum_{i=1}^N x_i^2$$
4. Maximum Likelihood Estimator
Solution (cont'd):
Check that $\widehat{\sigma}^2$ is a maximum:
$$\frac{\partial \ell_N\left(\sigma^2; x\right)}{\partial \sigma^2} = \frac{1}{2\sigma^4}\sum_{i=1}^N x_i^2 - \frac{N}{\sigma^2} \qquad \frac{\partial^2 \ell_N\left(\sigma^2; x\right)}{\partial (\sigma^2)^2} = -\frac{1}{\sigma^6}\sum_{i=1}^N x_i^2 + \frac{N}{\sigma^4}$$
SOC:
$$\left.\frac{\partial^2 \ell_N\left(\sigma^2; x\right)}{\partial (\sigma^2)^2}\right|_{\widehat{\sigma}^2} = -\frac{1}{\widehat{\sigma}^6}\sum_{i=1}^N x_i^2 + \frac{N}{\widehat{\sigma}^4} = -\frac{2N\widehat{\sigma}^2}{\widehat{\sigma}^6} + \frac{N}{\widehat{\sigma}^4} \quad \text{since } \widehat{\sigma}^2 = \frac{1}{2N}\sum_{i=1}^N x_i^2$$
$$= -\frac{N}{\widehat{\sigma}^4} < 0$$
4. Maximum Likelihood Estimator

Conclusion:
The maximum likelihood estimator (MLE) of the parameter $\sigma^2$ is defined by:
$$\widehat{\sigma}^2 = \frac{1}{2N}\sum_{i=1}^N X_i^2$$
The maximum likelihood estimate of the parameter $\sigma^2$ is equal to:
$$\widehat{\sigma}^2(x) = \frac{1}{2N}\sum_{i=1}^N x_i^2$$
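A quick numerical sanity check (my own sketch, not from the slides): the pdf above is the Rayleigh density with scale $\sigma$, so we can simulate from it with numpy and verify that $(1/2N)\sum x_i^2$ recovers $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2_true = 4.0                                           # true sigma^2 (illustrative)
x = rng.rayleigh(scale=np.sqrt(sigma2_true), size=100_000)  # f(x) = (x/s2) exp(-x^2/(2 s2))

sigma2_hat = (x ** 2).sum() / (2 * len(x))                  # MLE derived above
print(sigma2_hat)                                           # close to 4.0
```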
4. Maximum Likelihood Estimator

Example (Sample of normal variables)
We consider a sample $\{Y_1, \ldots, Y_N\}$ $\mathcal{N}$.i.d. $(m, \sigma^2)$. Problem: what are the MLEs of $m$ and $\sigma^2$?

Solution: Let us define $\theta = (m \; \sigma^2)^\top$:
$$\widehat{\theta} = \arg\max_{m \in \mathbb{R}, \, \sigma^2 \in \mathbb{R}^+} \ell_N(\theta; y)$$
with
$$\ell_N(\theta; y) = -\frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - m)^2$$
4. Maximum Likelihood Estimator

Solution (cont'd):
$$\ell_N(\theta; y) = -\frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - m)^2$$
The first derivative of the log-likelihood function is defined by:
$$\frac{\partial \ell_N(\theta; y)}{\partial \theta} = \begin{pmatrix} \frac{\partial \ell_N(\theta; y)}{\partial m} \\ \frac{\partial \ell_N(\theta; y)}{\partial \sigma^2} \end{pmatrix}$$
$$\frac{\partial \ell_N(\theta; y)}{\partial m} = \frac{1}{\sigma^2}\sum_{i=1}^N (y_i - m) \qquad \frac{\partial \ell_N(\theta; y)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N (y_i - m)^2$$
4. Maximum Likelihood Estimator

Solution (cont'd):
FOC (log-likelihood equations):
$$\left.\frac{\partial \ell_N(\theta; y)}{\partial \theta}\right|_{\widehat{\theta}} = \begin{pmatrix} \frac{1}{\widehat{\sigma}^2}\sum_{i=1}^N (y_i - \widehat{m}) \\ -\frac{N}{2\widehat{\sigma}^2} + \frac{1}{2\widehat{\sigma}^4}\sum_{i=1}^N (y_i - \widehat{m})^2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
So, the MLEs correspond to the empirical mean and variance:
$$\widehat{\theta} = \begin{pmatrix} \widehat{m} \\ \widehat{\sigma}^2 \end{pmatrix}$$
with
$$\widehat{m} = \frac{1}{N}\sum_{i=1}^N Y_i \qquad \widehat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N \left(Y_i - \overline{Y}_N\right)^2$$
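A minimal sketch (mine, not from the slides) verifying these closed forms against a generic numerical optimizer; scipy is assumed.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
y = rng.normal(loc=1.5, scale=2.0, size=1000)  # m = 1.5, sigma^2 = 4 (illustrative)

def neg_loglik(theta):
    m, s2 = theta
    # negative of l_N(m, s2; y)
    return 0.5 * len(y) * np.log(2 * np.pi * s2) + ((y - m) ** 2).sum() / (2 * s2)

res = minimize(neg_loglik, x0=[0.0, 1.0], bounds=[(None, None), (1e-8, None)])
print(res.x)                    # numerical MLE
print(y.mean(), y.var(ddof=0))  # closed-form MLE: empirical mean and biased variance
```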
4. Maximum Likelihood Estimator

Solution (cont'd):
$$\frac{\partial \ell_N(\theta; y)}{\partial m} = \frac{1}{\sigma^2}\sum_{i=1}^N (y_i - m) \qquad \frac{\partial \ell_N(\theta; y)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N (y_i - m)^2$$
The Hessian matrix (realization) is:
$$\frac{\partial^2 \ell_N(\theta; y)}{\partial \theta \, \partial \theta^\top} = \begin{pmatrix} \frac{\partial^2 \ell_N(\theta; y)}{\partial m^2} & \frac{\partial^2 \ell_N(\theta; y)}{\partial m \, \partial \sigma^2} \\ \frac{\partial^2 \ell_N(\theta; y)}{\partial \sigma^2 \, \partial m} & \frac{\partial^2 \ell_N(\theta; y)}{\partial (\sigma^2)^2} \end{pmatrix} = \begin{pmatrix} -\frac{N}{\sigma^2} & -\frac{1}{\sigma^4}\sum_{i=1}^N (y_i - m) \\ -\frac{1}{\sigma^4}\sum_{i=1}^N (y_i - m) & \frac{N}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^N (y_i - m)^2 \end{pmatrix}$$
4. Maximum Likelihood Estimator

Solution (cont'd): SOC
$$\left.\frac{\partial^2 \ell_N(\theta; y)}{\partial \theta \, \partial \theta^\top}\right|_{\widehat{\theta}} = \begin{pmatrix} -\frac{N}{\widehat{\sigma}^2} & -\frac{1}{\widehat{\sigma}^4}\sum_{i=1}^N (y_i - \widehat{m}) \\ -\frac{1}{\widehat{\sigma}^4}\sum_{i=1}^N (y_i - \widehat{m}) & \frac{N}{2\widehat{\sigma}^4} - \frac{1}{\widehat{\sigma}^6}\sum_{i=1}^N (y_i - \widehat{m})^2 \end{pmatrix}$$
since $N\widehat{m} = \sum_{i=1}^N y_i$ and $N\widehat{\sigma}^2 = \sum_{i=1}^N (y_i - \widehat{m})^2$:
$$\left.\frac{\partial^2 \ell_N(\theta; y)}{\partial \theta \, \partial \theta^\top}\right|_{\widehat{\theta}} = \begin{pmatrix} -\frac{N}{\widehat{\sigma}^2} & 0 \\ 0 & -\frac{N}{2\widehat{\sigma}^4} \end{pmatrix} \text{ is negative definite}$$
4. Maximum Likelihood Estimator

Example (Linear Regression Model)
Consider the linear regression model:
$$y_i = x_i^\top \beta + \varepsilon_i$$
where $x_i = (x_{i1} \ldots x_{iK})^\top$ and $\beta = (\beta_1 \, \ldots \, \beta_K)^\top$ are $K \times 1$ vectors. We assume that the $\varepsilon_i$ are $\mathcal{N}$.i.d. $(0, \sigma^2)$. Then, the (conditional) log-likelihood of the observations $(x_i, y_i)$ is given by:
$$\ell_N(\theta; y \mid x) = -\frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^N \left(y_i - x_i^\top \beta\right)^2$$
where $\theta = (\beta^\top \; \sigma^2)^\top$ is a $(K+1) \times 1$ vector. Question: what are the MLEs of $\beta$ and $\sigma^2$?
4. Maximum Likelihood Estimator

Notation 1: The derivative of a scalar $y$ with respect to a $K \times 1$ vector $x = (x_1 \ldots x_K)^\top$ is the $K \times 1$ vector:
$$\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y}{\partial x_1} \\ \vdots \\ \frac{\partial y}{\partial x_K} \end{pmatrix}$$

Notation 2: If $x$ and $\beta$ are two $K \times 1$ vectors, then:
$$\frac{\partial \left(x^\top \beta\right)}{\partial \beta} = \underset{(K,1)}{x}$$
4. Maximum Likelihood Estimator

Solution
$$\widehat{\theta} = \arg\max_{\beta \in \mathbb{R}^K, \, \sigma^2 \in \mathbb{R}^+} \left(-\frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^N \left(y_i - x_i^\top \beta\right)^2\right)$$
The first derivative of the log-likelihood function is a $(K+1) \times 1$ vector:
$$\underset{(K+1,1)}{\frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta}} = \begin{pmatrix} \frac{\partial \ell_N(\theta; y \mid x)}{\partial \beta} \\ \frac{\partial \ell_N(\theta; y \mid x)}{\partial \sigma^2} \end{pmatrix} = \begin{pmatrix} \frac{\partial \ell_N(\theta; y \mid x)}{\partial \beta_1} \\ \vdots \\ \frac{\partial \ell_N(\theta; y \mid x)}{\partial \beta_K} \\ \frac{\partial \ell_N(\theta; y \mid x)}{\partial \sigma^2} \end{pmatrix}$$
4. Maximum Likelihood Estimator

Solution (cont'd)
The two blocks of the first derivative are:
$$\underset{(K,1)}{\frac{\partial \ell_N(\theta; y \mid x)}{\partial \beta}} = \frac{1}{\sigma^2}\sum_{i=1}^N \underset{(K,1)}{x_i} \underset{(1,1)}{\left(y_i - x_i^\top \beta\right)}$$
$$\underset{(1,1)}{\frac{\partial \ell_N(\theta; y \mid x)}{\partial \sigma^2}} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \underset{(1,1)}{\left(y_i - x_i^\top \beta\right)^2}$$
4. Maximum Likelihood Estimator

Solution (cont'd):
FOC (log-likelihood equations):
$$\left.\frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta}\right|_{\widehat{\theta}} = \begin{pmatrix} \frac{1}{\widehat{\sigma}^2}\sum_{i=1}^N x_i \left(y_i - x_i^\top \widehat{\beta}\right) \\ -\frac{N}{2\widehat{\sigma}^2} + \frac{1}{2\widehat{\sigma}^4}\sum_{i=1}^N \left(y_i - x_i^\top \widehat{\beta}\right)^2 \end{pmatrix} = \begin{pmatrix} 0_K \\ 0 \end{pmatrix}$$
So, the MLE is defined by:
$$\widehat{\theta} = \begin{pmatrix} \widehat{\beta} \\ \widehat{\sigma}^2 \end{pmatrix}$$
$$\widehat{\beta} = \left(\sum_{i=1}^N X_i X_i^\top\right)^{-1} \left(\sum_{i=1}^N X_i Y_i\right) \qquad \widehat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N \left(Y_i - X_i^\top \widehat{\beta}\right)^2$$
4. Maximum Likelihood Estimator

Solution (cont'd):
The Hessian is a $(K+1) \times (K+1)$ matrix:
$$\underset{(K+1, K+1)}{\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta \, \partial \theta^\top}} = \begin{pmatrix} \underset{(K,K)}{\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \beta \, \partial \beta^\top}} & \underset{(K,1)}{\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \beta \, \partial \sigma^2}} \\ \underset{(1,K)}{\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \sigma^2 \, \partial \beta^\top}} & \underset{(1,1)}{\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial (\sigma^2)^2}} \end{pmatrix}$$
4. Maximum Likelihood Estimator
Solution (cont'd):
$$\frac{\partial \ell_N(\theta; y \mid x)}{\partial \beta} = \frac{1}{\sigma^2}\sum_{i=1}^N x_i \left(y_i - x_i^\top \beta\right) \qquad \frac{\partial \ell_N(\theta; y \mid x)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \left(y_i - x_i^\top \beta\right)^2$$
So, the Hessian matrix (realization) is equal to:
$$\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta \, \partial \theta^\top} = \begin{pmatrix} -\frac{1}{\sigma^2}\sum_{i=1}^N x_i x_i^\top & -\frac{1}{\sigma^4}\sum_{i=1}^N x_i \left(y_i - x_i^\top \beta\right) \\ -\frac{1}{\sigma^4}\sum_{i=1}^N x_i^\top \left(y_i - x_i^\top \beta\right) & \frac{N}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^N \left(y_i - x_i^\top \beta\right)^2 \end{pmatrix}$$
4. Maximum Likelihood Estimator

Solution (cont'd):
Second Order Conditions (SOC):
$$\left.\frac{\partial^2 \ell_N(\theta)}{\partial \theta \, \partial \theta^\top}\right|_{\widehat{\theta}} = \begin{pmatrix} -\frac{1}{\widehat{\sigma}^2}\sum_{i=1}^N x_i x_i^\top & -\frac{1}{\widehat{\sigma}^4}\sum_{i=1}^N x_i \left(y_i - x_i^\top \widehat{\beta}\right) \\ -\frac{1}{\widehat{\sigma}^4}\sum_{i=1}^N x_i^\top \left(y_i - x_i^\top \widehat{\beta}\right) & \frac{N}{2\widehat{\sigma}^4} - \frac{1}{\widehat{\sigma}^6}\sum_{i=1}^N \left(y_i - x_i^\top \widehat{\beta}\right)^2 \end{pmatrix}$$
Since $\sum_{i=1}^N x_i \left(y_i - x_i^\top \widehat{\beta}\right) = 0$ (FOC) and $N\widehat{\sigma}^2 = \sum_{i=1}^N \left(y_i - x_i^\top \widehat{\beta}\right)^2$:
$$\left.\frac{\partial^2 \ell_N(\theta)}{\partial \theta \, \partial \theta^\top}\right|_{\widehat{\theta}} = \begin{pmatrix} -\frac{1}{\widehat{\sigma}^2}\sum_{i=1}^N x_i x_i^\top & 0 \\ 0 & -\frac{N}{2\widehat{\sigma}^4} \end{pmatrix}$$
4. Maximum Likelihood Estimator

Solution (cont'd):
Second Order Conditions (SOC):
$$\left.\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta \, \partial \theta^\top}\right|_{\widehat{\theta}} = \begin{pmatrix} -\frac{1}{\widehat{\sigma}^2}\sum_{i=1}^N x_i x_i^\top & 0 \\ 0 & -\frac{N}{2\widehat{\sigma}^4} \end{pmatrix} \text{ is negative definite}$$
Since $\sum_{i=1}^N x_i x_i^\top$ is positive definite (assumption), the Hessian matrix is negative definite and $\widehat{\theta}$ is the MLE of the parameters $\theta$.
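A compact sketch (mine, not from the slides) of these closed-form MLEs on simulated data; $\widehat{\beta}$ coincides with OLS and $\widehat{\sigma}^2$ is the biased (divide-by-$N$) residual variance.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # K = 3 regressors
beta_true = np.array([1.0, -2.0, 0.5])                      # illustrative values
y = X @ beta_true + rng.normal(scale=1.5, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (sum x_i x_i')^{-1} sum x_i y_i
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / N                # MLE: biased residual variance
print(beta_hat, sigma2_hat)
```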
4. Maximum Likelihood Estimator

Theorem (Equivariance or Invariance Principle)
Under suitable regularity conditions, the maximum likelihood estimator of a function $g(.)$ of the parameter $\theta$ is $g\left(\widehat{\theta}\right)$, where $\widehat{\theta}$ is the maximum likelihood estimator of $\theta$.
4. Maximum Likelihood Estimator

Invariance Principle

The MLE is invariant to one-to-one transformations of $\theta$. Any transformation that is not one-to-one either renders the model inestimable, if it is one-to-many, or imposes restrictions, if it is many-to-one.

For the practitioner, this result is extremely useful. For example, when a parameter appears in a likelihood function in the form $1/\theta_j$, it is usually worthwhile to reparameterize the model in terms of $\gamma_j = 1/\theta_j$.

Example: Olsen (1978) and the reparametrisation of the likelihood function of the Tobit model.
4. Maximum Likelihood Estimator

Example (Invariance Principle)
Suppose that the normal log-likelihood in the previous example is parameterized in terms of the precision parameter, $\gamma^2 = 1/\sigma^2$. The log-likelihood
$$\ell_N\left(m, \sigma^2; y\right) = -\frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - m)^2$$
becomes
$$\ell_N\left(m, \gamma^2; y\right) = \frac{N}{2}\ln\gamma^2 - \frac{N}{2}\ln(2\pi) - \frac{\gamma^2}{2}\sum_{i=1}^N (y_i - m)^2$$
4. Maximum Likelihood Estimator

Example (Invariance Principle, cont'd)
The MLE for $m$ is clearly still $\overline{Y}_N$. But the likelihood equation for $\gamma^2$ is now:
$$\frac{\partial \ell_N\left(m, \gamma^2; y\right)}{\partial \gamma^2} = \frac{N}{2\gamma^2} - \frac{1}{2}\sum_{i=1}^N (y_i - m)^2$$
and the MLE for $\gamma^2$ is now defined by:
$$\widehat{\gamma}^2 = \frac{N}{\sum_{i=1}^N \left(Y_i - \widehat{m}\right)^2} = \frac{1}{\widehat{\sigma}^2}$$
as expected.
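A tiny check of the invariance principle (my own illustration, not from the slides): maximizing the precision-parameterized log-likelihood numerically returns $1/\widehat{\sigma}^2$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
y = rng.normal(loc=0.0, scale=2.0, size=1000)
m_hat = y.mean()

def neg_loglik_gamma2(g2):
    # (N/2) ln g2 - (g2/2) sum (y_i - m_hat)^2, constant -(N/2) ln(2*pi) dropped
    return -(0.5 * len(y) * np.log(g2) - 0.5 * g2 * ((y - m_hat) ** 2).sum())

g2_hat = minimize_scalar(neg_loglik_gamma2, bounds=(1e-6, 10), method="bounded").x
print(g2_hat, 1 / y.var(ddof=0))  # equal, illustrating invariance
```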
Key Concepts

1. Identification.
2. Maximum likelihood estimator.
3. Maximum likelihood estimate.
4. Log-likelihood equations.
5. Equivariance or invariance principle.
6. Gradient vector and Hessian matrix (deterministic elements).
Section 5

Score, Hessian and Fisher Information

5. Score, Hessian and Fisher Information

Objectives
We aim at introducing the following concepts:

1. Score vector and gradient.
2. Hessian matrix.
3. Fisher information matrix of the sample.
4. Fisher information matrix of one observation, for marginal and conditional distributions.
5. Average Fisher information matrix of one observation.
5. Score, Hessian and Fisher Information

Definition (Score Vector)
The (conditional) score vector is a $K \times 1$ vector defined by:
$$\underset{(K,1)}{s_N(\theta; Y \mid x)} \equiv s(\theta) = \frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta}$$
5. Score, Hessian and Fisher Information

Remarks:

The score $s_N(\theta; Y \mid x)$ is a vector of random elements since it depends on the random variables $Y_1, \ldots, Y_N$.

For an unconditional log-likelihood $\ell_N(\theta; x)$, the score is denoted by $s_N(\theta; X) = \partial \ell_N(\theta; X) / \partial \theta$.

The score is a $K \times 1$ vector such that:
$$s_N(\theta; Y \mid x) = \begin{pmatrix} \frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta_1} \\ \vdots \\ \frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta_K} \end{pmatrix}$$
5. Score, Hessian and Fisher Information

Corollary
By definition, the score vector satisfies:
$$\mathbb{E}_\theta\left(s_N(\theta; Y \mid x)\right) = 0_K$$
where $\mathbb{E}_\theta$ means the expectation with respect to the conditional distribution $Y \mid X = x$.
5. Score, Hessian and Fisher Information

Remark: If we consider a variable $X$ with a pdf $f_X(x; \theta)$, $\forall x \in \mathbb{R}$, then $\mathbb{E}_\theta(.)$ means the expectation with respect to the distribution of $X$:
$$\mathbb{E}_\theta\left(s_N(\theta; X)\right) = \int s_N(\theta; x)\, f_X(x; \theta)\, dx = 0$$

Remark: If we consider a variable $Y$ with a conditional pdf $f_{Y|x}(y; \theta)$, $\forall y \in \mathbb{R}$, then $\mathbb{E}_\theta(.)$ means the expectation with respect to the distribution of $Y \mid X = x$:
$$\mathbb{E}_\theta\left(s_N(\theta; Y \mid x)\right) = \int s_N(\theta; y \mid x)\, f_{Y|x}(y; \theta)\, dy = 0$$
5. Score, Hessian and Fisher Information

Proof.
If we consider a variable $X$ with a pdf $f_X(x; \theta)$, $\forall x \in \mathbb{R}$, then:
$$\mathbb{E}_\theta\left(s_N(\theta; X)\right) = \int s_N(\theta; x)\, f_X(x; \theta)\, dx = N \int \frac{\partial \ln f_X(x; \theta)}{\partial \theta}\, f_X(x; \theta)\, dx$$
$$= N \int \frac{1}{f_X(x; \theta)}\frac{\partial f_X(x; \theta)}{\partial \theta}\, f_X(x; \theta)\, dx = N \int \frac{\partial f_X(x; \theta)}{\partial \theta}\, dx = N\,\frac{\partial}{\partial \theta}\int f_X(x; \theta)\, dx = N\,\frac{\partial 1}{\partial \theta} = 0$$
5. Score, Hessian and Fisher Information

Example (Exponential Distribution)
Suppose that $D_1, D_2, \ldots, D_N$ are i.i.d. positive random variables with $D_i \sim \text{Exp}(\theta)$ and $\mathbb{E}(D_i) = \theta > 0$:
$$f_D(d; \theta) = \frac{1}{\theta}\exp\left(-\frac{d}{\theta}\right), \quad \forall d \in \mathbb{R}^+$$
$$\ell_N(\theta; d) = -N\ln(\theta) - \frac{1}{\theta}\sum_{i=1}^N d_i$$
The score (scalar) is equal to:
$$s_N(\theta; D) = -\frac{N}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^N D_i$$
5. Score, Hessian and Fisher Information

Example (Exponential Distribution, cont'd)
By definition:
$$\mathbb{E}_\theta\left(s_N(\theta; D)\right) = \mathbb{E}_\theta\left(-\frac{N}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^N D_i\right) = -\frac{N}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^N \mathbb{E}_\theta(D_i) = -\frac{N}{\theta} + \frac{N}{\theta} = 0$$
5. Score, Hessian and Fisher Information

Example (Linear Regression Model)
Let us consider the previous linear regression model $y_i = x_i^\top \beta + \varepsilon_i$. The score is defined by:
$$s_N(\theta; Y \mid x) = \begin{pmatrix} \frac{1}{\sigma^2}\sum_{i=1}^N x_i \left(Y_i - x_i^\top \beta\right) \\ -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \left(Y_i - x_i^\top \beta\right)^2 \end{pmatrix}$$
Then, we have:
$$\mathbb{E}_\theta\left(s_N(\theta; Y \mid x)\right) = \mathbb{E}_\theta\begin{pmatrix} \frac{1}{\sigma^2}\sum_{i=1}^N x_i \left(Y_i - x_i^\top \beta\right) \\ -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \left(Y_i - x_i^\top \beta\right)^2 \end{pmatrix}$$
5. Score, Hessian and Fisher Information

Example (Linear Regression Model, cont'd)
We know that $\mathbb{E}_\theta(Y_i \mid x) = x_i^\top \beta$. So, we have:
$$\mathbb{E}_\theta\left(\frac{1}{\sigma^2}\sum_{i=1}^N x_i \left(Y_i - x_i^\top \beta\right)\right) = \frac{1}{\sigma^2}\sum_{i=1}^N x_i \left(\mathbb{E}_\theta(Y_i \mid x) - x_i^\top \beta\right) = \frac{1}{\sigma^2}\sum_{i=1}^N x_i \left(x_i^\top \beta - x_i^\top \beta\right) = 0_K$$
5. Score, Hessian and Fisher Information

Example (Linear Regression Model, cont'd)
$$\mathbb{E}_\theta\left(-\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \left(Y_i - x_i^\top \beta\right)^2\right) = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \mathbb{E}_\theta\left(\left(Y_i - x_i^\top \beta\right)^2\right)$$
$$= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \mathbb{E}_\theta\left(\left(Y_i - \mathbb{E}_\theta(Y_i \mid x)\right)^2\right) = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \mathbb{V}_\theta(Y_i \mid x) = -\frac{N}{2\sigma^2} + \frac{N\sigma^2}{2\sigma^4} = 0$$
5. Score, Hessian and Fisher Information

Definition (Gradient)
The gradient vector associated with the log-likelihood function is a $K \times 1$ vector defined by:
$$\underset{(K,1)}{g_N(\theta; y \mid x)} \equiv g(\theta) = \frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta}$$
5. Score, Hessian and Fisher Information

Remarks

1. The gradient $g_N(\theta; y \mid x)$ is a vector of deterministic entries since it depends on the realisation $y_1, \ldots, y_N$.
2. For an unconditional log-likelihood, the gradient is defined by $g_N(\theta; x) = \partial \ell_N(\theta; x) / \partial \theta$.
3. The gradient is a $K \times 1$ vector such that:
$$g_N(\theta; y \mid x) = \begin{pmatrix} \frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta_1} \\ \vdots \\ \frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta_K} \end{pmatrix}$$
5. Score, Hessian and Fisher Information

Corollary
By definition of the FOC, the gradient vector satisfies:
$$g_N\left(\widehat{\theta}; y \mid x\right) = 0_K$$
where $\widehat{\theta} = \widehat{\theta}(x)$ is the maximum likelihood estimate of $\theta$.
5. Score, Hessian and Fisher Information

Example (Linear regression model)
In the linear regression model, the gradient associated with the log-likelihood function is defined to be:
$$g_N(\theta; y \mid x) = \begin{pmatrix} \frac{1}{\sigma^2}\sum_{i=1}^N x_i \left(y_i - x_i^\top \beta\right) \\ -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \left(y_i - x_i^\top \beta\right)^2 \end{pmatrix}$$
Given the FOC, we have:
$$g_N\left(\widehat{\theta}; y \mid x\right) = \begin{pmatrix} \frac{1}{\widehat{\sigma}^2}\sum_{i=1}^N x_i \left(y_i - x_i^\top \widehat{\beta}\right) \\ -\frac{N}{2\widehat{\sigma}^2} + \frac{1}{2\widehat{\sigma}^4}\sum_{i=1}^N \left(y_i - x_i^\top \widehat{\beta}\right)^2 \end{pmatrix} = \begin{pmatrix} 0_K \\ 0 \end{pmatrix}$$
5. Score, Hessian and Fisher Information

Definition (Hessian Matrix)
The Hessian matrix (deterministic) is defined to be:
$$H_N(\theta; y \mid x) = \frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta \, \partial \theta^\top}$$
Remark: the matrix $\frac{\partial^2 \ell_N(\theta; Y \mid x)}{\partial \theta \, \partial \theta^\top}$ is also called the Hessian matrix, but do not confuse the two matrices $\frac{\partial^2 \ell_N(\theta; Y \mid x)}{\partial \theta \, \partial \theta^\top}$ (random) and $\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta \, \partial \theta^\top}$ (deterministic).
5. Score, Hessian and Fisher Information

Random variable: the score vector $\partial \ell_N(\theta; Y \mid x) / \partial \theta$ and the Hessian matrix $\partial^2 \ell_N(\theta; Y \mid x) / \partial \theta \, \partial \theta^\top$.

Constant (realisation): the gradient vector $\partial \ell_N(\theta; y \mid x) / \partial \theta$ and the Hessian matrix $\partial^2 \ell_N(\theta; y \mid x) / \partial \theta \, \partial \theta^\top$.
5. Score, Hessian and Fisher Information

Definition (Fisher Information Matrix)
The (conditional) Fisher information matrix associated with the sample $\{Y_1, \ldots, Y_N\}$ is the variance-covariance matrix of the score vector:
$$\underset{(K,K)}{\mathcal{I}_N(\theta)} = \mathbb{V}_\theta\left(s_N(\theta; Y \mid x)\right)$$
or equivalently:
$$\mathcal{I}_N(\theta) = \mathbb{V}_\theta\left(\frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta}\right)$$
where $\mathbb{V}_\theta$ means the variance with respect to the conditional distribution $Y \mid X$.
5. Score, Hessian and Fisher Information

Corollary
Since by definition $\mathbb{E}_\theta\left(s_N(\theta; Y \mid x)\right) = 0$, an alternative definition of the Fisher information matrix of the sample $\{Y_1, \ldots, Y_N\}$ is:
$$\underset{(K,K)}{\mathcal{I}_N(\theta)} = \mathbb{E}_\theta\left(\underset{(K,1)}{s_N(\theta; Y \mid x)}\;\underset{(1,K)}{s_N(\theta; Y \mid x)^\top}\right)$$
5. Score, Hessian and Fisher Information

Definition (Fisher Information Matrix)
The (conditional) Fisher information matrix of the sample $\{Y_1, \ldots, Y_N\}$ is also given by:
$$\mathcal{I}_N(\theta) = -\mathbb{E}_\theta\left(\frac{\partial^2 \ell_N(\theta; Y \mid x)}{\partial \theta \, \partial \theta^\top}\right) = -\mathbb{E}_\theta\left(H_N(\theta; Y \mid x)\right)$$
5. Score, Hessian and Fisher Information

Definition (Fisher Information Matrix, summary)
The (conditional) Fisher information matrix of the sample $\{Y_1, \ldots, Y_N\}$ can alternatively be defined by:
$$\mathcal{I}_N(\theta) = \mathbb{V}_\theta\left(s_N(\theta; Y \mid x)\right)$$
$$\mathcal{I}_N(\theta) = \mathbb{E}_\theta\left(s_N(\theta; Y \mid x)\, s_N(\theta; Y \mid x)^\top\right)$$
$$\mathcal{I}_N(\theta) = -\mathbb{E}_\theta\left(H_N(\theta; Y \mid x)\right)$$
where $\mathbb{E}_\theta$ and $\mathbb{V}_\theta$ denote the mean and the variance with respect to the conditional distribution $Y \mid X$, and where $s_N(\theta; Y \mid x)$ denotes the score vector and $H_N(\theta; Y \mid x)$ the Hessian matrix.
5. Score, Hessian and Fisher Information

Definition (Fisher Information Matrix, summary)
The (conditional) Fisher information matrix of the sample $\{Y_1, \ldots, Y_N\}$ can alternatively be defined by:
$$\mathcal{I}_N(\theta) = \mathbb{V}_\theta\left(\frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta}\right)$$
$$\mathcal{I}_N(\theta) = \mathbb{E}_\theta\left(\frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta}\left(\frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta}\right)^\top\right)$$
$$\mathcal{I}_N(\theta) = -\mathbb{E}_\theta\left(\frac{\partial^2 \ell_N(\theta; Y \mid x)}{\partial \theta \, \partial \theta^\top}\right)$$
where $\mathbb{E}_\theta$ and $\mathbb{V}_\theta$ denote the mean and the variance with respect to the conditional distribution $Y \mid X$.
5. Score, Hessian and Fisher Information

Remarks
1. There are three equivalent definitions of the Fisher information matrix and, as a consequence, three different consistent estimates of the Fisher information matrix (see later).
2. The Fisher information matrix associated with the sample $\{Y_1, \ldots, Y_N\}$ can also be defined from the Fisher information matrix for observation $i$.
5. Score, Hessian and Fisher Information

Definition (Fisher Information Matrix)
The (conditional) Fisher information matrix associated with the $i$-th individual can be defined by:
$$\mathcal{I}_i(\theta) = \mathbb{V}_\theta\left(\frac{\partial \ell_i(\theta; Y_i \mid x_i)}{\partial \theta}\right)$$
$$\mathcal{I}_i(\theta) = \mathbb{E}_\theta\left(\frac{\partial \ell_i(\theta; Y_i \mid x_i)}{\partial \theta}\left(\frac{\partial \ell_i(\theta; Y_i \mid x_i)}{\partial \theta}\right)^\top\right)$$
$$\mathcal{I}_i(\theta) = -\mathbb{E}_\theta\left(\frac{\partial^2 \ell_i(\theta; Y_i \mid x_i)}{\partial \theta \, \partial \theta^\top}\right)$$
where $\mathbb{E}_\theta$ and $\mathbb{V}_\theta$ denote the expectation and variance with respect to the true conditional distribution $Y_i \mid X_i$.
5. Score, Hessian and Fisher Information

Definition (Fisher Information Matrix)
The (conditional) Fisher information matrix associated with the $i$-th individual can alternatively be defined by:
$$\mathcal{I}_i(\theta) = \mathbb{V}_\theta\left(s_i(\theta; Y_i \mid x_i)\right)$$
$$\mathcal{I}_i(\theta) = \mathbb{E}_\theta\left(s_i(\theta; Y_i \mid x_i)\, s_i(\theta; Y_i \mid x_i)^\top\right)$$
$$\mathcal{I}_i(\theta) = -\mathbb{E}_\theta\left(H_i(\theta; Y_i \mid x_i)\right)$$
where $\mathbb{E}_\theta$ and $\mathbb{V}_\theta$ denote the expectation and variance with respect to the true conditional distribution $Y_i \mid X_i$.
5. Score, Hessian and Fisher Information

Theorem
The Fisher information matrix associated with the sample $\{Y_1, \ldots, Y_N\}$ is equal to the sum of the individual Fisher information matrices:
$$\mathcal{I}_N(\theta) = \sum_{i=1}^N \mathcal{I}_i(\theta)$$
5. Score, Hessian and Fisher Information

Remark:
1. In the case of a marginal log-likelihood, the Fisher information matrix associated with the variable $X_i$ is the same for all observations $i$:
$$\mathcal{I}_i(\theta) = \mathcal{I}(\theta) \quad \forall i = 1, \ldots, N$$
2. In the case of a conditional log-likelihood, the Fisher information matrix associated with the variable $Y_i$ given $X_i = x_i$ depends on the observation $i$:
$$\mathcal{I}_i(\theta) \neq \mathcal{I}_j(\theta) \quad \forall i \neq j$$
5. Score, Hessian and Fisher Information

Example (Exponential marginal distribution)
Suppose that $D_1, D_2, \ldots, D_N$ are i.i.d. positive random variables with $D_i \sim \text{Exp}(\theta)$:
$$\mathbb{E}(D_i) = \theta \qquad \mathbb{V}(D_i) = \theta^2$$
$$f_D(d; \theta) = \frac{1}{\theta}\exp\left(-\frac{d}{\theta}\right), \quad \forall d \in \mathbb{R}^+$$
$$\ell_i(\theta; d_i) = -\ln(\theta) - \frac{d_i}{\theta}$$
Question: what is the Fisher information number (scalar) associated with $D_i$?
5. Score, Hessian and Fisher Information

Solution
$$\ell_i(\theta; d_i) = -\ln(\theta) - \frac{d_i}{\theta}$$
The score of the observation $D_i$ is defined by:
$$s_i(\theta; D_i) = \frac{\partial \ell_i(\theta; D_i)}{\partial \theta} = -\frac{1}{\theta} + \frac{D_i}{\theta^2}$$
Let us use the three definitions of the information quantity $\mathcal{I}_i(\theta)$:
$$\mathcal{I}_i(\theta) = \mathbb{V}_\theta\left(s_i(\theta; D_i)\right) = \mathbb{E}_\theta\left(s_i(\theta; D_i)^2\right) = -\mathbb{E}_\theta\left(H_i(\theta; D_i)\right)$$
5. Score, Hessian and Fisher Information

Solution, cont'd
$$s_i(\theta; D_i) = -\frac{1}{\theta} + \frac{D_i}{\theta^2}$$
First definition:
$$\mathcal{I}_i(\theta) = \mathbb{V}_\theta\left(s_i(\theta; D_i)\right) = \mathbb{V}_\theta\left(-\frac{1}{\theta} + \frac{D_i}{\theta^2}\right) = \frac{1}{\theta^4}\mathbb{V}_\theta(D_i) = \frac{1}{\theta^2}$$
Conclusion: $\mathcal{I}_i(\theta) = \mathcal{I}(\theta)$ does not depend on $i$.
5. Score, Hessian and Fisher Information
Solution, cont'd
$$s_i(\theta; D_i) = -\frac{1}{\theta} + \frac{D_i}{\theta^2}$$
Second definition:
$$\mathcal{I}_i(\theta) = \mathbb{E}_\theta\left(s_i(\theta; D_i)^2\right) = \mathbb{E}_\theta\left(\left(-\frac{1}{\theta} + \frac{D_i}{\theta^2}\right)^2\right) = \mathbb{V}_\theta\left(-\frac{1}{\theta} + \frac{D_i}{\theta^2}\right) \quad \text{since } \mathbb{E}_\theta\left(-\frac{1}{\theta} + \frac{D_i}{\theta^2}\right) = 0$$
$$= \frac{1}{\theta^2}$$
Conclusion: $\mathcal{I}_i(\theta) = \mathcal{I}(\theta)$ does not depend on $i$.
5. Score, Hessian and Fisher Information
Solution, cont'd
$$s_i(\theta; D_i) = -\frac{1}{\theta} + \frac{D_i}{\theta^2} \qquad H_i(\theta; D_i) = \frac{\partial^2 \ell_i(\theta; D_i)}{\partial \theta^2} = \frac{1}{\theta^2} - \frac{2D_i}{\theta^3}$$
Third definition:
$$\mathcal{I}_i(\theta) = -\mathbb{E}_\theta\left(H_i(\theta; D_i)\right) = -\mathbb{E}_\theta\left(\frac{1}{\theta^2} - \frac{2D_i}{\theta^3}\right) = -\frac{1}{\theta^2} + \frac{2}{\theta^3}\mathbb{E}_\theta(D_i) = -\frac{1}{\theta^2} + \frac{2}{\theta^2} = \frac{1}{\theta^2}$$
Conclusion: $\mathcal{I}_i(\theta) = \mathcal{I}(\theta)$ does not depend on $i$.
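A Monte Carlo sketch (mine, not from the slides) illustrating the equivalence of the three definitions for this exponential example: the variance of the score, the mean squared score, and minus the mean Hessian all approach $1/\theta^2$.

```python
import numpy as np

rng = np.random.default_rng(6)
theta = 2.0
D = rng.exponential(scale=theta, size=1_000_000)  # draws of D_i

s = -1 / theta + D / theta**2          # score s_i(theta; D_i)
H = 1 / theta**2 - 2 * D / theta**3    # Hessian H_i(theta; D_i)

print(s.var(), (s**2).mean(), -H.mean(), 1 / theta**2)  # all approx. 0.25
```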
5. Score, Hessian and Fisher Information

Example (Linear regression model)
We have shown that:
$$\frac{\partial^2 \ell_i(\theta; Y_i \mid x_i)}{\partial \theta \, \partial \theta^\top} = \begin{pmatrix} -\frac{1}{\sigma^2} x_i x_i^\top & -\frac{1}{\sigma^4} x_i \left(Y_i - x_i^\top \beta\right) \\ -\frac{1}{\sigma^4} x_i^\top \left(Y_i - x_i^\top \beta\right) & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}\left(Y_i - x_i^\top \beta\right)^2 \end{pmatrix}$$
Question: what is the Fisher information matrix associated with the observation $Y_i$?
5. Score, Hessian and Fisher Information

Solution
The information matrix is then defined by:
$$\underset{(K+1, K+1)}{\mathcal{I}_i(\theta)} = -\mathbb{E}_\theta\left(\frac{\partial^2 \ell_i(\theta; Y_i \mid x_i)}{\partial \theta \, \partial \theta^\top}\right) = -\mathbb{E}_\theta\left(H_i(\theta; Y_i \mid x_i)\right)$$
where $\mathbb{E}_\theta$ means the expectation with respect to the conditional distribution $Y_i \mid X_i = x_i$:
$$\mathcal{I}_i(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} x_i x_i^\top & \frac{1}{\sigma^4} x_i \left(\mathbb{E}_\theta(Y_i) - x_i^\top \beta\right) \\ \frac{1}{\sigma^4} x_i^\top \left(\mathbb{E}_\theta(Y_i) - x_i^\top \beta\right) & -\frac{1}{2\sigma^4} + \frac{1}{\sigma^6}\mathbb{E}_\theta\left(\left(Y_i - x_i^\top \beta\right)^2\right) \end{pmatrix}$$
5. Score, Hessian and Fisher Information

Solution (cont'd)
Given that $\mathbb{E}_\theta(Y_i) = x_i^\top \beta$ and $\mathbb{E}_\theta\left(\left(Y_i - x_i^\top \beta\right)^2\right) = \sigma^2$, we have:
$$\mathcal{I}_i(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} x_i x_i^\top & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}$$
Conclusion: $\mathcal{I}_i(\theta)$ depends on $x_i$, and $\mathcal{I}_i(\theta) \neq \mathcal{I}_j(\theta)$ for $i \neq j$.
5. Score, Hessian and Fisher Information

Definition (Average Fisher information matrix)
For a conditional model, the average Fisher information matrix for one observation is defined by:
$$\mathcal{I}(\theta) = \mathbb{E}_X\left(\mathcal{I}_i(\theta)\right)$$
where $\mathbb{E}_X$ denotes the expectation with respect to $X$ (the conditioning variable).
5. Score, Hessian and Fisher Information

Summary: For a conditional model (and only for a conditional model), we have:
$$\mathcal{I}(\theta) = \mathbb{E}_X\left(\mathbb{V}_\theta\left(\frac{\partial \ell_i(\theta; Y_i \mid X_i)}{\partial \theta}\right)\right) = \mathbb{E}_X\left(\mathbb{V}_\theta\left(s_i(\theta; Y_i \mid X_i)\right)\right)$$
$$\mathcal{I}(\theta) = \mathbb{E}_X\left(\mathbb{E}_\theta\left(\frac{\partial \ell_i(\theta; Y_i \mid X_i)}{\partial \theta}\left(\frac{\partial \ell_i(\theta; Y_i \mid X_i)}{\partial \theta}\right)^\top\right)\right) = \mathbb{E}_X\left(\mathbb{E}_\theta\left(s_i(\theta; Y_i \mid X_i)\, s_i(\theta; Y_i \mid X_i)^\top\right)\right)$$
$$\mathcal{I}(\theta) = -\mathbb{E}_X\left(\mathbb{E}_\theta\left(\frac{\partial^2 \ell_i(\theta; Y_i \mid X_i)}{\partial \theta \, \partial \theta^\top}\right)\right) = -\mathbb{E}_X\left(\mathbb{E}_\theta\left(H_i(\theta; Y_i \mid X_i)\right)\right)$$
5. Score, Hessian and Fisher Information

Summary: For a marginal distribution, we have:
$$\mathcal{I}(\theta) = \mathbb{V}_\theta\left(\frac{\partial \ell_i(\theta; Y_i)}{\partial \theta}\right) = \mathbb{V}_\theta\left(s_i(\theta; Y_i)\right)$$
$$\mathcal{I}(\theta) = \mathbb{E}_\theta\left(\frac{\partial \ell_i(\theta; Y_i)}{\partial \theta}\left(\frac{\partial \ell_i(\theta; Y_i)}{\partial \theta}\right)^\top\right) = \mathbb{E}_\theta\left(s_i(\theta; Y_i)\, s_i(\theta; Y_i)^\top\right)$$
$$\mathcal{I}(\theta) = -\mathbb{E}_\theta\left(\frac{\partial^2 \ell_i(\theta; Y_i)}{\partial \theta \, \partial \theta^\top}\right) = -\mathbb{E}_\theta\left(H_i(\theta; Y_i)\right)$$
5. Score, Hessian and Fisher Information

Example (Linear Regression Model)
In the linear model, the individual Fisher information matrix is equal to:
$$\mathcal{I}_i(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} x_i x_i^\top & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}$$
and the average Fisher information matrix for one observation is defined by:
$$\mathcal{I}(\theta) = \mathbb{E}_X\left(\mathcal{I}_i(\theta)\right) = \begin{pmatrix} \frac{1}{\sigma^2}\mathbb{E}_X\left(X_i X_i^\top\right) & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}$$
5. Score, Hessian and Fisher Information
Summary: in order to compute the average information matrix $\mathcal{I}(\theta)$ for one observation:

Step 1: Compute the Hessian matrix or the score vector for one observation:
$$H_i(\theta; Y_i \mid x_i) = \frac{\partial^2 \ell_i(\theta; Y_i \mid x_i)}{\partial \theta \, \partial \theta^\top} \qquad s_i(\theta; Y_i \mid x_i) = \frac{\partial \ell_i(\theta; Y_i \mid x_i)}{\partial \theta}$$

Step 2: Take the expectation (or the variance) with respect to the conditional distribution $Y_i \mid X_i = x_i$:
$$\mathcal{I}_i(\theta) = \mathbb{V}_\theta\left(s_i(\theta; Y_i \mid x_i)\right) = -\mathbb{E}_\theta\left(H_i(\theta; Y_i \mid x_i)\right)$$

Step 3: Take the expectation with respect to the conditioning variable $X$:
$$\mathcal{I}(\theta) = \mathbb{E}_X\left(\mathcal{I}_i(\theta)\right)$$
5. Score, Hessian and Fisher Information

Theorem
In a sampling model (with i.i.d. observations), one has:
$$\mathcal{I}_N(\theta) = N\, \mathcal{I}(\theta)$$
5. Score, Hessian and Fisher Information

Marginal distribution: pdf $f_{X_i}(\theta; x_i)$; score vector $s_i(\theta; X_i)$; Hessian matrix $H_i(\theta; X_i)$; information matrix $\mathcal{I}_i(\theta) = \mathcal{I}(\theta)$; average information matrix $\mathcal{I}(\theta) = \mathcal{I}_i(\theta)$.

Conditional distribution (model): pdf $f_{Y_i | x_i}(\theta; y \mid x)$; score vector $s_i(\theta; Y_i \mid x_i)$; Hessian matrix $H_i(\theta; Y_i \mid x_i)$; information matrix $\mathcal{I}_i(\theta)$; average information matrix $\mathcal{I}(\theta) = \mathbb{E}_X(\mathcal{I}_i(\theta))$.

with $\mathcal{I}_i(\theta) = \mathbb{V}_\theta\left(s_i(\theta; Y_i \mid x_i)\right) = \mathbb{E}_\theta\left(s_i(\theta; Y_i \mid x_i)\, s_i(\theta; Y_i \mid x_i)^\top\right) = -\mathbb{E}_\theta\left(H_i(\theta; Y_i \mid x_i)\right)$
5. Score, Hessian and Fisher Information

How to estimate the average Fisher information matrix?

This matrix is particularly important, since we will see that it corresponds to the asymptotic variance-covariance matrix of the MLE.

Let us assume that we have a consistent estimator $\widehat{\theta}$ of the parameter $\theta$; how do we estimate the average Fisher information matrix?
5. Score, Hessian and Fisher Information

Definition (Estimators of the average Fisher Information Matrix)
If $\widehat{\theta}$ converges in probability to $\theta_0$ (the true value), then:
$$\widehat{\mathcal{I}}\left(\widehat{\theta}\right) = \frac{1}{N}\sum_{i=1}^N \mathcal{I}_i\left(\widehat{\theta}\right)$$
$$\widehat{\mathcal{I}}\left(\widehat{\theta}\right) = \frac{1}{N}\sum_{i=1}^N \left.\frac{\partial \ell_i(\theta; y_i \mid x_i)}{\partial \theta}\right|_{\widehat{\theta}} \left(\left.\frac{\partial \ell_i(\theta; y_i \mid x_i)}{\partial \theta}\right|_{\widehat{\theta}}\right)^\top$$
$$\widehat{\mathcal{I}}\left(\widehat{\theta}\right) = -\frac{1}{N}\sum_{i=1}^N \left.\frac{\partial^2 \ell_i(\theta; y_i \mid x_i)}{\partial \theta \, \partial \theta^\top}\right|_{\widehat{\theta}}$$
are three consistent estimators of the average Fisher information matrix.
5. Score, Hessian and Fisher Information

1. The first estimator corresponds to the average of the $N$ Fisher information matrices (for $Y_1, \ldots, Y_N$) evaluated at the estimated value $\widehat{\theta}$. This estimator will rarely be available in practice.

2. The second estimator corresponds to the average of the outer products of the individual score vectors evaluated at $\widehat{\theta}$. It is known as the BHHH (Berndt, Hall, Hall, and Hausman, 1974) estimator or OPG estimator (outer product of gradients):
$$\widehat{\mathcal{I}}\left(\widehat{\theta}\right) = \frac{1}{N}\sum_{i=1}^N g_i\left(\widehat{\theta}; y_i \mid x_i\right) g_i\left(\widehat{\theta}; y_i \mid x_i\right)^\top$$
5. Score, Hessian and Fisher Information

3. The third estimator corresponds to minus the average of the Hessian matrices evaluated at $\widehat{\theta}$:
$$\widehat{\mathcal{I}}\left(\widehat{\theta}\right) = -\frac{1}{N}\sum_{i=1}^N H_i\left(\widehat{\theta}; y_i \mid x_i\right)$$
5. Score, Hessian and Fisher Information

Problem
These three estimators are asymptotically equivalent, but they could give different results in finite samples. Available evidence suggests that in small or moderately sized samples, the Hessian is preferable (Greene, 2007). However, in most cases, the BHHH estimator will be the easiest to compute.
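To fix ideas, here is a small sketch (my own, not from the slides) computing the OPG/BHHH and Hessian-based estimates of the average information for the exponential example above; both are close to $1/\widehat{\theta}^2$, the exact $\mathcal{I}(\widehat{\theta})$.

```python
import numpy as np

rng = np.random.default_rng(7)
d = rng.exponential(scale=2.0, size=2000)
theta_hat = d.mean()                           # MLE for the exponential model

g = -1 / theta_hat + d / theta_hat**2          # individual gradients at theta_hat
H = 1 / theta_hat**2 - 2 * d / theta_hat**3    # individual Hessians at theta_hat

I_opg = np.mean(g**2)                          # BHHH / outer product of gradients
I_hess = -np.mean(H)                           # minus the average Hessian
print(I_opg, I_hess, 1 / theta_hat**2)
```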
5. Score, Hessian and Fisher Information

Example (CAPM)
The empirical analogue of the CAPM is given by:
$$\widetilde{r}_{it} = \alpha_i + \beta_i\, \widetilde{r}_{mt} + \varepsilon_t$$
$$\widetilde{r}_{it} = r_{it} - r_{ft} \;\text{(excess return of security } i \text{ at time } t\text{)} \qquad \widetilde{r}_{mt} = r_{mt} - r_{ft} \;\text{(market excess return at time } t\text{)}$$
where $\varepsilon_t$ is an i.i.d. error term with:
$$\mathbb{E}(\varepsilon_t) = 0 \qquad \mathbb{V}(\varepsilon_t) = \sigma^2 \qquad \mathbb{E}(\varepsilon_t \mid \widetilde{r}_{mt}) = 0$$
5. Score, Hessian and Fisher Information
Example (CAPM, cont'd)
Data (data file: capm.xls): Microsoft, SP500 and Tbill (closing prices) from 11/1/1993 to 04/03/2003.

[Figure: scatter plot of RMSFT against RSP500, and time series plot of RSP500 and RMSFT.]
5. Score, Hessian and Fisher Information

Example (CAPM, cont'd)
We consider the CAPM model rewritten as follows:
$$\widetilde{r}_{it} = x_t^\top \beta + \varepsilon_t \qquad t = 1, \ldots, T$$
where $x_t = (1 \; \widetilde{r}_{mt})^\top$ is a $2 \times 1$ vector of random variables, $\theta = (\alpha_i : \beta_i : \sigma^2)^\top = (\beta^\top : \sigma^2)^\top$ is a $3 \times 1$ vector of parameters, and where the error term $\varepsilon_t$ satisfies $\mathbb{E}(\varepsilon_t) = 0$, $\mathbb{V}(\varepsilon_t) = \sigma^2$ and $\mathbb{E}(\varepsilon_t \mid \widetilde{r}_{mt}) = 0$.
5. Score, Hessian and Fisher Information

Example (CAPM, cont'd)
Question: Compute three alternative estimators of the asymptotic variance-covariance matrix of the MLE estimator $\widehat{\theta} = (\widehat{\alpha}_i \; \widehat{\beta}_i \; \widehat{\sigma}^2)^\top$:
$$\widehat{\beta} = \begin{pmatrix} \widehat{\alpha}_i \\ \widehat{\beta}_i \end{pmatrix} = \left(\sum_{t=1}^T x_t x_t^\top\right)^{-1} \left(\sum_{t=1}^T x_t\, \widetilde{r}_{it}\right) \qquad \widehat{\sigma}^2 = \frac{1}{T}\sum_{t=1}^T \left(\widetilde{r}_{it} - x_t^\top \widehat{\beta}\right)^2$$
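Since the capm.xls data are not reproduced here, the sketch below (my own, with simulated excess returns as a stand-in) computes the MLE together with the BHHH and Hessian-based estimates of the asymptotic variance-covariance matrix $\frac{1}{T}\widehat{\mathcal{I}}^{-1}(\widehat{\theta})$.

```python
import numpy as np

rng = np.random.default_rng(8)
T = 2000
rm = rng.normal(0.0, 0.01, size=T)            # simulated market excess return
X = np.column_stack([np.ones(T), rm])         # x_t = (1, r_mt)'
ri = X @ np.array([0.001, 1.2]) + rng.normal(0.0, 0.02, size=T)

b_hat = np.linalg.solve(X.T @ X, X.T @ ri)    # (alpha_hat, beta_hat)
e = ri - X @ b_hat
s2_hat = e @ e / T                            # sigma^2 MLE

# Individual gradients at theta_hat: d l_t / d beta and d l_t / d sigma^2
g = np.column_stack([X * (e / s2_hat)[:, None],
                     -1 / (2 * s2_hat) + e**2 / (2 * s2_hat**2)])
V_bhhh = np.linalg.inv(g.T @ g)               # [sum_t g_t g_t']^{-1} = (1/T) I_hat^{-1}

# Hessian-based estimate, block-diagonal at the MLE (see the SOC derivation)
I_hess = np.zeros((3, 3))
I_hess[:2, :2] = (X.T @ X) / (T * s2_hat)
I_hess[2, 2] = 1 / (2 * s2_hat**2)
V_hess = np.linalg.inv(I_hess) / T
print(np.diag(V_bhhh), np.diag(V_hess))       # similar asymptotic variances
```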
5. Score, Hessian and Fisher Information
Solution
The ML estimator is defined by:

$$\hat{\theta} = \underset{\beta \in \mathbb{R}^2, \, \sigma^2 \in \mathbb{R}_+}{\arg\max} \; -\frac{T}{2} \ln\left( \sigma^2 \right) - \frac{T}{2} \ln(2\pi) - \frac{1}{2\sigma^2} \sum_{t=1}^{T} \left( \tilde{r}_{it} - x_t^{\top} \beta \right)^2$$

The problem is regular, so we have:

$$\sqrt{T} \left( \hat{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, I^{-1}(\theta_0) \right)$$

or equivalently:

$$\hat{\theta} \overset{asy}{\sim} \mathcal{N}\left( \theta_0, \frac{1}{T} I^{-1}(\theta_0) \right)$$

The asymptotic variance-covariance matrix of $\hat{\theta}$ is:

$$V_{asy}\left( \hat{\theta} \right) = \frac{1}{T} I^{-1}(\theta_0)$$

5. Score, Hessian and Fisher Information
Solution (cont'd)
First estimator: the information matrix at time $t$ is defined by (third definition):

$$I_t(\theta) = -\mathbb{E}\left( H_t\left( \theta; \tilde{R}_{it} \mid x_t \right) \right) = -\mathbb{E}\left( \frac{\partial^2 \ell_t(\theta; \tilde{R}_{it} \mid x_t)}{\partial \theta \, \partial \theta^{\top}} \right)$$

where $\mathbb{E}$ means the expectation with respect to the conditional distribution of $\tilde{R}_{it}$ given $X_t = x_t$:

$$I_t(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} x_t x_t^{\top} & \frac{1}{\sigma^4} x_t \, \mathbb{E}\left( \tilde{R}_{it} - x_t^{\top} \beta \right) \\ \frac{1}{\sigma^4} x_t^{\top} \, \mathbb{E}\left( \tilde{R}_{it} - x_t^{\top} \beta \right) & -\frac{1}{2\sigma^4} + \frac{1}{\sigma^6} \mathbb{E}\left( \tilde{R}_{it} - x_t^{\top} \beta \right)^2 \end{pmatrix}$$
5. Score, Hessian and Fisher Information

Solution (cont'd)
First estimator:

$$I_t(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} x_t x_t^{\top} & \frac{1}{\sigma^4} x_t \, \mathbb{E}\left( \tilde{R}_{it} - x_t^{\top} \beta \right) \\ \frac{1}{\sigma^4} x_t^{\top} \, \mathbb{E}\left( \tilde{R}_{it} - x_t^{\top} \beta \right) & -\frac{1}{2\sigma^4} + \frac{1}{\sigma^6} \mathbb{E}\left( \tilde{R}_{it} - x_t^{\top} \beta \right)^2 \end{pmatrix}$$

Given that $\mathbb{E}\left( \tilde{R}_{it} \right) = x_t^{\top} \beta$ and $\mathbb{E}\left( \tilde{R}_{it} - x_t^{\top} \beta \right)^2 = \sigma^2$, we then have:

$$I_t(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} x_t x_t^{\top} & 0_{2 \times 1} \\ 0_{1 \times 2} & \frac{1}{2\sigma^4} \end{pmatrix}$$

5. Score, Hessian and Fisher Information

Solution (cont'd)
First estimator:

$$I_t(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} x_t x_t^{\top} & 0_{2 \times 1} \\ 0_{1 \times 2} & \frac{1}{2\sigma^4} \end{pmatrix}$$

An estimator of the asymptotic variance-covariance matrix of $\hat{\theta}$ is given by:

$$\hat{V}_{asy}\left( \hat{\theta} \right) = \frac{1}{T} \, \hat{I}^{-1}\left( \hat{\theta} \right)$$

$$\hat{I}\left( \hat{\theta} \right) = \frac{1}{T} \sum_{t=1}^{T} I_t\left( \hat{\theta} \right) = \begin{pmatrix} \frac{1}{\hat{\sigma}^2} \, \frac{1}{T} \sum_{t=1}^{T} x_t x_t^{\top} & 0_{2 \times 1} \\ 0_{1 \times 2} & \frac{1}{2\hat{\sigma}^4} \end{pmatrix}$$

5. Score, Hessian and Fisher Information

Solution (cont'd)
Second definition (BHHH):

$$\hat{V}_{asy}\left( \hat{\theta} \right) = \frac{1}{T} \, \hat{I}^{-1}\left( \hat{\theta} \right)$$

$$\hat{I}\left( \hat{\theta} \right) = \frac{1}{T} \sum_{t=1}^{T} \left. \frac{\partial \ell_t(\theta; \tilde{r}_{it} \mid x_t)}{\partial \theta} \right|_{\hat{\theta}} \left. \frac{\partial \ell_t(\theta; \tilde{r}_{it} \mid x_t)}{\partial \theta^{\top}} \right|_{\hat{\theta}}$$

with

$$\left. \frac{\partial \ell_t(\theta; \tilde{r}_{it} \mid x_t)}{\partial \theta} \right|_{\hat{\theta}} = \begin{pmatrix} \frac{1}{\hat{\sigma}^2} x_t \left( \tilde{r}_{it} - x_t^{\top} \hat{\beta} \right) \\ -\frac{1}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4} \left( \tilde{r}_{it} - x_t^{\top} \hat{\beta} \right)^2 \end{pmatrix} = \begin{pmatrix} \frac{1}{\hat{\sigma}^2} x_t \hat{\varepsilon}_t \\ -\frac{1}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4} \hat{\varepsilon}_t^2 \end{pmatrix}$$

where $\hat{\varepsilon}_t = \tilde{r}_{it} - x_t^{\top} \hat{\beta}$.

5. Score, Hessian and Fisher Information

Solution (cont'd)
Second definition (BHHH):

$$\left. \frac{\partial \ell_t(\theta; \tilde{r}_{it} \mid x_t)}{\partial \theta} \frac{\partial \ell_t(\theta; \tilde{r}_{it} \mid x_t)}{\partial \theta^{\top}} \right|_{\hat{\theta}} = \begin{pmatrix} \frac{1}{\hat{\sigma}^4} x_t x_t^{\top} \hat{\varepsilon}_t^2 & \frac{1}{\hat{\sigma}^2} x_t \hat{\varepsilon}_t \left( -\frac{1}{2\hat{\sigma}^2} + \frac{\hat{\varepsilon}_t^2}{2\hat{\sigma}^4} \right) \\ \frac{1}{\hat{\sigma}^2} x_t^{\top} \hat{\varepsilon}_t \left( -\frac{1}{2\hat{\sigma}^2} + \frac{\hat{\varepsilon}_t^2}{2\hat{\sigma}^4} \right) & \left( -\frac{1}{2\hat{\sigma}^2} + \frac{\hat{\varepsilon}_t^2}{2\hat{\sigma}^4} \right)^2 \end{pmatrix}$$

5. Score, Hessian and Fisher Information

Solution (cont'd)
Second definition (BHHH): so we have

$$\hat{V}_{asy}\left( \hat{\theta} \right) = \frac{1}{T} \, \hat{I}^{-1}\left( \hat{\theta} \right)$$

with

$$\hat{I}\left( \hat{\theta} \right) = \frac{1}{T} \sum_{t=1}^{T} \begin{pmatrix} \frac{1}{\hat{\sigma}^4} x_t x_t^{\top} \hat{\varepsilon}_t^2 & \frac{1}{\hat{\sigma}^2} x_t \hat{\varepsilon}_t \left( -\frac{1}{2\hat{\sigma}^2} + \frac{\hat{\varepsilon}_t^2}{2\hat{\sigma}^4} \right) \\ \frac{1}{\hat{\sigma}^2} x_t^{\top} \hat{\varepsilon}_t \left( -\frac{1}{2\hat{\sigma}^2} + \frac{\hat{\varepsilon}_t^2}{2\hat{\sigma}^4} \right) & \left( -\frac{1}{2\hat{\sigma}^2} + \frac{\hat{\varepsilon}_t^2}{2\hat{\sigma}^4} \right)^2 \end{pmatrix}$$

5. Score, Hessian and Fisher Information

Solution (cont'd)
Third definition (inverse of the Hessian): we know that

$$\hat{V}_{asy}\left( \hat{\theta} \right) = \frac{1}{T} \, \hat{I}^{-1}\left( \hat{\theta} \right) \qquad \hat{I}\left( \hat{\theta} \right) = -\frac{1}{T} \sum_{t=1}^{T} H_t\left( \hat{\theta}; \tilde{r}_{it} \mid x_t \right)$$

$$H_t\left( \hat{\theta}; \tilde{r}_{it} \mid x_t \right) = \begin{pmatrix} -\frac{1}{\hat{\sigma}^2} x_t x_t^{\top} & -\frac{1}{\hat{\sigma}^4} x_t \left( \tilde{r}_{it} - x_t^{\top} \hat{\beta} \right) \\ -\frac{1}{\hat{\sigma}^4} x_t^{\top} \left( \tilde{r}_{it} - x_t^{\top} \hat{\beta} \right) & \frac{1}{2\hat{\sigma}^4} - \frac{1}{\hat{\sigma}^6} \left( \tilde{r}_{it} - x_t^{\top} \hat{\beta} \right)^2 \end{pmatrix}$$

5. Score, Hessian and Fisher Information

Solution (cont'd)
Third definition (inverse of the Hessian):

$$H_t\left( \hat{\theta}; \tilde{r}_{it} \mid x_t \right) = \begin{pmatrix} -\frac{1}{\hat{\sigma}^2} x_t x_t^{\top} & -\frac{1}{\hat{\sigma}^4} x_t \left( \tilde{r}_{it} - x_t^{\top} \hat{\beta} \right) \\ -\frac{1}{\hat{\sigma}^4} x_t^{\top} \left( \tilde{r}_{it} - x_t^{\top} \hat{\beta} \right) & \frac{1}{2\hat{\sigma}^4} - \frac{1}{\hat{\sigma}^6} \left( \tilde{r}_{it} - x_t^{\top} \hat{\beta} \right)^2 \end{pmatrix}$$

Given the FOC (log-likelihood equations), $\sum_{t=1}^{T} x_t \left( \tilde{r}_{it} - x_t^{\top} \hat{\beta} \right) = 0$ and $\sum_{t=1}^{T} \left( \tilde{r}_{it} - x_t^{\top} \hat{\beta} \right)^2 = T \hat{\sigma}^2$, so:

$$-\frac{1}{T} \sum_{t=1}^{T} H_t\left( \hat{\theta}; \tilde{r}_{it} \mid x_t \right) = \begin{pmatrix} \frac{1}{\hat{\sigma}^2} \, \frac{1}{T} \sum_{t=1}^{T} x_t x_t^{\top} & 0_{2 \times 1} \\ 0_{1 \times 2} & \frac{1}{2\hat{\sigma}^4} \end{pmatrix}$$

5. Score, Hessian and Fisher Information

Solution (cont'd)
Third definition (inverse of the Hessian): so, in this case, the third estimator of $\hat{I}(\hat{\theta})$ coincides with the first one:

$$\hat{V}_{asy}\left( \hat{\theta} \right) = \frac{1}{T} \, \hat{I}^{-1}\left( \hat{\theta} \right)$$

$$\hat{I}\left( \hat{\theta} \right) = -\frac{1}{T} \sum_{t=1}^{T} H_t\left( \hat{\theta}; \tilde{r}_{it} \mid x_t \right) = \begin{pmatrix} \frac{1}{\hat{\sigma}^2} \, \frac{1}{T} \sum_{t=1}^{T} x_t x_t^{\top} & 0_{2 \times 1} \\ 0_{1 \times 2} & \frac{1}{2\hat{\sigma}^4} \end{pmatrix}$$

5. Score, Hessian and Fisher Information
Solution (cont'd)
These three estimators of the asymptotic variance-covariance matrix are asymptotically equivalent, but can differ markedly in finite samples:

$$\hat{V}_{asy}\left( \hat{\theta} \right) = \frac{1}{T} \, \hat{I}^{-1}\left( \hat{\theta} \right)$$

with

$$\hat{I}\left( \hat{\theta} \right) = \frac{1}{T} \sum_{t=1}^{T} I_t\left( \hat{\theta} \right)$$

$$\hat{I}\left( \hat{\theta} \right) = \frac{1}{T} \sum_{t=1}^{T} \left. \frac{\partial \ell_t(\theta; \tilde{r}_{it} \mid x_t)}{\partial \theta} \right|_{\hat{\theta}} \left. \frac{\partial \ell_t(\theta; \tilde{r}_{it} \mid x_t)}{\partial \theta^{\top}} \right|_{\hat{\theta}}$$

$$\hat{I}\left( \hat{\theta} \right) = \frac{1}{T} \sum_{t=1}^{T} \left( -H_t\left( \hat{\theta}; \tilde{r}_{it} \mid x_t \right) \right)$$
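The following sketch (Python with NumPy assumed; the simulated data, parameter values, and variable names are illustrative, not taken from the slides) computes the first (information-based) and the second (BHHH) estimates for a simulated version of this regression; by the result above, the Hessian-based estimate coincides with the first one here:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 250
r_m = rng.normal(0.005, 0.04, T)                   # simulated market excess returns
r_i = 0.001 + 1.2 * r_m + rng.normal(0, 0.02, T)   # simulated security excess returns

X = np.column_stack([np.ones(T), r_m])             # x_t = (1, r_mt)'
beta_hat = np.linalg.solve(X.T @ X, X.T @ r_i)
res = r_i - X @ beta_hat                           # residuals
s2 = np.mean(res**2)                               # ML estimate of sigma^2

# First estimator: block-diagonal average information matrix
I1 = np.zeros((3, 3))
I1[:2, :2] = (X.T @ X) / (T * s2)
I1[2, 2] = 1.0 / (2.0 * s2**2)

# Second estimator (BHHH): average outer product of individual scores
scores = np.column_stack([X * (res / s2)[:, None],
                          -1.0 / (2.0 * s2) + res**2 / (2.0 * s2**2)])
I2 = scores.T @ scores / T

V1 = np.linalg.inv(I1) / T                         # asymptotic variance estimates
V2 = np.linalg.inv(I2) / T
print(np.sqrt(np.diag(V1)))                        # standard errors, first estimator
print(np.sqrt(np.diag(V2)))                        # standard errors, BHHH
```

Comparing the two printed rows of standard errors gives a feel for how far the BHHH estimate can drift from the information-based one at moderate T.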

Key Concepts

1 Gradient and Hessian Matrix (deterministic elements).


2 Score Vector (random elements).
3 Hessian Matrix (random elements).
4 Fisher information matrix associated to the sample.
5 (Average) Fisher information matrix for one observation.

Section 6

Properties of Maximum Likelihood Estimators

6. Properties of Maximum Likelihood Estimators

Objectives
Is the MLE a good estimator? Under which conditions is the MLE unbiased, consistent, and the BUE (Best Unbiased Estimator)? => regularity conditions

Is the MLE consistent?

Is the MLE optimal or efficient?

What is the asymptotic distribution of the MLE? The magic of the MLE...

6. Properties of Maximum Likelihood Estimators

Definition (Regularity conditions)
Greene (2007) identifies three regularity conditions:
R1 The first three derivatives of $\ln f_X(\theta; x_i)$ with respect to $\theta$ are continuous and finite for almost all $x_i$ and for all $\theta$. This condition ensures the existence of a certain Taylor series approximation and the finite variance of the derivatives of $\ell_i(\theta; x_i)$.
R2 The conditions necessary to obtain the expectations of the first and second derivatives of $\ln f_X(\theta; X_i)$ are met.
R3 For all values of $\theta$, $\left| \partial^3 \ln f_X(\theta; x_i) / \partial \theta_i \, \partial \theta_j \, \partial \theta_k \right|$ is less than a function that has a finite expectation. This condition allows us to truncate the Taylor series.

6. Properties of Maximum Likelihood Estimators

Definition (Regularity conditions, Zivot 2001)
A pdf $f_X(\theta; x)$ is regular if and only if:
R1 The support of the random variables $X$, $S_X = \{ x : f_X(\theta; x) > 0 \}$, does not depend on $\theta$.
R2 $f_X(\theta; x)$ is at least three times differentiable with respect to $\theta$, and these derivatives are continuous.
R3 The true value of $\theta$ lies in a compact set $\Theta$.

6. Properties of Maximum Likelihood Estimators

Under these regularity conditions, the maximum likelihood estimator $\hat{\theta}$ possesses many appealing properties:

1 The maximum likelihood estimator is consistent.
2 The maximum likelihood estimator is asymptotically normal (the magic of the MLE..).
3 The maximum likelihood estimator is asymptotically optimal or efficient.
4 The maximum likelihood estimator is equivariant: if $\hat{\theta}$ is an estimator of $\theta$, then $g(\hat{\theta})$ is an estimator of $g(\theta)$.

6. Properties of Maximum Likelihood Estimators

Theorem (Consistency)
Under regularity conditions, the maximum likelihood estimator is consistent:

$$\hat{\theta} \xrightarrow[N \to \infty]{p} \theta_0$$

or equivalently:

$$\underset{N \to \infty}{\text{plim}} \; \hat{\theta} = \theta_0$$

where $\theta_0$ denotes the true value of the parameter $\theta$.

6. Properties of Maximum Likelihood Estimators

Sketch of the proof (Greene, 2007)
Because $\hat{\theta}$ is the MLE, in any finite sample, for any $\theta \neq \hat{\theta}$ (including the true $\theta_0$) it must be true that:

$$\ln L_N\left( \hat{\theta}; y \mid x \right) \geq \ln L_N\left( \theta; y \mid x \right)$$

Consider, then, the random variable $L_N(\theta; Y \mid x) / L_N(\theta_0; Y \mid x)$. Because the log function is strictly concave, from Jensen's inequality we have:

$$\mathbb{E}\left( \ln \frac{L_N(\theta; Y \mid x)}{L_N(\theta_0; Y \mid x)} \right) \leq \ln \mathbb{E}\left( \frac{L_N(\theta; Y \mid x)}{L_N(\theta_0; Y \mid x)} \right)$$

6. Properties of Maximum Likelihood Estimators

Sketch of the proof, cont'd
The expectation on the right-hand side is exactly equal to one, as

$$\mathbb{E}\left( \frac{L_N(\theta; Y \mid x)}{L_N(\theta_0; Y \mid x)} \right) = \int \frac{L_N(\theta; y \mid x)}{L_N(\theta_0; y \mid x)} \, L_N(\theta_0; y \mid x) \, dy = \int L_N(\theta; y \mid x) \, dy = 1$$

which is simply the integral of a joint density.

6. Properties of Maximum Likelihood Estimators

Sketch of the proof, cont'd
So we have:

$$\mathbb{E}\left( \ln \frac{L_N(\theta; Y \mid x)}{L_N(\theta_0; Y \mid x)} \right) \leq \ln \mathbb{E}\left( \frac{L_N(\theta; Y \mid x)}{L_N(\theta_0; Y \mid x)} \right) = \ln(1) = 0$$

Divide the left-hand side of this equation by N to produce:

$$\mathbb{E}\left( \frac{1}{N} \ln L_N(\theta; Y \mid x) \right) \leq \mathbb{E}\left( \frac{1}{N} \ln L_N(\theta_0; Y \mid x) \right)$$

This produces a central result:

6. Properties of Maximum Likelihood Estimators

Theorem (Likelihood Inequality)
The expected value of the log-likelihood is maximized at the true value of the parameters. For any $\theta$, including $\hat{\theta}$:

$$\mathbb{E}\left( \frac{1}{N} \ell_N(\theta_0; Y \mid x) \right) \geq \mathbb{E}\left( \frac{1}{N} \ell_N(\theta; Y \mid x) \right)$$

6. Properties of Maximum Likelihood Estimators

Sketch of the proof, cont'd
Notice that:

$$\frac{1}{N} \ell_N(\theta; Y \mid x) = \frac{1}{N} \sum_{i=1}^{N} \ell_i(\theta; Y_i \mid x_i)$$

where the elements $\ell_i(\theta; Y_i \mid x_i)$, for $i = 1, .., N$, are i.i.d. So, using a law of large numbers, we get:

$$\frac{1}{N} \ell_N(\theta; Y \mid x) \xrightarrow[N \to \infty]{p} \mathbb{E}\left( \frac{1}{N} \ell_N(\theta; Y \mid x) \right)$$

6. Properties of Maximum Likelihood Estimators
Sketch of the proof, cont'd
The likelihood inequality for $\theta = \hat{\theta}$ implies:

$$\mathbb{E}\left( \frac{1}{N} \ell_N(\theta_0; Y \mid x) \right) \geq \mathbb{E}\left( \frac{1}{N} \ell_N(\hat{\theta}; Y \mid x) \right)$$

with

$$\frac{1}{N} \ell_N(\theta_0; Y \mid x) \xrightarrow[N \to \infty]{p} \mathbb{E}\left( \frac{1}{N} \ell_N(\theta_0; Y \mid x) \right) \qquad \frac{1}{N} \ell_N(\hat{\theta}; Y \mid x) \xrightarrow[N \to \infty]{p} \mathbb{E}\left( \frac{1}{N} \ell_N(\hat{\theta}; Y \mid x) \right)$$

and thus:

$$\lim_{N \to \infty} \Pr\left( \frac{1}{N} \ell_N(\theta_0; Y \mid x) \geq \frac{1}{N} \ell_N(\hat{\theta}; Y \mid x) \right) = 1$$

6. Properties of Maximum Likelihood Estimators
Sketch of the proof, cont'd
So we have two results:

$$\lim_{N \to \infty} \Pr\left( \frac{1}{N} \ell_N(\theta_0; Y \mid x) \geq \frac{1}{N} \ell_N(\hat{\theta}; Y \mid x) \right) = 1 \qquad \frac{1}{N} \ell_N(\hat{\theta}; Y \mid x) \geq \frac{1}{N} \ell_N(\theta_0; Y \mid x) \quad \forall N$$

It necessarily implies that:

$$\frac{1}{N} \ell_N(\hat{\theta}; Y \mid x) \xrightarrow[N \to \infty]{p} \frac{1}{N} \ell_N(\theta_0; Y \mid x)$$

If $\theta$ is a scalar, we immediately have:

$$\hat{\theta} \xrightarrow[N \to \infty]{p} \theta_0$$

For the more general case with $\dim(\theta) = K$, see the formal proof in Amemiya (1985).
Amemiya T. (1985), Advanced Econometrics. Harvard University Press.
6. Properties of Maximum Likelihood Estimators

Remark
The proof of the consistency of the MLE is much easier when we have a closed-form expression for the maximum likelihood estimator $\hat{\theta}$:

$$\hat{\theta} = \hat{\theta}(X_1, .., X_N)$$

6. Properties of Maximum Likelihood Estimators

Example
Suppose that $D_1, D_2, .., D_N$ are i.i.d., positive random variables with $D_i \sim \text{Exp}(\theta_0)$, with pdf:

$$f_D(d; \theta) = \frac{1}{\theta} \exp\left( -\frac{d}{\theta} \right), \quad \forall d \in \mathbb{R}_+$$

$$\mathbb{E}(D_i) = \theta_0 \qquad \mathbb{V}(D_i) = \theta_0^2$$

where $\theta_0$ is the true value of $\theta$. Question: show that the MLE is consistent.

6. Properties of Maximum Likelihood Estimators

Solution
The log-likelihood function associated to the sample $\{d_1, .., d_N\}$ is defined by:

$$\ell_N(\theta; d) = -N \ln(\theta) - \frac{1}{\theta} \sum_{i=1}^{N} d_i$$

We admit that the maximum likelihood estimator corresponds to the sample mean:

$$\hat{\theta} = \frac{1}{N} \sum_{i=1}^{N} D_i$$

6. Properties of Maximum Likelihood Estimators

Solution, cont'd
Then, we have:

$$\mathbb{E}\left( \hat{\theta} \right) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}(D_i) = \theta_0 \implies \hat{\theta} \text{ is unbiased}$$

$$\mathbb{V}\left( \hat{\theta} \right) = \frac{1}{N^2} \sum_{i=1}^{N} \mathbb{V}(D_i) = \frac{\theta_0^2}{N}$$

As a consequence:

$$\mathbb{E}\left( \hat{\theta} \right) = \theta_0 \qquad \lim_{N \to \infty} \mathbb{V}\left( \hat{\theta} \right) = 0$$

and, since convergence in mean square implies convergence in probability:

$$\hat{\theta} \xrightarrow[N \to \infty]{p} \theta_0$$
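A quick numerical illustration of this consistency result (a sketch in Python with NumPy assumed; the true value $\theta_0 = 2$ and the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
theta0 = 2.0

# The sample mean (the MLE here) approaches theta0 as N grows
for N in (10, 100, 10_000, 1_000_000):
    theta_hat = rng.exponential(theta0, N).mean()
    print(N, theta_hat)
```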

6. Properties of Maximum Likelihood Estimators

Lemma
Under stronger conditions, the maximum likelihood estimator converges almost surely to $\theta_0$ (and almost sure convergence implies convergence in probability):

$$\hat{\theta} \xrightarrow[N \to \infty]{a.s.} \theta_0 \implies \hat{\theta} \xrightarrow[N \to \infty]{p} \theta_0$$

6. Properties of Maximum Likelihood Estimators

1 If we restrict ourselves to the class of unbiased estimators (linear and nonlinear), then we define the best estimator as the one with the smallest variance.
2 With linear estimators (next chapter), the Gauss-Markov theorem tells us that the ordinary least squares (OLS) estimator is best (BLUE).
3 When we expand the class of estimators to include linear and nonlinear estimators, it turns out that we can establish an absolute lower bound on the variance of any unbiased estimator $\hat{\theta}$ of $\theta$ under certain conditions.
4 Then, if an unbiased estimator $\hat{\theta}$ has a variance equal to the lower bound, we have found the best unbiased estimator (BUE).

6. Properties of Maximum Likelihood Estimators

Definition (Cramer-Rao or FDCR bound)
Let $X_1, .., X_N$ be an i.i.d. sample with pdf $f_X(\theta; x)$. Let $\hat{\theta}$ be an unbiased estimator of $\theta$, i.e. $\mathbb{E}(\hat{\theta}) = \theta$. If $f_X(\theta; x)$ is regular, then:

$$\mathbb{V}\left( \hat{\theta} \right) \geq I_N^{-1}(\theta_0) \quad \text{(FDCR or Cramer-Rao bound)}$$

where $I_N(\theta_0)$ denotes the Fisher information number for the sample evaluated at the true value $\theta_0$.

6. Properties of Maximum Likelihood Estimators
Remarks
1 Hence, the Cramer-Rao bound is the inverse of the information matrix associated to the sample. Reminder: three definitions for $I_N(\theta_0)$:

$$I_N(\theta_0) = \mathbb{V}\left( \left. \frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta} \right|_{\theta_0} \right)$$

$$I_N(\theta_0) = \mathbb{E}\left( \left. \frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta} \frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta^{\top}} \right|_{\theta_0} \right)$$

$$I_N(\theta_0) = -\mathbb{E}\left( \left. \frac{\partial^2 \ell_N(\theta; Y \mid x)}{\partial \theta \, \partial \theta^{\top}} \right|_{\theta_0} \right)$$

2 If $\theta$ is a vector, then $\mathbb{V}(\hat{\theta}) \geq I_N^{-1}(\theta_0)$ means that $\mathbb{V}(\hat{\theta}) - I_N^{-1}(\theta_0)$ is positive semi-definite.
6. Properties of Maximum Likelihood Estimators

Theorem (Efficiency)
Under regularity conditions, the maximum likelihood estimator is asymptotically efficient and attains the FDCR (Frechet - Darmois - Cramer - Rao) or Cramer-Rao bound:

$$\mathbb{V}\left( \hat{\theta} \right) = I_N^{-1}(\theta_0)$$

where $I_N(\theta_0)$ denotes the Fisher information matrix associated to the sample evaluated at the true value $\theta_0$.

6. Properties of Maximum Likelihood Estimators

Example (Exponential Distribution)
Suppose that $D_1, D_2, .., D_N$ are i.i.d., positive random variables with $D_i \sim \text{Exp}(\theta_0)$, with pdf:

$$f_D(d; \theta) = \frac{1}{\theta} \exp\left( -\frac{d}{\theta} \right), \quad \forall d \in \mathbb{R}_+$$

$$\mathbb{E}(D_i) = \theta_0 \qquad \mathbb{V}(D_i) = \theta_0^2$$

where $\theta_0$ is the true value of $\theta$. Question: show that the MLE is efficient.

6. Properties of Maximum Likelihood Estimators

Solution
We have shown that the maximum likelihood estimator corresponds to the sample mean:

$$\hat{\theta} = \frac{1}{N} \sum_{i=1}^{N} D_i \qquad \mathbb{E}\left( \hat{\theta} \right) = \theta_0 \qquad \mathbb{V}\left( \hat{\theta} \right) = \frac{\theta_0^2}{N}$$

6. Properties of Maximum Likelihood Estimators

Solution, cont'd
The log-likelihood function is:

$$\ell_N(\theta; d) = -N \ln(\theta) - \frac{1}{\theta} \sum_{i=1}^{N} d_i$$

The score vector is defined by:

$$s_N(\theta; D) = \frac{\partial \ell_N(\theta; D)}{\partial \theta} = -\frac{N}{\theta} + \frac{1}{\theta^2} \sum_{i=1}^{N} D_i$$

6. Properties of Maximum Likelihood Estimators
Solution, cont'd
Let us use one of the three definitions of the information quantity $I_N(\theta)$:

$$I_N(\theta) = \mathbb{V}\left( \frac{\partial \ell_N(\theta; D)}{\partial \theta} \right) = \mathbb{V}\left( -\frac{N}{\theta} + \frac{1}{\theta^2} \sum_{i=1}^{N} D_i \right) = \frac{1}{\theta^4} \sum_{i=1}^{N} \mathbb{V}(D_i) = \frac{N \theta^2}{\theta^4} = \frac{N}{\theta^2}$$

Then, $\hat{\theta}$ is efficient and attains the Cramer-Rao bound:

$$\mathbb{V}\left( \hat{\theta} \right) = I_N^{-1}(\theta_0) = \frac{\theta_0^2}{N}$$
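The same information quantity can be derived symbolically; here is a sketch using SymPy (assumed available), differentiating the log-density twice and taking the expectation via $\mathbb{E}(D_i) = \theta$:

```python
import sympy as sp

theta, d, N = sp.symbols('theta d N', positive=True)
logf = -sp.log(theta) - d / theta        # log-density of one observation

hess = sp.diff(logf, theta, 2)           # second derivative w.r.t. theta

# Information for one observation: -E[hess]; the expectation only involves
# E(D_i) = theta, so we can substitute d -> theta
info_one = sp.simplify(-hess.subs(d, theta))
print(info_one)                          # 1/theta**2
print(sp.simplify(N * info_one))         # N/theta**2, the sample information
```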
6. Properties of Maximum Likelihood Estimators

Theorem (Convergence of the MLE)
Under suitable regularity conditions, the MLE is asymptotically normally distributed with:

$$\sqrt{N} \left( \hat{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, I^{-1}(\theta_0) \right)$$

where $\theta_0$ denotes the true value of the parameter and $I(\theta_0)$ the (average) Fisher information matrix for one observation.

6. Properties of Maximum Likelihood Estimators

Corollary
Another way to write this result is to say that, for a large sample size N, the MLE $\hat{\theta}$ is approximately distributed according to a normal distribution:

$$\hat{\theta} \overset{asy}{\sim} \mathcal{N}\left( \theta_0, N^{-1} I^{-1}(\theta_0) \right)$$

or equivalently:

$$\hat{\theta} \overset{asy}{\sim} \mathcal{N}\left( \theta_0, I_N^{-1}(\theta_0) \right)$$

where $I_N(\theta_0) = N \times I(\theta_0)$ denotes the Fisher information matrix associated to the sample.

6. Properties of Maximum Likelihood Estimators

Definition (Asymptotic Variance)
The asymptotic variance of the MLE is defined by:

$$V_{asy}\left( \hat{\theta} \right) = I_N^{-1}(\theta_0)$$

where $I_N(\theta_0)$ denotes the Fisher information matrix associated to the sample. This asymptotic variance of the MLE corresponds to the Cramer-Rao or FDCR bound.

6. Properties of Maximum Likelihood Estimators

The magic of the MLE

6. Properties of Maximum Likelihood Estimators
Proof (MLE convergence)
At the maximum likelihood estimator, the gradient of the log-likelihood equals zero (FOC):

$$g_N\left( \hat{\theta} \right) \equiv g_N\left( \hat{\theta}; y \mid x \right) = \left. \frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta} \right|_{\hat{\theta}} = \underset{(K,1)}{0_K}$$

where $\hat{\theta} = \hat{\theta}(x)$ denotes here the ML estimate. Expand this set of equations in a Taylor series around the true parameters $\theta_0$. We will use the mean value theorem to truncate the Taylor series at the second term:

$$g_N\left( \hat{\theta} \right) = g_N(\theta_0) + H_N\left( \bar{\theta} \right) \left( \hat{\theta} - \theta_0 \right) = 0$$

The Hessian is evaluated at a point $\bar{\theta}$ that is between $\hat{\theta}$ and $\theta_0$, for instance $\bar{\theta} = w \hat{\theta} + (1 - w) \theta_0$ for some $0 < w < 1$.
6. Properties of Maximum Likelihood Estimators
Proof (MLE convergence, cont'd)
We then rearrange this equation and multiply the result by $\sqrt{N}$ to obtain:

$$\sqrt{N} \left( \hat{\theta} - \theta_0 \right) = \left( -H_N\left( \bar{\theta} \right) \right)^{-1} \sqrt{N} \, g_N(\theta_0)$$

By dividing $H_N$ and $g_N(\theta_0)$ by N, we obtain:

$$\sqrt{N} \left( \hat{\theta} - \theta_0 \right) = \left( -\frac{1}{N} H_N\left( \bar{\theta} \right) \right)^{-1} \sqrt{N} \left( \frac{1}{N} g_N(\theta_0) \right) = \left( -\frac{1}{N} H_N\left( \bar{\theta} \right) \right)^{-1} \sqrt{N} \, \bar{g}(\theta_0)$$

where $\bar{g}(\theta_0)$ denotes the sample mean of the individual gradient vectors:

$$\bar{g}(\theta_0) = \frac{1}{N} g_N(\theta_0) = \frac{1}{N} \sum_{i=1}^{N} g_i(\theta_0; y_i \mid x_i)$$

6. Properties of Maximum Likelihood Estimators

Proof (MLE convergence, cont'd)
Let us now consider the same expression in terms of random variables: $\hat{\theta}$ now denotes the ML estimator, $H_N = H_N\left( \bar{\theta}; Y \mid x \right)$ the Hessian, and $\bar{s}(\theta_0; Y \mid x)$ the mean score. We have:

$$\sqrt{N} \left( \hat{\theta} - \theta_0 \right) = \left( -\frac{1}{N} H_N\left( \bar{\theta}; Y \mid x \right) \right)^{-1} \sqrt{N} \, \bar{s}(\theta_0; Y \mid x)$$

where the score vectors associated to the variables $Y_i$ are i.i.d.:

$$\bar{s}(\theta_0; Y \mid x) = \frac{1}{N} \sum_{i=1}^{N} s_i(\theta_0; Y_i \mid x_i)$$

6. Properties of Maximum Likelihood Estimators

Proof (MLE convergence, cont'd)
Let us consider the first element:

$$\bar{s}(\theta_0) = \frac{1}{N} \sum_{i=1}^{N} s_i(\theta_0; Y_i \mid x_i)$$

The individual scores $s_i(\theta_0; Y_i \mid x_i)$ are i.i.d. with:

$$\mathbb{E}\left( s_i(\theta_0; Y_i \mid x_i) \right) = 0 \qquad \mathbb{E}_x\left( \mathbb{V}\left( s_i(\theta_0; Y_i \mid x_i) \right) \right) = \mathbb{E}_x\left( I_i(\theta_0) \right) = I(\theta_0)$$

By using the Lindeberg-Levy Central Limit Theorem, we have:

$$\sqrt{N} \, \bar{s}(\theta_0) \xrightarrow{d} \mathcal{N}\left( 0, I(\theta_0) \right)$$

6. Properties of Maximum Likelihood Estimators
Proof (MLE convergence, cont'd)
We know that:

$$\frac{1}{N} H_N\left( \bar{\theta}; Y \mid x \right) = \frac{1}{N} \sum_{i=1}^{N} H_i\left( \bar{\theta}; Y_i \mid x_i \right)$$

where the Hessian matrices $H_i\left( \bar{\theta}; Y_i \mid x_i \right)$ are i.i.d. Besides, because $\text{plim}\left( \hat{\theta} - \theta_0 \right) = 0$, $\text{plim}\left( \bar{\theta} - \theta_0 \right) = 0$ as well. By applying a law of large numbers, we get:

$$\frac{1}{N} H_N\left( \bar{\theta}; Y \mid x \right) \xrightarrow{p} \mathbb{E}_X \, \mathbb{E}\left( H_i(\theta_0; Y_i \mid x_i) \right)$$

with

$$\mathbb{E}_X \, \mathbb{E}\left( H_i(\theta_0; Y_i \mid x_i) \right) = \mathbb{E}_X \, \mathbb{E}\left( \frac{\partial^2 \ell_i(\theta; Y_i \mid x_i)}{\partial \theta \, \partial \theta^{\top}} \right) = -I(\theta_0)$$
6. Properties of Maximum Likelihood Estimators

Reminder:
If $X_N$ (a $K \times K$ matrix) and $Y_N$ (a $K \times 1$ vector) verify:

$$X_N \xrightarrow{p} X \qquad Y_N \xrightarrow{d} \mathcal{N}(0, \Sigma)$$

then:

$$X_N Y_N \xrightarrow{d} \mathcal{N}\left( 0, X \Sigma X^{\top} \right)$$

6. Properties of Maximum Likelihood Estimators

Proof (MLE convergence, cont'd)
Here we have:

$$\sqrt{N} \left( \hat{\theta} - \theta_0 \right) = \left( -\frac{1}{N} H_N\left( \bar{\theta}; Y \mid x \right) \right)^{-1} \sqrt{N} \, \bar{s}(\theta_0; Y \mid x)$$

$$\left( -\frac{1}{N} H_N\left( \bar{\theta}; Y \mid x \right) \right)^{-1} \xrightarrow{p} I^{-1}(\theta_0) \quad \text{(a symmetric matrix)}$$

$$\sqrt{N} \, \bar{s}(\theta_0) \xrightarrow{d} \mathcal{N}\left( 0, I(\theta_0) \right)$$

Then, we get:

$$\sqrt{N} \left( \hat{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, I^{-1}(\theta_0) \, I(\theta_0) \, I^{-1}(\theta_0) \right)$$

6. Properties of Maximum Likelihood Estimators

Proof (MLE convergence, cont'd)
And finally....

$$\sqrt{N} \left( \hat{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, I^{-1}(\theta_0) \right)$$

The magic of the MLE.....

6. Properties of Maximum Likelihood Estimators

Example (Exponential Distribution)
Suppose that $D_1, D_2, .., D_N$ are i.i.d., positive random variables with $D_i \sim \text{Exp}(\theta_0)$, with pdf:

$$f_D(d; \theta) = \frac{1}{\theta} \exp\left( -\frac{d}{\theta} \right), \quad \forall d \in \mathbb{R}_+$$

$$\mathbb{E}(D_i) = \theta_0 \qquad \mathbb{V}(D_i) = \theta_0^2$$

where $\theta_0$ is the true value of $\theta$. Question: what is the asymptotic distribution of the MLE? Propose a consistent estimator of the asymptotic variance of $\hat{\theta}$.

6. Properties of Maximum Likelihood Estimators
Solution
We have shown that $\hat{\theta} = (1/N) \sum_{i=1}^{N} D_i$ and:

$$s_i(\theta; D_i) = \frac{\partial \ell_i(\theta; D_i)}{\partial \theta} = -\frac{1}{\theta} + \frac{D_i}{\theta^2}$$

The (average) Fisher information matrix associated to $D_i$ is:

$$I(\theta) = \mathbb{V}\left( -\frac{1}{\theta} + \frac{D_i}{\theta^2} \right) = \frac{1}{\theta^4} \mathbb{V}(D_i) = \frac{1}{\theta^2}$$

Then, the asymptotic distribution of $\hat{\theta}$ is:

$$\sqrt{N} \left( \hat{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, \theta_0^2 \right)$$

or equivalently:

$$\hat{\theta} \overset{asy}{\sim} \mathcal{N}\left( \theta_0, \frac{\theta_0^2}{N} \right)$$
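A Monte Carlo check of this asymptotic normality result (a sketch in Python with NumPy assumed; the true value, sample size, and number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, N, reps = 2.0, 400, 5000

# reps independent samples of size N; the MLE is the row-wise sample mean
theta_hat = rng.exponential(theta0, (reps, N)).mean(axis=1)
z = np.sqrt(N) * (theta_hat - theta0)

print(z.mean())   # close to 0
print(z.var())    # close to theta0**2 = 4.0
```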
6. Properties of Maximum Likelihood Estimators

Solution, cont'd
The asymptotic variance of $\hat{\theta}$ is:

$$V_{asy}\left( \hat{\theta} \right) = \frac{\theta_0^2}{N}$$

A consistent estimator of $V_{asy}\left( \hat{\theta} \right)$ is simply defined by:

$$\hat{V}_{asy}\left( \hat{\theta} \right) = \frac{\hat{\theta}^2}{N}$$

6. Properties of Maximum Likelihood Estimators

Example (Linear Regression Model)
Let us consider the previous linear regression model $y_i = x_i^{\top} \beta + \varepsilon_i$, with $\varepsilon_i \sim \mathcal{N}.i.d.\left( 0, \sigma^2 \right)$. Let us denote by $\theta$ the $(K+1) \times 1$ vector defined by $\theta = \left( \beta^{\top} \; \sigma^2 \right)^{\top}$. The MLE of $\theta$ is defined by:

$$\hat{\theta} = \begin{pmatrix} \hat{\beta} \\ \hat{\sigma}^2 \end{pmatrix} \qquad \hat{\beta} = \left( \sum_{i=1}^{N} X_i X_i^{\top} \right)^{-1} \left( \sum_{i=1}^{N} X_i Y_i \right) \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \left( Y_i - X_i^{\top} \hat{\beta} \right)^2$$

Question: what is the asymptotic distribution of $\hat{\theta}$? Propose an estimator of the asymptotic variance.

6. Properties of Maximum Likelihood Estimators

Solution
This model satisfies the regularity conditions. We have shown that the average Fisher information matrix is equal to:

$$I(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} \mathbb{E}_X\left( X_i X_i^{\top} \right) & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}$$

From the MLE convergence theorem, we get immediately:

$$\sqrt{N} \left( \hat{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, I^{-1}(\theta_0) \right)$$

where $\theta_0$ is the true value of $\theta$.

6. Properties of Maximum Likelihood Estimators

Solution, cont'd
The asymptotic variance-covariance matrix of $\hat{\theta}$ is equal to:

$$V_{asy}\left( \hat{\theta} \right) = N^{-1} I^{-1}(\theta_0) = I_N^{-1}(\theta_0)$$

with

$$I_N(\theta) = \begin{pmatrix} \frac{N}{\sigma^2} \mathbb{E}_X\left( X_i X_i^{\top} \right) & 0 \\ 0 & \frac{N}{2\sigma^4} \end{pmatrix}$$

6. Properties of Maximum Likelihood Estimators

Solution, cont'd
A consistent estimator of $I_N(\theta)$ is:

$$\hat{I}_N\left( \hat{\theta} \right) = \hat{V}_{asy}^{-1}\left( \hat{\theta} \right) = \begin{pmatrix} \frac{N}{\hat{\sigma}^2} \hat{Q}_X & 0 \\ 0 & \frac{N}{2\hat{\sigma}^4} \end{pmatrix}$$

with

$$\hat{Q}_X = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^{\top}$$

6. Properties of Maximum Likelihood Estimators

Solution, cont'd
Thus we get:

$$\hat{\beta} \overset{asy}{\sim} \mathcal{N}\left( \beta_0, \; \hat{\sigma}^2 \left( \sum_{i=1}^{N} x_i x_i^{\top} \right)^{-1} \right) \qquad \hat{\sigma}^2 \overset{asy}{\sim} \mathcal{N}\left( \sigma_0^2, \; \frac{2\hat{\sigma}^4}{N} \right)$$

6. Properties of Maximum Likelihood Estimators

Summary
Under regularity conditions:

1 The MLE is consistent.
2 The MLE is asymptotically efficient and its variance attains the FDCR or Cramer-Rao bound.
3 The MLE is asymptotically normally distributed.

6. Properties of Maximum Likelihood Estimators

But finite-sample properties can be very different from large-sample properties:

1 The maximum likelihood estimator is consistent but can be severely biased in finite samples.
2 The estimation of the variance-covariance matrix can be seriously unreliable in finite samples.

6. Properties of Maximum Likelihood Estimators

Theorem (Equivariance)
Under regularity conditions, if $g(.)$ is a continuously differentiable function of $\theta$ defined from $\mathbb{R}^K$ to $\mathbb{R}^P$, then:

$$g\left( \hat{\theta} \right) \xrightarrow{p} g(\theta_0)$$

$$\sqrt{N} \left( g\left( \hat{\theta} \right) - g(\theta_0) \right) \xrightarrow{d} \mathcal{N}\left( 0, \; G(\theta_0) \, I^{-1}(\theta_0) \, G(\theta_0)^{\top} \right)$$

where $\theta_0$ is the true value of the parameters and the $P \times K$ matrix $G(\theta)$ is defined by:

$$G(\theta) = \frac{\partial g(\theta)}{\partial \theta^{\top}}$$
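As an illustration of this equivariance (delta-method) result, consider the exponential example above with $g(\theta) = 1/\theta$ (an illustrative choice): $I^{-1}(\theta_0) = \theta_0^2$ and $G(\theta_0) = -1/\theta_0^2$, so the asymptotic variance of $\sqrt{N}\left( g(\hat{\theta}) - g(\theta_0) \right)$ is $1/\theta_0^2$. A simulation sketch (Python with NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, N, reps = 2.0, 500, 5000

theta_hat = rng.exponential(theta0, (reps, N)).mean(axis=1)
g_hat = 1.0 / theta_hat                    # g(theta) = 1/theta (the rate)

# Delta method: G * I^{-1} * G' = (1/theta0**2) * theta0**2 * (1/theta0**2)
z = np.sqrt(N) * (g_hat - 1.0 / theta0)
print(z.var(), 1.0 / theta0**2)            # both close to 0.25
```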

Key Concepts of the Chapter 2

1 Likelihood and log-likelihood function


2 Maximum likelihood estimator (MLE) and Maximum likelihood
estimate
3 Gradient and Hessian Matrix (deterministic elements)
4 Score Vector and Hessian Matrix (random elements)
5 Fisher information matrix associated to the sample
6 (Average) Fisher information matrix for one observation
7 FDCR or Cramer-Rao Bound: the notion of efficiency
8 Asymptotic distribution of the MLE
9 Asymptotic variance of the MLE
10 Estimator of the asymptotic variance of the MLE

End of Chapter 2
