
Maximum likelihood estimation of Heckman's sample selection model

Herman J. Bierens
October 2007
1 Heckman's sample selection model

1.1 Introduction

Heckman's sample selection model$^1$ is based on two latent dependent variables models:
$$Y_1^* = \beta'X + U_1, \qquad (1)$$
$$Y_2^* = \gamma'Z + U_2, \qquad (2)$$
where $X$ and $Z$ are vectors of regressors, possibly containing common components, including intercepts, and the errors $U_1$ and $U_2$ are, conditional on $X$ and $Z$, jointly bivariate normally distributed with zero mean vector and variance matrix $\Sigma$.

The model for $Y_1^*$ is the one we are interested in, but $Y_1^*$ is only observable if $Y_2^* > 0$. Thus the observed dependent variable $Y$ is
$$Y = Y_1^* \quad\text{if } Y_2^* > 0,$$
$$Y = \text{missing value} \quad\text{if } Y_2^* \le 0.$$
However, the $Z$'s are observable if $Y$ is a missing value, and the $X$'s are observable if the $Y$'s are.

$^1$Heckman, James J. (1979): "Sample Selection Bias as a Specification Error," Econometrica 47, 153-161. (Heckman got the Nobel prize for this article.)
The variance matrix $\Sigma$ can be written as
$$\Sigma = \Lambda\Lambda',$$
where $\Lambda$ is an upper-triangular matrix:
$$\Lambda = \begin{pmatrix} \sigma_1 & \sigma_2 \\ 0 & \sigma_3 \end{pmatrix}.$$
Consequently, we can write
$$U_1 = \sigma_1 e_1 + \sigma_2 e_2,$$
$$U_2 = \sigma_3 e_2,$$
where $e_1$ and $e_2$ are independent standard normally distributed. Thus the latent dependent variables models (1) and (2) can be written as
$$Y_1^* = \beta'X + \sigma_1 e_1 + \sigma_2 e_2, \qquad (3)$$
$$Y_2^* = \gamma'Z + \sigma_3 e_2. \qquad (4)$$
Without loss of generality we may assume that $\sigma_1 > 0$, and since only the sign of $Y_2^*$ plays a role, we may set $\sigma_3 = 1$. Then the conditional probability of a missing value of $Y$ is:
$$P[Y_2^* \le 0 \mid Z, X] = P[e_2 \le -\gamma'Z] = 1 - P[e_2 > -\gamma'Z] = 1 - P[e_2 \le \gamma'Z] = 1 - F(\gamma'Z),$$
where $F$ is the distribution function of the standard normal distribution, i.e.,
$$F(x) = \int_{-\infty}^{x} f(u)\,du, \qquad (5)$$
with
$$f(x) = \frac{\exp[-x^2/2]}{\sqrt{2\pi}}. \qquad (6)$$
Thus, from now on I will assume that
$$\sigma_1 > 0, \qquad \sigma_3 = 1.$$
Let $D$ be a dummy variable taking the value 1 if $Y$ is observed, and 0 if not. Then
$$P[D = 1 \mid Z, X] = F(\gamma'Z). \qquad (7)$$
The distribution function of $Y$ conditional on the event $D = 1$ and $X$ and $Z$ is now given by
$$H(y|X,Z) = P[Y \le y \mid D = 1, X, Z] \qquad (8)$$
$$= \frac{P[Y \le y \text{ and } D = 1 \mid X, Z]}{P[D = 1 \mid X, Z]} = \frac{P[Y_1^* \le y \text{ and } Y_2^* > 0 \mid X, Z]}{F(\gamma'Z)}$$
$$= \frac{P[\sigma_2 e_2 \le y - \beta'X - \sigma_1 e_1 \text{ and } -\gamma'Z < e_2 \mid X, Z]}{F(\gamma'Z)}.$$
1.2 The case $\sigma_2 > 0$

In order to evaluate expression (8) further, and derive the corresponding conditional density, assume first that $\sigma_2 > 0$. Then (8) times $F(\gamma'Z)$ becomes
$$F(\gamma'Z)\,H(y|X,Z) = P\left[-\gamma'Z < e_2 \le (y - \beta'X - \sigma_1 e_1)/\sigma_2 \mid X, Z\right]$$
$$= \int_{-\infty}^{\infty} P\left[-\gamma'Z < e_2 \le (y - \beta'X - \sigma_1 u)/\sigma_2 \mid X, Z\right] f(u)\,du$$
$$= \int_{-\infty}^{(y-\beta'X+\sigma_2\gamma'Z)/\sigma_1} \left[F\left((y - \beta'X - \sigma_1 u)/\sigma_2\right) - F(-\gamma'Z)\right] f(u)\,du$$
$$= \int_{-\infty}^{(y-\beta'X+\sigma_2\gamma'Z)/\sigma_1} F\left((y - \beta'X - \sigma_1 u)/\sigma_2\right) f(u)\,du - F(-\gamma'Z)\,F\left((y - \beta'X + \sigma_2\gamma'Z)/\sigma_1\right)$$
$$= \frac{\sigma_2}{\sigma_1}\int_{-\gamma'Z}^{\infty} F(v)\,f\left((y - \beta'X - \sigma_2 v)/\sigma_1\right) dv - F(-\gamma'Z)\,F\left((y - \beta'X + \sigma_2\gamma'Z)/\sigma_1\right)$$
$$= -\int_{-\gamma'Z}^{\infty} F(v)\,\frac{\partial F\left((y - \beta'X - \sigma_2 v)/\sigma_1\right)}{\partial v}\,dv - F(-\gamma'Z)\,F\left((y - \beta'X + \sigma_2\gamma'Z)/\sigma_1\right)$$
$$= -F(v)\,F\left((y - \beta'X - \sigma_2 v)/\sigma_1\right)\Big|_{-\gamma'Z}^{\infty} + \int_{-\gamma'Z}^{\infty} F\left((y - \beta'X - \sigma_2 v)/\sigma_1\right) f(v)\,dv - F(-\gamma'Z)\,F\left((y - \beta'X + \sigma_2\gamma'Z)/\sigma_1\right)$$
$$= \int_{-\gamma'Z}^{\infty} F\left((y - \beta'X - \sigma_2 v)/\sigma_1\right) f(v)\,dv.$$
The fifth equality follows by substituting
$$u = (y - \beta'X - \sigma_2 v)/\sigma_1,$$
and the last two equalities follow from integration by parts.
The corresponding conditional density is now
$$h(y|X,Z) = \frac{\partial H(y|X,Z)}{\partial y} = \frac{1}{\sigma_1 F(\gamma'Z)}\int_{-\gamma'Z}^{\infty} f\left((y - \beta'X - \sigma_2 v)/\sigma_1\right) f(v)\,dv.$$
It can be shown (see Appendix 1) that for the standard normal density $f$,
$$\int_{c}^{\infty} f(a + bx)f(x)\,dx = \frac{f\left(a/\sqrt{b^2+1}\right)}{\sqrt{b^2+1}}\left[1 - F\left(c\sqrt{b^2+1} + ab\big/\sqrt{b^2+1}\right)\right]. \qquad (9)$$
Substituting $c = -\gamma'Z$, $a = (y - \beta'X)/\sigma_1$ and $b = -\sigma_2/\sigma_1$, i.e.,
$$\frac{1}{\sqrt{b^2+1}} = \frac{\sigma_1}{\sqrt{\sigma_1^2 + \sigma_2^2}}, \qquad \frac{a}{\sqrt{b^2+1}} = \frac{y - \beta'X}{\sqrt{\sigma_1^2 + \sigma_2^2}},$$
$$c\sqrt{b^2+1} + \frac{ab}{\sqrt{b^2+1}} = -\frac{\sigma_2(y - \beta'X) + (\sigma_1^2 + \sigma_2^2)\gamma'Z}{\sigma_1\sqrt{\sigma_1^2 + \sigma_2^2}},$$
it follows therefore that
$$h(y|X,Z) = \frac{f\left((y - \beta'X)/\sqrt{\sigma_1^2 + \sigma_2^2}\right)}{\sqrt{\sigma_1^2 + \sigma_2^2}\,F(\gamma'Z)}\left(1 - F\left(-\frac{\sigma_2(y - \beta'X) + (\sigma_1^2 + \sigma_2^2)\gamma'Z}{\sigma_1\sqrt{\sigma_1^2 + \sigma_2^2}}\right)\right) \qquad (10)$$
$$= \frac{f\left((y - \beta'X)/\sqrt{\sigma_1^2 + \sigma_2^2}\right)}{\sqrt{\sigma_1^2 + \sigma_2^2}\,F(\gamma'Z)}\,F\left(\frac{\sigma_2(y - \beta'X) + (\sigma_1^2 + \sigma_2^2)\gamma'Z}{\sigma_1\sqrt{\sigma_1^2 + \sigma_2^2}}\right).$$
1.3 The case $\sigma_2 < 0$

If $\sigma_2 < 0$ then (8) times $F(\gamma'Z)$ becomes
$$F(\gamma'Z)\,H(y|X,Z) = P\left[\sigma_2 e_2 \le y - \beta'X - \sigma_1 e_1 \text{ and } -\sigma_2\gamma'Z > \sigma_2 e_2 \mid X, Z\right]$$
$$= P\left[\sigma_2 e_2 \le \min\left(y - \beta'X - \sigma_1 e_1,\ |\sigma_2|\gamma'Z\right)\right]$$
$$= P\left[-|\sigma_2| e_2 \le \min\left(y - \beta'X - \sigma_1 e_1,\ |\sigma_2|\gamma'Z\right)\right]$$
$$= P\left[e_2 \ge -\min\left((y - \beta'X - \sigma_1 e_1)/|\sigma_2|,\ \gamma'Z\right)\right]$$
$$= \int_{-\infty}^{\infty} F\left(\min\left((y - \beta'X - \sigma_1 u)/|\sigma_2|,\ \gamma'Z\right)\right) f(u)\,du$$
$$= F(\gamma'Z)\int_{-\infty}^{(y-\beta'X-|\sigma_2|\gamma'Z)/\sigma_1} f(u)\,du + \int_{(y-\beta'X-|\sigma_2|\gamma'Z)/\sigma_1}^{\infty} F\left((y - \beta'X - \sigma_1 u)/|\sigma_2|\right) f(u)\,du$$
$$= F(\gamma'Z)\,F\left((y - \beta'X - |\sigma_2|\gamma'Z)/\sigma_1\right) + \frac{|\sigma_2|}{\sigma_1}\int_{-\infty}^{\gamma'Z} F(v)\,f\left((y - \beta'X - |\sigma_2| v)/\sigma_1\right) dv$$
$$= F(\gamma'Z)\,F\left((y - \beta'X - |\sigma_2|\gamma'Z)/\sigma_1\right) - \int_{-\infty}^{\gamma'Z} F(v)\,\frac{\partial F\left((y - \beta'X - |\sigma_2| v)/\sigma_1\right)}{\partial v}\,dv$$
$$= F(\gamma'Z)\,F\left((y - \beta'X - |\sigma_2|\gamma'Z)/\sigma_1\right) - F(v)\,F\left((y - \beta'X - |\sigma_2| v)/\sigma_1\right)\Big|_{-\infty}^{\gamma'Z} + \int_{-\infty}^{\gamma'Z} F\left((y - \beta'X - |\sigma_2| v)/\sigma_1\right) f(v)\,dv$$
$$= \int_{-\infty}^{\gamma'Z} F\left((y - \beta'X - |\sigma_2| v)/\sigma_1\right) f(v)\,dv.$$
The corresponding conditional density is now
$$h(y|X,Z) = \frac{\partial H(y|X,Z)}{\partial y} = \frac{1}{\sigma_1 F(\gamma'Z)}\int_{-\infty}^{\gamma'Z} f\left((y - \beta'X - |\sigma_2| v)/\sigma_1\right) f(v)\,dv.$$
It can be shown (see Appendix 1) that for the standard normal density $f$,
$$\int_{-\infty}^{c} f(a + bx)f(x)\,dx = \frac{f\left(a/\sqrt{b^2+1}\right)}{\sqrt{b^2+1}}\,F\left(c\sqrt{b^2+1} + ab\big/\sqrt{b^2+1}\right). \qquad (11)$$
Substituting $c = \gamma'Z$, $a = (y - \beta'X)/\sigma_1$ and $b = -|\sigma_2|/\sigma_1$, i.e.,
$$\frac{1}{\sqrt{b^2+1}} = \frac{\sigma_1}{\sqrt{\sigma_1^2 + \sigma_2^2}}, \qquad \frac{a}{\sqrt{b^2+1}} = \frac{y - \beta'X}{\sqrt{\sigma_1^2 + \sigma_2^2}},$$
$$c\sqrt{b^2+1} + \frac{ab}{\sqrt{b^2+1}} = \frac{\sqrt{\sigma_1^2 + \sigma_2^2}}{\sigma_1}\,\gamma'Z - \frac{|\sigma_2|(y - \beta'X)}{\sigma_1\sqrt{\sigma_1^2 + \sigma_2^2}} = \frac{\sigma_2(y - \beta'X) + (\sigma_1^2 + \sigma_2^2)\gamma'Z}{\sigma_1\sqrt{\sigma_1^2 + \sigma_2^2}},$$
it follows that
$$h(y|X,Z) = \frac{f\left((y - \beta'X)/\sqrt{\sigma_1^2 + \sigma_2^2}\right)}{\sqrt{\sigma_1^2 + \sigma_2^2}\,F(\gamma'Z)}\,F\left(\frac{\sigma_2(y - \beta'X) + (\sigma_1^2 + \sigma_2^2)\gamma'Z}{\sigma_1\sqrt{\sigma_1^2 + \sigma_2^2}}\right),$$
which is the same as in the case $\sigma_2 > 0$.
1.4 The conditional density of the observed Y

Next, substitute
$$\sigma_1 = \sigma\sqrt{1 - \rho^2}, \qquad \sigma_2 = \rho\sigma,$$
which correspond to
$$\Sigma = \Lambda\Lambda' = \begin{pmatrix} \sigma_1^2 + \sigma_2^2 & \sigma_2 \\ \sigma_2 & 1 \end{pmatrix} = \begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix},$$
where $\sigma^2$ is the variance of $U_1$ and $\rho \in (-1, 1)$ is the correlation between $U_1$ and $U_2$. Then (10) simplifies to:
$$h(y|X, Z, \beta, \gamma, \sigma, \rho) = \frac{f\left((y - \beta'X)/\sigma\right)}{\sigma F(\gamma'Z)}\,F\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right). \qquad (12)$$
The case $\sigma_2 = 0$ corresponds to $\rho = 0$:
$$h(y|X, Z, \beta, \gamma, \sigma, 0) = f\left((y - \beta'X)/\sigma\right)/\sigma,$$
which is just the conditional density of $Y_1^*$.
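The density (12) is easy to evaluate numerically. The sketch below is my own illustration (with $f$ and $F$ taken to be `norm.pdf` and `norm.cdf` from scipy, and arbitrary test values); as a sanity check it verifies that (12) integrates to one in $y$.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def h(y, bX, gZ, sigma, rho):
    """Conditional density (12) of Y given D = 1; bX = beta'X, gZ = gamma'Z."""
    u = (y - bX) / sigma
    return (norm.pdf(u) / (sigma * norm.cdf(gZ))
            * norm.cdf((rho * u + gZ) / np.sqrt(1.0 - rho**2)))

# h(.|X,Z) should integrate to 1 for any admissible parameter values.
total, _ = quad(h, -np.inf, np.inf, args=(0.5, 0.3, 1.2, 0.6))
print(total)  # ~ 1.0
```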
2 Sample selection bias

The conditional expectation corresponding to (12) is
$$E[Y \mid D = 1, X, Z] = \beta'X + \rho\sigma\,\frac{f(\gamma'Z)}{F(\gamma'Z)} \qquad (13)$$
and the conditional variance involved is
$$Var[Y \mid D = 1, X, Z] = \sigma^2 - \rho^2\sigma^2\left(\gamma'Z + \frac{f(\gamma'Z)}{F(\gamma'Z)}\right)\frac{f(\gamma'Z)}{F(\gamma'Z)}. \qquad (14)$$
See Appendix 2. Thus
$$E[Y \mid D = 1, X] = \beta'X + \rho\sigma\,E\left[f(\gamma'Z)/F(\gamma'Z) \mid X\right]. \qquad (15)$$
The second term is the cause of the sample selection bias of the OLS estimator of $\beta$ if $Y$ is regressed on $X$ using the valid observations on $Y$ only.

Note that if $X$ and $Z$ are independent then
$$E\left[f(\gamma'Z)/F(\gamma'Z) \mid X\right] = E\left[f(\gamma'Z)/F(\gamma'Z)\right]$$
is constant, and therefore only affects the intercept.
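The bias in (15) can be exhibited by simulation. In the sketch below (my own; parameter values are arbitrary, and $X$ and $Z$ share a non-constant regressor so the bias is not absorbed by the intercept), OLS of the observed $Y$ on $X$ misses $\beta$, while adding the correction term $f(\gamma'Z)/F(\gamma'Z)$ as an extra regressor recovers it. The true $\gamma$ is used here; a feasible version would use a probit estimate as in Section 4.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 200_000
beta, gamma = np.array([1.0, 0.5]), np.array([0.2, 1.0])
sigma, rho = 1.5, 0.7

W = rng.normal(size=n)                      # regressor common to X and Z
X = np.column_stack([np.ones(n), W])
Z = np.column_stack([np.ones(n), W])
e1, e2 = rng.normal(size=n), rng.normal(size=n)
Y1 = X @ beta + sigma*np.sqrt(1 - rho**2)*e1 + sigma*rho*e2  # sigma1, sigma2
D = (Z @ gamma + e2) > 0                    # selection: Y2* > 0

Xo, Yo, Zo = X[D], Y1[D], Z[D]              # valid observations only
print(np.linalg.lstsq(Xo, Yo, rcond=None)[0])   # biased for beta = (1.0, 0.5)

mills = norm.pdf(Zo @ gamma) / norm.cdf(Zo @ gamma)  # f(g'Z)/F(g'Z)
Xc = np.column_stack([Xo, mills])
print(np.linalg.lstsq(Xc, Yo, rcond=None)[0])   # ~ (1.0, 0.5, rho*sigma)
```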
3 The log-likelihood function and score vector

Let for $j = 1, \ldots, n$, $D_j = 1$ if $Y_j$ is observed, and $D_j = 0$ if not. The regressors $X_j \in \mathbb{R}^k$ are observable if the corresponding $Y_j$ are observable, and the $Z_j \in \mathbb{R}^\ell$ are observable for all $j$. It will be assumed that the data involved is a random sample with non-response for $Y_j$ if $D_j = 0$.

Without loss of generality we may assume that $Y_j = 0$ if $D_j = 0$. The actual dependent variable is now the pair $(D_j, D_j Y_j)$, with joint conditional distribution given by
$$\frac{d}{dy}P[D_j = 1, D_j Y_j \le y \mid X_j, Z_j] = \frac{d}{dy}P[Y_j \le y \mid D_j = 1, X_j, Z_j]\,P[D_j = 1 \mid X_j, Z_j] = h(y|X_j, Z_j, \beta, \gamma, \sigma, \rho)\,F(\gamma'Z_j)$$
and
$$P[D_j = 0, D_j Y_j = 0 \mid X_j, Z_j] = P[D_j = 0 \mid X_j, Z_j] = 1 - F(\gamma'Z_j).$$
Then the log-likelihood takes the form
$$\ln L(\theta) = \sum_{j=1}^{n}(1 - D_j)\ln\left(1 - F(\gamma'Z_j)\right) + \sum_{j=1}^{n} D_j\ln\left(F(\gamma'Z_j)\right) + \sum_{j=1}^{n} D_j\ln\left(h(Y_j|X_j, Z_j, \beta, \gamma, \sigma, \rho)\right), \qquad (16)$$
where
$$\theta = (\beta', \gamma', \sigma, \rho)'. \qquad (17)$$
The corresponding score vector $\partial\ln L(\theta)/\partial\theta$ is
$$\frac{\partial\ln L(\theta)}{\partial\theta} = \sum_{j=1}^{n}\delta_j(\theta), \qquad (18)$$
where
$$\delta_j(\theta) = \left(D_j\,\frac{f(\gamma'Z_j)}{F(\gamma'Z_j)} - (1 - D_j)\,\frac{f(\gamma'Z_j)}{1 - F(\gamma'Z_j)}\right)\begin{pmatrix} 0_k \\ Z_j \\ 0 \\ 0 \end{pmatrix} + D_j\,\frac{\partial\ln\left(h(Y_j|X_j, Z_j, \beta, \gamma, \sigma, \rho)\right)}{\partial\theta}, \qquad (19)$$
with $0_k$ a $k$-vector of zeros. The partial derivative vector in (19) is derived in Appendix 3.

Moreover, recall from maximum likelihood theory that for the true parameter vector $\theta_0$,
$$\bar{H} = \lim_{n\to\infty} E\left[-\frac{1}{n}\,\frac{\partial^2\ln L(\theta)}{\partial\theta\,\partial\theta'}\bigg|_{\theta=\theta_0}\right] = \lim_{n\to\infty}\frac{1}{n}\sum_{j=1}^{n}E\left[\delta_j(\theta_0)\,\delta_j(\theta_0)'\right],$$
and that under some regularity conditions the maximum likelihood estimator $\hat{\theta}$ of $\theta_0$ satisfies
$$\sqrt{n}\left(\hat{\theta} - \theta_0\right) \to N\left(0, \bar{H}^{-1}\right)$$
in distribution, where $\bar{H}$ can be consistently estimated by
$$\hat{H} = \frac{1}{n}\sum_{j=1}^{n}\delta_j(\hat{\theta})\,\delta_j(\hat{\theta})'.$$
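The log-likelihood (16), with $h$ given by (12), is a direct transcription into code. The following sketch is my own (the packing of $\theta$ follows (17); `D`, `Y`, `X`, `Z` are numpy arrays, with $Y_j = 0$ where $D_j = 0$ as assumed above). Note that for the observed terms the $\ln F(\gamma'Z_j)$ in (16) cancels against the denominator of (12).

```python
import numpy as np
from scipy.stats import norm

def loglik(theta, D, Y, X, Z):
    """Log-likelihood (16); theta = (beta', gamma', sigma, rho)' as in (17)."""
    k, l = X.shape[1], Z.shape[1]
    beta, gamma = theta[:k], theta[k:k + l]
    sigma, rho = theta[k + l], theta[k + l + 1]
    if sigma <= 0.0 or abs(rho) >= 1.0:
        return -np.inf                        # outside the parameter space
    gZ = Z @ gamma
    ll = np.sum((1 - D) * norm.logcdf(-gZ))   # (1-D_j) ln(1 - F(gamma'Z_j))
    sel = D == 1
    u = (Y[sel] - X[sel] @ beta) / sigma      # (Y_j - beta'X_j)/sigma
    # D_j ln F(gamma'Z_j) + D_j ln h(...) collapses to the three terms below.
    ll += np.sum(norm.logpdf(u) - np.log(sigma)
                 + norm.logcdf((rho * u + gZ[sel]) / np.sqrt(1 - rho**2)))
    return ll
```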
4 Initial parameter estimates

The log-likelihood function (16) is highly nonlinear in the parameters, and even more so are the components of the score vector $\partial\ln L(\theta)/\partial\theta$ (see Appendix 3) and the elements of the Hessian matrix $\partial^2\ln L(\theta)/\partial\theta\,\partial\theta'$. Moreover, the latter matrix may not be negative definite for all values of $\theta$ (at least I could not verify this). Therefore, EasyReg maximizes the log-likelihood function (16) by using the simplex method of Nelder and Mead, which only requires evaluations of (16) itself. However, this method is rather slow, and if the Hessian is not negative definite for all values of $\theta$ one may get stuck in a local optimum. Therefore, it is important to start the simplex iteration from a starting value of $\theta$ already close to the true parameter value $\theta_0$. Such a starting value $\tilde{\theta}$, say, can be derived as follows.

The parameter vector $\gamma$ can be estimated by Probit analysis. Given the Probit estimator $\tilde{\gamma}$, say, the parameter vector $\beta$ and the parameter $\lambda = \rho\sigma$ can be estimated by regressing $Y_j$ on $X_j$ and $f(\tilde{\gamma}'Z_j)/F(\tilde{\gamma}'Z_j)$ for the observations $j$ for which $D_j = 1$, with OLS estimators $\tilde{\beta}$ and $\tilde{\lambda}$, and residuals $\tilde{v}_j$:
$$Y_j = \tilde{\beta}'X_j + \tilde{\lambda}\,f(\tilde{\gamma}'Z_j)/F(\tilde{\gamma}'Z_j) + \tilde{v}_j.$$
Now (14) suggests to estimate $\sigma^2$ by
$$\tilde{\sigma}^2 = \frac{1}{m}\sum_{j=1}^{n} D_j\tilde{v}_j^2 + \tilde{\lambda}^2\left(\frac{1}{m}\sum_{j=1}^{n} D_j\,\frac{\tilde{\gamma}'Z_j\,f(\tilde{\gamma}'Z_j)}{F(\tilde{\gamma}'Z_j)} + \frac{1}{m}\sum_{j=1}^{n} D_j\,\frac{f(\tilde{\gamma}'Z_j)^2}{F(\tilde{\gamma}'Z_j)^2}\right)$$
$$= \frac{1}{m}\sum_{j=1}^{n} D_j\tilde{v}_j^2 + \tilde{\lambda}^2\,\frac{1}{m}\sum_{j=1}^{n} D_j\,\frac{\left(\tilde{\gamma}'Z_j\,F(\tilde{\gamma}'Z_j) + f(\tilde{\gamma}'Z_j)\right)f(\tilde{\gamma}'Z_j)}{F(\tilde{\gamma}'Z_j)^2},$$
where
$$m = \sum_{j=1}^{n} D_j.$$
Note that $\tilde{\sigma}^2 \ge \frac{1}{m}\sum_{j=1}^{n} D_j\tilde{v}_j^2 \ge 0$, because
$$\inf_u\left[uF(u) + f(u)\right] = \lim_{u\to-\infty}\left[uF(u) + f(u)\right] = 0.$$
Finally, $\rho$ can be estimated by
$$\tilde{\rho} = \tilde{\lambda}\big/\sqrt{\tilde{\sigma}^2}.$$
Let $\tilde{\theta} = (\tilde{\beta}', \tilde{\gamma}', \tilde{\sigma}, \tilde{\rho})'$. Under some regularity conditions (one of them is that $m/n \to p \in (0,1)$ as $n \to \infty$) it can be shown that
$$\tilde{\theta} - \theta_0 = O_p\left(1/\sqrt{n}\right),$$
where $\theta_0$ is the true parameter vector.
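This whole recipe can be sketched in a few lines (my own illustration, not EasyReg's actual code; it reuses the `loglik` function sketched in Section 3): a probit step for $\tilde{\gamma}$, the Mills-ratio regression for $\tilde{\beta}$ and $\tilde{\lambda}$, the variance correction for $\tilde{\sigma}^2$, and finally Nelder-Mead on (16) starting from $\tilde{\theta}$.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def probit_loglik(g, D, Z):
    gZ = Z @ g
    return np.sum(D * norm.logcdf(gZ) + (1 - D) * norm.logcdf(-gZ))

def start_values(D, Y, X, Z):
    sel = D == 1
    # Step 1: probit estimate of gamma.
    g0 = np.zeros(Z.shape[1])
    gt = minimize(lambda g: -probit_loglik(g, D, Z), g0, method="BFGS").x
    # Step 2: OLS of Y_j on X_j and f(g'Z_j)/F(g'Z_j) over D_j = 1.
    mills = norm.pdf(Z[sel] @ gt) / norm.cdf(Z[sel] @ gt)
    A = np.column_stack([X[sel], mills])
    coef = np.linalg.lstsq(A, Y[sel], rcond=None)[0]
    bt, lt = coef[:-1], coef[-1]
    v = Y[sel] - A @ coef                     # residuals v_j
    # Step 3: the variance correction suggested by (14).
    gZ = Z[sel] @ gt
    s2 = np.mean(v**2) + lt**2 * np.mean((gZ + mills) * mills)
    return np.concatenate([bt, gt, [np.sqrt(s2), lt / np.sqrt(s2)]])

# Step 4: simplex (Nelder-Mead) maximization of (16) from theta-tilde:
# theta0 = start_values(D, Y, X, Z)
# fit = minimize(lambda th: -loglik(th, D, Y, X, Z), theta0,
#                method="Nelder-Mead")
```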
5 Appendix 1: Products of normal densities

Let $f(x)$ be the standard normal density. Then
$$f(a + bx)f(x) = \frac{\exp\left[-\frac{1}{2}(a + bx)^2 - \frac{1}{2}x^2\right]}{2\pi} = \frac{\exp\left[-\frac{1}{2}\left(a^2 + 2abx + (1 + b^2)x^2\right)\right]}{2\pi}$$
$$= \frac{\exp\left[-\frac{1}{2}\left(\frac{a^2}{1+b^2} + 2\frac{ab}{1+b^2}x + x^2\right)(1 + b^2)\right]}{2\pi}$$
$$= \frac{\exp\left[-\frac{1}{2}\left(\left(\frac{ab}{1+b^2}\right)^2 + 2\frac{ab}{1+b^2}x + x^2\right)(1 + b^2)\right]}{2\pi}\exp\left[-\frac{1}{2}\left(\frac{a^2}{1+b^2} - \left(\frac{ab}{1+b^2}\right)^2\right)(1 + b^2)\right]$$
$$= \frac{\exp\left[-\frac{1}{2}\left(x + \frac{ab}{1+b^2}\right)^2\Big/\frac{1}{1+b^2}\right]}{\frac{1}{\sqrt{1+b^2}}\sqrt{2\pi}}\cdot\frac{\exp\left[-\frac{1}{2}\,\frac{a^2}{1+b^2}\right]}{\sqrt{1+b^2}\,\sqrt{2\pi}}$$
$$= f\left(\left(x + \frac{ab}{1+b^2}\right)\Big/\frac{1}{\sqrt{1+b^2}}\right) f\left(a\big/\sqrt{1+b^2}\right).$$
Hence:
$$\int_{-\infty}^{c} f(a + bx)f(x)\,dx \qquad (20)$$
$$= \int_{-\infty}^{c} f\left(\left(x + \frac{ab}{1+b^2}\right)\Big/\frac{1}{\sqrt{1+b^2}}\right) dx\ f\left(a\big/\sqrt{1+b^2}\right)$$
$$= \int_{-\infty}^{c + ab/(1+b^2)} f\left(u\sqrt{1+b^2}\right) du\ f\left(a\big/\sqrt{1+b^2}\right)$$
$$= \int_{-\infty}^{c + ab/(1+b^2)} f\left(u\sqrt{1+b^2}\right)\sqrt{1+b^2}\,du\ \frac{f\left(a/\sqrt{1+b^2}\right)}{\sqrt{1+b^2}}$$
$$= F\left(c\sqrt{1+b^2} + \frac{ab}{\sqrt{1+b^2}}\right)\frac{f\left(a/\sqrt{1+b^2}\right)}{\sqrt{1+b^2}}.$$
This result proves (11).

Setting $c = \infty$ in (20) it follows that
$$\int_{-\infty}^{\infty} f(a + bx)f(x)\,dx = \frac{f\left(a/\sqrt{1+b^2}\right)}{\sqrt{1+b^2}}, \qquad (21)$$
hence
$$\int_{c}^{\infty} f(a + bx)f(x)\,dx = \int_{-\infty}^{\infty} f(a + bx)f(x)\,dx - \int_{-\infty}^{c} f(a + bx)f(x)\,dx$$
$$= \left[1 - F\left(c\sqrt{1+b^2} + \frac{ab}{\sqrt{1+b^2}}\right)\right]\frac{f\left(a/\sqrt{1+b^2}\right)}{\sqrt{1+b^2}}.$$
This result proves (9).
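Both closed forms are easy to confirm by numerical quadrature; a small check of my own, with arbitrary test values of $a$, $b$, $c$:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

a, b, c = 0.7, -1.3, 0.4
s = np.sqrt(1 + b**2)

lhs, _ = quad(lambda x: norm.pdf(a + b*x) * norm.pdf(x), -np.inf, c)
rhs = norm.cdf(c*s + a*b/s) * norm.pdf(a/s) / s
print(lhs, rhs)    # equal up to quadrature error: results (20)/(11)

lhs2, _ = quad(lambda x: norm.pdf(a + b*x) * norm.pdf(x), c, np.inf)
rhs2 = (1 - norm.cdf(c*s + a*b/s)) * norm.pdf(a/s) / s
print(lhs2, rhs2)  # result (9)
```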
6 Appendix 2: The conditional moment generating function and its derivatives

In order to derive the conditional expectation $E[Y|D=1,X,Z]$ and the conditional variance $Var[Y|D=1,X,Z]$ we now compute the moment generating function of the conditional density $h(y|X,Z,\beta,\gamma,\sigma,\rho)$:
$$m(t|X,Z,\beta,\gamma,\sigma,\rho) = \int_{-\infty}^{\infty}\exp(ty)\,h(y|X,Z,\beta,\gamma,\sigma,\rho)\,dy$$
$$= \int_{-\infty}^{\infty}\exp(ty)\,\frac{f\left((y - \beta'X)/\sigma\right)}{\sigma F(\gamma'Z)}\,F\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)dy$$
$$= \frac{\exp(t\beta'X)}{F(\gamma'Z)}\int_{-\infty}^{\infty}\exp(t\sigma u)\,f(u)\,F\left(\frac{\rho u + \gamma'Z}{\sqrt{1 - \rho^2}}\right)du$$
$$= \frac{\exp(t\beta'X + \sigma^2 t^2/2)}{F(\gamma'Z)}\int_{-\infty}^{\infty} f(u - t\sigma)\,F\left(\frac{\rho u + \gamma'Z}{\sqrt{1 - \rho^2}}\right)du$$
$$= \frac{\exp(t\beta'X + \sigma^2 t^2/2)}{F(\gamma'Z)}\int_{-\infty}^{\infty} f(u)\,F\left(\frac{\rho u + \rho t\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)du.$$
The fourth equality follows from
$$\exp(t\sigma u)\,f(u) = \frac{\exp\left[-\left(t^2\sigma^2 - 2t\sigma u + u^2\right)/2\right]}{\sqrt{2\pi}}\exp\left(t^2\sigma^2/2\right) = \exp\left(t^2\sigma^2/2\right)f(u - t\sigma).$$
Thus,
$$\frac{\partial m(t|X,Z,\beta,\gamma,\sigma,\rho)}{\partial t} \qquad (22)$$
$$= (\beta'X + t\sigma^2)\,\frac{\exp(t\beta'X + \sigma^2 t^2/2)}{F(\gamma'Z)}\int_{-\infty}^{\infty} f(u)\,F\left(\frac{\rho u + \rho t\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)du$$
$$+ \frac{\exp(t\beta'X + \sigma^2 t^2/2)}{F(\gamma'Z)}\,\frac{\rho\sigma}{\sqrt{1 - \rho^2}}\int_{-\infty}^{\infty} f(u)\,f\left(\frac{\rho u + \rho t\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)du$$
$$= (\beta'X + t\sigma^2)\,m(t|X,Z,\beta,\gamma,\sigma,\rho) + \rho\sigma\exp(t\beta'X + \sigma^2 t^2/2)\,\frac{f(\rho t\sigma + \gamma'Z)}{F(\gamma'Z)}.$$
The last equality follows from (20) with $c = \infty$, $a = (\rho t\sigma + \gamma'Z)/\sqrt{1 - \rho^2}$, $b = \rho/\sqrt{1 - \rho^2}$:
$$\int_{-\infty}^{\infty} f(u)\,f\left(\frac{\rho u + \rho t\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)du = \sqrt{1 - \rho^2}\,f(\rho t\sigma + \gamma'Z).$$
Moreover, it follows from (22) and the easy equality $f'(u) = -uf(u)$ that
$$\frac{\partial^2 m(t|X,Z,\beta,\gamma,\sigma,\rho)}{(\partial t)^2} \qquad (23)$$
$$= \sigma^2 m(t|X,Z,\beta,\gamma,\sigma,\rho) + (\beta'X + t\sigma^2)\,\frac{\partial m(t|X,Z,\beta,\gamma,\sigma,\rho)}{\partial t}$$
$$+ (\beta'X + t\sigma^2)\,\rho\sigma\exp(t\beta'X + \sigma^2 t^2/2)\,\frac{f(\rho t\sigma + \gamma'Z)}{F(\gamma'Z)}$$
$$- \rho^2\sigma^2(\rho t\sigma + \gamma'Z)\exp(t\beta'X + \sigma^2 t^2/2)\,\frac{f(\rho t\sigma + \gamma'Z)}{F(\gamma'Z)}.$$
Hence
$$E[Y|D=1,X,Z] = \frac{\partial m(t|X,Z,\beta,\gamma,\sigma,\rho)}{\partial t}\bigg|_{t=0} = \beta'X + \rho\sigma\,\frac{f(\gamma'Z)}{F(\gamma'Z)},$$
$$E\left[Y^2|D=1,X,Z\right] = \frac{\partial^2 m(t|X,Z,\beta,\gamma,\sigma,\rho)}{(\partial t)^2}\bigg|_{t=0}$$
$$= \sigma^2 + \left(\beta'X + \rho\sigma\,\frac{f(\gamma'Z)}{F(\gamma'Z)}\right)\beta'X + \rho\sigma\beta'X\,\frac{f(\gamma'Z)}{F(\gamma'Z)} - \rho^2\sigma^2(\gamma'Z)\,\frac{f(\gamma'Z)}{F(\gamma'Z)}$$
$$= \sigma^2 + (\beta'X)^2 + 2\rho\sigma\beta'X\,\frac{f(\gamma'Z)}{F(\gamma'Z)} - \rho^2\sigma^2(\gamma'Z)\,\frac{f(\gamma'Z)}{F(\gamma'Z)}$$
$$= \sigma^2 + \left(\beta'X + \rho\sigma\,\frac{f(\gamma'Z)}{F(\gamma'Z)}\right)^2 - \rho^2\sigma^2(\gamma'Z)\,\frac{f(\gamma'Z)}{F(\gamma'Z)} - \rho^2\sigma^2\,\frac{f(\gamma'Z)^2}{F(\gamma'Z)^2},$$
and thus
$$Var[Y|D=1,X,Z] = E\left[Y^2|D=1,X,Z\right] - \left(E[Y|D=1,X,Z]\right)^2 = \sigma^2 - \rho^2\sigma^2(\gamma'Z)\,\frac{f(\gamma'Z)}{F(\gamma'Z)} - \rho^2\sigma^2\,\frac{f(\gamma'Z)^2}{F(\gamma'Z)^2}.$$
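These two moments can be checked against a brute-force simulation of the selected subpopulation; a sketch of my own with arbitrary values of $\beta'X$, $\gamma'Z$, $\sigma$ and $\rho$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
bX, gZ, sigma, rho = 0.5, 0.3, 1.5, 0.6

e1, e2 = rng.normal(size=(2, 2_000_000))
Y = bX + sigma*np.sqrt(1 - rho**2)*e1 + sigma*rho*e2  # Y1* via sigma1, sigma2
Ysel = Y[e2 > -gZ]                                    # condition on D = 1

mills = norm.pdf(gZ) / norm.cdf(gZ)
print(Ysel.mean(), bX + rho*sigma*mills)                          # (13)
print(Ysel.var(), sigma**2 - rho**2*sigma**2*(gZ + mills)*mills)  # (14)
```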
7 Appendix 3: The score vector

In order to derive the score vector $\partial\ln L(\theta)/\partial\theta$, I will derive the first-order partial derivatives of
$$\ln h(y|X,Z,\beta,\gamma,\sigma,\rho) = \ln f\left((y - \beta'X)/\sigma\right) + \ln F\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right) - \ln F(\gamma'Z) - \ln\sigma. \qquad (24)$$
Using the easy equality $f'(u) = -uf(u)$, it follows from (24) that
$$\frac{\partial\ln h(y|X,Z,\beta,\gamma,\sigma,\rho)}{\partial\beta} = -\left((y - \beta'X)/\sigma\right)\frac{\partial\left((y - \beta'X)/\sigma\right)}{\partial\beta} + \frac{f\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)}{F\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)}\,\frac{\partial}{\partial\beta}\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)$$
$$= \frac{1}{\sigma}\left(\frac{y - \beta'X}{\sigma} - \frac{f\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)}{F\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)}\,\frac{\rho}{\sqrt{1 - \rho^2}}\right)X$$
$$= \frac{1}{\sigma}\left(\frac{y - \beta'X}{\sigma} - g\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)\frac{\rho}{\sqrt{1 - \rho^2}}\right)X,$$
where
$$g(u) = \frac{f(u)}{F(u)}. \qquad (25)$$
Moreover, using the notation (25) it is easy to verify from (24) that
$$\frac{\partial\ln h(y|X,Z,\beta,\gamma,\sigma,\rho)}{\partial\gamma} = \left[\frac{1}{\sqrt{1 - \rho^2}}\,g\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right) - g(\gamma'Z)\right]Z,$$
$$\frac{\partial\ln h(y|X,Z,\beta,\gamma,\sigma,\rho)}{\partial\rho} = g\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)\frac{\partial}{\partial\rho}\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)$$
$$= g\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)\frac{(y - \beta'X)/\sigma}{\sqrt{1 - \rho^2}} + g\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)\frac{\rho\left(\rho(y - \beta'X)/\sigma + \gamma'Z\right)}{(1 - \rho^2)\sqrt{1 - \rho^2}}$$
$$= g\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)\frac{(y - \beta'X)/\sigma + \rho\gamma'Z}{(1 - \rho^2)\sqrt{1 - \rho^2}}$$
and
$$\frac{\partial\ln h(y|X,Z,\beta,\gamma,\sigma,\rho)}{\partial\sigma} = -\left((y - \beta'X)/\sigma\right)\frac{\partial\left((y - \beta'X)/\sigma\right)}{\partial\sigma} + g\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)\frac{\partial}{\partial\sigma}\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right) - \frac{1}{\sigma}$$
$$= \frac{1}{\sigma}\left[\left((y - \beta'X)/\sigma\right)^2 - 1 - \frac{\rho}{\sqrt{1 - \rho^2}}\left((y - \beta'X)/\sigma\right) g\left(\frac{\rho(y - \beta'X)/\sigma + \gamma'Z}{\sqrt{1 - \rho^2}}\right)\right].$$
Given these partial derivatives, the results (18) and (19) follow.
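As a final check, the analytic derivatives above can be compared with numerical differentiation of $\ln h$; a sketch of my own for the two scalar parameters, at arbitrary point values:

```python
import numpy as np
from scipy.stats import norm

def lnh(y, bX, gZ, sigma, rho):
    """ln h from (24); bX = beta'X, gZ = gamma'Z."""
    u = (y - bX) / sigma
    A = (rho * u + gZ) / np.sqrt(1 - rho**2)
    return norm.logpdf(u) + norm.logcdf(A) - norm.logcdf(gZ) - np.log(sigma)

def g(u):                        # g(u) = f(u)/F(u), notation (25)
    return norm.pdf(u) / norm.cdf(u)

y, bX, gZ, sigma, rho = 1.1, 0.4, 0.3, 1.5, 0.6
u = (y - bX) / sigma
A = (rho * u + gZ) / np.sqrt(1 - rho**2)

d_sigma = (u**2 - 1 - rho / np.sqrt(1 - rho**2) * u * g(A)) / sigma
d_rho = g(A) * (u + rho * gZ) / (1 - rho**2)**1.5

eps = 1e-6
num_s = (lnh(y, bX, gZ, sigma + eps, rho)
         - lnh(y, bX, gZ, sigma - eps, rho)) / (2 * eps)
num_r = (lnh(y, bX, gZ, sigma, rho + eps)
         - lnh(y, bX, gZ, sigma, rho - eps)) / (2 * eps)
print(d_sigma, num_s)   # should agree to ~1e-8
print(d_rho, num_r)
```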