
Speech Recognition

Lecture 6: Statistical Language Models - Maximum Entropy Models

Mehryar Mohri
Courant Institute of Mathematical Sciences
mohri@cims.nyu.edu
This Lecture
Relative entropy
Maximum entropy models
Training algorithm
Duality theorem



Entropy
• Definition: the entropy of a random variable X with probability distribution p(x) = Pr[X = x] is
  H(X) = −E[log p(X)] = −Σ_x p(x) log p(x).
• Properties:
  • Measure of the uncertainty of p(x).
  • H(X) ≥ 0.
  • Maximal for the uniform distribution. For a finite support of size N, by Jensen's inequality:
    H(X) = E[log 1/p(X)] ≤ log E[1/p(X)] = log N.
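
The following minimal NumPy sketch (not from the original slides; the distributions are illustrative) computes the entropy of a finite distribution and checks the log N bound attained by the uniform distribution:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

N = 8
print(entropy(np.full(N, 1.0 / N)), np.log(N))    # uniform: attains log N
print(entropy([0.7, 0.1, 0.1, 0.1, 0, 0, 0, 0]))  # strictly below log N
```
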
Relative Entropy
• Definition: the relative entropy (or Kullback-Leibler divergence) of two distributions p and q is
  D(p ‖ q) = E_p[log (p(X)/q(X))] = Σ_x p(x) log (p(x)/q(x)),
  with the conventions 0 log (0/q) = 0 and p log (p/0) = ∞.
• Properties:
  • Asymmetric measure of deviation between two distributions. It is convex in p and q.
  • D(p ‖ q) ≥ 0 for all p and q.
  • D(p ‖ q) = 0 iff p = q.
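
A minimal sketch (not from the slides; toy distributions) implementing this definition with the 0 log 0 and p log(p/0) conventions, and illustrating the asymmetry:

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x)/q(x)), with 0 log(0/q) = 0 and p log(p/0) = inf."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue              # 0 log(0/q) = 0
        if qi == 0.0:
            return np.inf         # p log(p/0) = infinity
        total += pi * np.log(pi / qi)
    return total

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, q), kl_divergence(q, p))  # asymmetric; both values are >= 0
```
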
Jensen's Inequality
• Theorem: let X be a random variable and f a measurable convex function. Then,
  f(E[X]) ≤ E[f(X)].
• Proof:
  • For a distribution over a finite set, the property follows directly from the definition of convexity.
  • The general case is a consequence of the continuity of convex functions and the density of finite distributions.


Applications
• Non-negativity of relative entropy: by Jensen's inequality applied to the concave function log,
  −D(p ‖ q) = E_p[log (q(X)/p(X))]
            ≤ log E_p[q(X)/p(X)]
            = log Σ_x q(x) = 0.


Hoeffding's Bounds
• Theorem: let X_1, X_2, ..., X_m be a sequence of independent Bernoulli trials taking values in [0, 1]. Then, for all ε > 0, the following inequalities hold for X̄_m = (1/m) Σ_{i=1}^m X_i:
  Pr[X̄_m − E[X̄_m] ≥ ε] ≤ e^{−2mε²}.
  Pr[X̄_m − E[X̄_m] ≤ −ε] ≤ e^{−2mε²}.
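
A quick Monte Carlo check of the first bound (not from the slides; the Bernoulli parameter, sample size, and deviation ε are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps, trials = 100, 0.1, 20_000
p_true = 0.3                                       # Bernoulli parameter (illustrative)

means = rng.binomial(1, p_true, size=(trials, m)).mean(axis=1)   # repeated draws of X_bar_m
empirical = np.mean(means - p_true >= eps)         # observed frequency of the deviation
bound = np.exp(-2 * m * eps**2)                    # Hoeffding bound e^{-2 m eps^2}

print(empirical, bound)                            # the frequency stays below the bound
```
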


This Lecture
Relative entropy
Maximum entropy models
Training algorithm
Duality theorem



Problem
Data: sample drawn i.i.d. from a set X according to some distribution D,
x_1, ..., x_m ∈ X.
Problem: find the distribution p out of a set P that best estimates D.


Maximum Likelihood
Likelihood: probability of observing the sample under distribution p ∈ P, which, given the independence assumption, is
Pr[x_1, ..., x_m] = Π_{i=1}^m p(x_i).
Principle: select the distribution maximizing the sample probability,
p* = argmax_{p∈P} Π_{i=1}^m p(x_i),
or equivalently
p* = argmax_{p∈P} Σ_{i=1}^m log p(x_i).
Relative Entropy Formulation
Empirical distribution: the distribution p̂ that assigns to each point the frequency of its occurrence in the sample.
Lemma: p* has maximum likelihood l(p*) iff p* = argmin_{p∈P} D(p̂ ‖ p).
Proof: with N_0 the sample size and count(z_i) the number of occurrences of z_i,
D(p̂ ‖ p) = Σ_{z_i observed} p̂(z_i) log p̂(z_i) − Σ_{z_i observed} p̂(z_i) log p(z_i)
          = −H(p̂) − Σ_{z_i} (count(z_i)/N_0) log p(z_i)
          = −H(p̂) − (1/N_0) log Π_{z_i} p(z_i)^{count(z_i)}
          = −H(p̂) − (1/N_0) l(p).
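
A small numerical check of the lemma (not from the slides; the sample and candidate distributions are illustrative): maximizing the log-likelihood over candidates is the same as minimizing D(p̂ ‖ p).

```python
import numpy as np
from collections import Counter

sample = ["a", "b", "a", "c", "a", "b"]                # toy sample
support = ["a", "b", "c"]
m = len(sample)
counts = Counter(sample)
p_hat = np.array([counts[x] / m for x in support])     # empirical distribution

def log_likelihood(p):
    return sum(np.log(p[support.index(x)]) for x in sample)            # l(p)

def kl(p, q):
    return sum(pi * np.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

candidates = [p_hat, np.full(3, 1 / 3), np.array([0.6, 0.2, 0.2])]
for p in candidates:
    print(log_likelihood(p), kl(p_hat, p))
# The candidate with the largest log-likelihood has the smallest D(p_hat || p),
# as expected from D(p_hat || p) = -H(p_hat) - (1/N_0) l(p).
```
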
Maximum A Posteriori (MAP)
Principle: select the most likely hypothesis h ∈ H given the sample S, with some prior distribution Pr[h] over the hypotheses:
h* = argmax_{h∈H} Pr[h | S]
   = argmax_{h∈H} Pr[S | h] Pr[h] / Pr[S]
   = argmax_{h∈H} Pr[S | h] Pr[h].
Note: for a uniform prior, MAP coincides with maximum likelihood.
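
A toy MAP computation (not from the slides; hypotheses, prior, and sample are illustrative): three candidate coin biases with a non-uniform prior.

```python
import numpy as np

hypotheses = [0.3, 0.5, 0.7]                 # candidate coin biases h
prior = np.array([0.2, 0.6, 0.2])            # Pr[h]
sample = [1, 1, 0, 1, 1, 0, 1]               # observed flips S

def likelihood(h, S):
    return np.prod([h if x == 1 else 1.0 - h for x in S])   # Pr[S | h]

scores = np.array([likelihood(h, sample) for h in hypotheses])
h_map = hypotheses[int(np.argmax(scores * prior))]   # argmax Pr[S | h] Pr[h]
h_ml = hypotheses[int(np.argmax(scores))]            # argmax Pr[S | h]
print(h_map, h_ml)   # with a uniform prior, the two would coincide
```
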


Problem
Data: sample drawn i.i.d. from a set X according to some distribution D,
x_1, ..., x_m ∈ X.
Features: mappings associated to the elements of X,
f_1, ..., f_n : X → R.
Problem: how do we estimate the distribution D? Uniform distribution u over X?


Features
Examples:
• n-grams, distance-d n-grams.
• class-based n-grams, word triggers.
• sentence length.
• number and type of verbs.
• various grammatical information (e.g., agreement, POS tags).
• dialog-level information.
Maximum Entropy Principle
(E. T. Jaynes, 1957, 1983)
For large m, we can give a fairly good estimate of the expected values of the features:
E_D[f_j] ≈ (1/m) Σ_{i=1}^m f_j(x_i), j = 1, ..., n.
Find the distribution that is closest to the uniform distribution u and that preserves the expected values of the features.
Closeness is measured using relative entropy (or Kullback-Leibler divergence).


Maximum Entropy Formulation
Distributions: let P denote the set of distributions
P = { p : Σ_{x∈X} p(x) f_j(x) = (1/m) Σ_{i=1}^m f_j(x_i), j = 1, ..., n }.
Optimization problem: find the distribution p* verifying
p* = argmin_{p∈P} D(p ‖ u).


Relation with Entropy Maximization
Relationship with entropy:
D(p ‖ u) = Σ_{x∈X} p(x) log (p(x) / (1/|X|))
         = log |X| + Σ_{x∈X} p(x) log p(x)
         = log |X| − H(p).
Optimization problem:
minimize   Σ_{x∈X} p(x) log p(x)
subject to Σ_{x∈X} p(x) = 1,
           Σ_{x∈X} p(x) f_j(x) = (1/m) Σ_{i=1}^m f_j(x_i), j = 1, ..., n.
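
A small numerical illustration (not from the slides; the set X, the features, and the sample are toy choices) solving this constrained entropy-maximization problem directly with SciPy's SLSQP solver:

```python
import numpy as np
from scipy.optimize import minimize

X = np.arange(6)                                        # toy X = {0, ..., 5}
features = np.stack([(X % 2).astype(float), (X >= 3).astype(float)])  # f_1, f_2
sample = np.array([1, 3, 3, 4, 0, 5, 2, 3])             # i.i.d. sample from X
targets = features[:, sample].mean(axis=1)              # (1/m) sum_i f_j(x_i)

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)
    return float(np.sum(p * np.log(p)))                 # sum_x p(x) log p(x)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},          # sum_x p(x) = 1
    {"type": "eq", "fun": lambda p: features @ p - targets}, # E_p[f_j] = empirical
]
p0 = np.full(len(X), 1.0 / len(X))
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * len(X),
               constraints=constraints, method="SLSQP")
print(res.x)                                            # maximum entropy solution
print(features @ res.x, targets)                        # constraints (approx.) met
```
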


Maximum Likelihood Gibbs Distrib.
Gibbs distributions: set Q of distributions defined by
p(x) = (1/Z) exp(λ · f(x)) = (1/Z) exp(Σ_{j=1}^n λ_j f_j(x)),
with Z = Σ_x exp(λ · f(x)).
Maximum likelihood Gibbs distribution:
p* = argmax_{q∈Q̄} Σ_{i=1}^m log q(x_i) = argmin_{q∈Q̄} D(p̂ ‖ q),
where Q̄ is the closure of Q.
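
A minimal sketch of a member of the Gibbs family Q on a finite toy set (illustrative features and weights; the max-subtraction is only for numerical stability):

```python
import numpy as np

def gibbs_distribution(lam, features):
    """p_lambda(x) = exp(lam . f(x)) / Z on a finite set X.

    features has shape (n_features, |X|); column x holds f(x)."""
    scores = lam @ features                 # lam . f(x) for every x
    w = np.exp(scores - scores.max())       # stabilize before normalizing
    return w / w.sum()                      # divide by Z

X = np.arange(6)
features = np.stack([(X % 2).astype(float), (X >= 3).astype(float)])
lam = np.array([0.5, -1.0])
p_lam = gibbs_distribution(lam, features)
print(p_lam, p_lam.sum())                   # a member of the family Q
```
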
Duality Theorem
(Della Pietra et al., 1997)
Theorem: assume that D(p̂ ‖ u) < ∞. Then, there exists a unique probability distribution p* satisfying
1. p* ∈ P ∩ Q̄;
2. D(p ‖ q) = D(p ‖ p*) + D(p* ‖ q) for any p ∈ P and q ∈ Q̄ (Pythagorean equality);
3. p* = argmin_{q∈Q̄} D(p̂ ‖ q) (maximum likelihood);
4. p* = argmin_{p∈P} D(p ‖ u) (maximum entropy).
Each of these properties determines p* uniquely.
Regularization
Overfitting:
• Features with low counts;
• Very large feature weights.
Gaussian priors: penalize large feature weights (weight prior).
p̃(x) ← p_λ(x) Pr[λ]
l̃(λ) ← l(λ) − Σ_{j=1}^n λ_j²/(2σ_j²) − (1/2) Σ_{j=1}^n log(2πσ_j²),
with Pr[λ] = Π_{j=1}^n (1/√(2πσ_j²)) exp(−λ_j²/(2σ_j²)).
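
A minimal sketch of the penalized objective l̃(λ) on the toy Gibbs family used above (illustrative names and data; the constant term in σ_j is dropped since it does not depend on λ):

```python
import numpy as np

def penalized_log_likelihood(lam, features, sample_idx, sigma2):
    """l~(lambda) = l(lambda) - sum_j lambda_j^2 / (2 sigma_j^2); the constant
    -(1/2) sum_j log(2 pi sigma_j^2) is dropped as it does not depend on lambda."""
    scores = lam @ features
    log_Z = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()
    log_p = scores - log_Z                         # log p_lambda(x) for every x in X
    ll = log_p[sample_idx].sum()                   # l(lambda) over the sample
    penalty = np.sum(lam ** 2 / (2.0 * sigma2))    # Gaussian-prior penalty
    return ll - penalty

X = np.arange(6)
features = np.stack([(X % 2).astype(float), (X >= 3).astype(float)])
sample_idx = np.array([1, 3, 3, 4, 0, 5, 2, 3])    # sample given as indices into X
sigma2 = np.array([1.0, 1.0])                      # prior variances sigma_j^2
print(penalized_log_likelihood(np.array([0.5, -1.0]), features, sample_idx, sigma2))
```
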
This Lecture
Relative entropy
Maximum entropy models
Training algorithm
Duality theorem



Training Algorithm
Convex function, but no closed-form solution.
Generalized iterative scaling (GIS).
Improved iterative scaling (IIS).
Second-order convex optimization methods.



Generalized Iterative Scaling (GIS)
(Darroch and Ratcliff, 1972)
Requirement: add a feature function f_{n+1} such that
∀j, f_j ≥ 0, and ∀i ∈ [1, m], Σ_{j=1}^{n+1} f_j(x_i) = C.
Parameter update:
λ_j^{(0)} ← arbitrary
λ_j^{(t+1)} ← λ_j^{(t)} + (1/C) log (E_p̂[f_j] / E_{p^{(t)}}[f_j]), ∀j ∈ [1, n + 1].
Convergence: slow for large C.
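
A compact GIS sketch on a finite toy set (not from the slides; X, features, and sample are illustrative; a slack feature is added so every point's features sum to C):

```python
import numpy as np

def gis(features, p_hat_feats, n_iters=200):
    """Generalized iterative scaling on a finite set X (minimal sketch).

    features: array (n, |X|) of non-negative feature values, one column per point of X.
    p_hat_feats: empirical feature expectations E_p_hat[f_j]."""
    C = features.sum(axis=0).max()
    slack = C - features.sum(axis=0)              # slack feature f_{n+1}: columns now sum to C
    F = np.vstack([features, slack])
    targets = np.append(p_hat_feats, C - p_hat_feats.sum())
    lam = np.zeros(F.shape[0])
    for _ in range(n_iters):
        scores = lam @ F
        p = np.exp(scores - scores.max())
        p /= p.sum()                              # current Gibbs distribution p^(t)
        model_feats = F @ p                       # E_{p^(t)}[f_j]
        lam += np.log(targets / model_feats) / C  # GIS update
    scores = lam @ F
    p = np.exp(scores - scores.max())
    return lam, p / p.sum()

X = np.arange(6)
features = np.stack([(X % 2).astype(float), (X >= 3).astype(float)])
sample = np.array([1, 3, 3, 4, 0, 5, 2, 3])
p_hat_feats = features[:, sample].mean(axis=1)
lam, p = gis(features, p_hat_feats)
print(features @ p, p_hat_feats)                  # model expectations match the empirical ones
```
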


Improved Iterative Scaling (IIS)
(Della Pietra et al., 1997)
Requirements: non-negative feature functions f_j and D(p̂ ‖ u) < ∞.
Parameter update:
λ_j^{(0)} ← arbitrary
λ_j^{(t+1)} ← λ_j^{(t)} + δ_j^{(t)}, ∀j ∈ [1, n + 1],
with Σ_{x∈X} p^{(t)}(x) f_j(x) e^{δ_j f_#(x)} = E_p̂[f_j],
and ∀x ∈ X, f_#(x) = Σ_{j=1}^n f_j(x).
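
A corresponding IIS sketch on the same toy problem (illustrative; each δ_j is obtained by solving its one-dimensional equation with a bracketing root finder):

```python
import numpy as np
from scipy.optimize import brentq

def iis(features, p_hat_feats, n_iters=100):
    """Improved iterative scaling on a finite set X (minimal sketch, non-negative features)."""
    n, _ = features.shape
    f_sharp = features.sum(axis=0)                # f_#(x) = sum_j f_j(x)
    lam = np.zeros(n)
    for _ in range(n_iters):
        scores = lam @ features
        p = np.exp(scores - scores.max())
        p /= p.sum()                              # p^(t)
        for j in range(n):
            def g(d):
                # sum_x p(x) f_j(x) exp(d f_#(x)) - E_p_hat[f_j]
                return np.sum(p * features[j] * np.exp(d * f_sharp)) - p_hat_feats[j]
            lam[j] += brentq(g, -50.0, 50.0)      # solve for delta_j^(t)
    return lam

X = np.arange(6)
features = np.stack([(X % 2).astype(float), (X >= 3).astype(float)])
sample = np.array([1, 3, 3, 4, 0, 5, 2, 3])
p_hat_feats = features[:, sample].mean(axis=1)
lam = iis(features, p_hat_feats)
scores = lam @ features
p = np.exp(scores - scores.max()); p /= p.sum()
print(features @ p, p_hat_feats)                  # constraints satisfied at convergence
```
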


IIS - Convergence Proof
Ideas:
• Use of auxiliary functions (as for the EM algorithm).
• Two known inequalities for convex functions:
  • ∀x > 0, −log(x) ≥ 1 − x.
  • exp(Σ_x p(x) α(x)) ≤ Σ_x p(x) exp(α(x)), which follows from Jensen's inequality.


IIS - Convergence Proof
Gibbs distribution:
p_λ(x) = (1/Z_λ) e^{λ·f(x)}.
Log-likelihood:
l(p_λ) = Σ_{i=1}^m log p_λ(x_i) = Σ_{i=1}^m [−log Z_λ + λ · f(x_i)].
Auxiliary function A(δ, p):
l(p_{λ+δ}) − l(p_λ) ≥ A(δ, p_λ),
∀p, A(0, p) = 0, and ∇A(0, p) = 0 iff p ∈ P.


IIS - Convergence Proof
Lower bound on the log-likelihood variation:
l(p_{λ+δ}) − l(p_λ) = Σ_{i=1}^m δ · f(x_i) − m log (Z_{λ+δ}/Z_λ)
  = m δ · E_p̂[f] − m log ( Σ_x e^{λ·f(x)} e^{δ·f(x)} / Z_λ )
  = m δ · E_p̂[f] − m log Σ_x p_λ(x) e^{δ·f(x)}
  ≥ m δ · E_p̂[f] + m − m Σ_x p_λ(x) e^{δ·f(x)}                                   (log inequality)
  = m δ · E_p̂[f] + m − m Σ_x p_λ(x) exp( Σ_{j=1}^n (f_j(x)/f_#(x)) δ_j f_#(x) )
  ≥ m δ · E_p̂[f] + m − m Σ_x p_λ(x) Σ_{j=1}^n (f_j(x)/f_#(x)) e^{δ_j f_#(x)}     (Jensen's inequality)
  = m [ δ · E_p̂[f] + 1 − Σ_x Σ_{j=1}^n p_λ(x) (f_j(x)/f_#(x)) e^{δ_j f_#(x)} ] =: A(δ, p_λ).
Maximizing the lower bound:
∂A/∂δ_j = E_p̂[f_j] − Σ_x p_λ(x) f_j(x) e^{δ_j f_#(x)} = 0,
where the second term is E_{p_λ}[f_j e^{δ_j f_#}].
IIS - Convergence Proof
Let (p^{(t)}) be the sequence of distributions defined by the algorithm. By Tychonov's theorem, the set of distributions over X is compact. Thus, this sequence admits a subsequence (p^{(t_k)}) converging to some p′. By definition of Q̄, p′ is in Q̄.
It remains to show that p′ is in P. By the definition of the maximization of A, for any δ ≥ 0,
A(δ, p^{(t_k)}) ≤ A(δ^{(t_k)}, p^{(t_k)}) ≤ L(p^{(t_{k+1})}) − L(p^{(t_k)}),
since L(p^{(t)}) is monotonically increasing.


IIS - Convergence Proof
Taking the limit and using the continuity of A yields
∀δ ≥ 0, A(δ, p′) ≤ 0.
But A(0, p) = 0 for all p, in particular for p = p′. Thus, A(δ, p′) is maximal at δ = 0, ∇A(0, p′) = 0, and p′ ∈ P.
Thus, p′ ∈ P ∩ Q̄. By the duality theorem, p′ = p*.
The uniqueness of the convergence point implies in fact that the sequence (p^{(t)}) is converging. Otherwise, infinitely many p^{(t)}s would fall outside an open ball centered at p*, and no subsequence of that infinite set could converge to p*.
Improved Iterative Scaling (IIS)
Binary features (f_j(x) ∈ {0, 1}):
• f_#(x) is the total number of features on.
• The equation defining δ_j is then a polynomial equation in e^{δ_j}.
• Newton's method to find the zeros of a function.
Monte-Carlo methods for large X.
Coincides with GIS when f_# = C.


Comparison
IIS is typically faster than GIS, particularly when C is large.
Second-order convex optimization methods (but require computing the inverse of a large Hessian).
First-order methods (conjugate gradient)? (Malouf, 2002)


Regularization
Overfitting:
• Features with low counts;
• Very large feature weights.
Gaussian priors: penalize large feature weights (weight prior).
p̃(x) ← p_λ(x) Pr[λ]
l̃(λ) ← l(λ) − Σ_{j=1}^n λ_j²/(2σ_j²) − (1/2) Σ_{j=1}^n log(2πσ_j²),
with Pr[λ] = Π_{j=1}^n (1/√(2πσ_j²)) exp(−λ_j²/(2σ_j²)).
IIS with Gaussian Priors
Auxiliary function:
Ã(δ) ← A(δ) − Σ_{j=1}^n [ (λ_j + δ_j)²/(2σ_j²) − λ_j²/(2σ_j²) ]
∂Ã/∂δ_j ← ∂A/∂δ_j − (λ_j + δ_j)/σ_j².
Parameter update:
λ_j^{(0)} ← arbitrary
λ_j^{(t+1)} ← λ_j^{(t)} + δ_j^{(t)}, ∀j ∈ [1, n + 1],
with Σ_{x∈X} p^{(t)}(x) f_j(x) e^{δ_j f_#(x)} + (λ_j + δ_j)/σ_j² = E_p̂[f_j].
This Lecture
Relative entropy
Maximum entropy models
Training algorithm
Duality theorem



Duality Theorem
Lemma 1: if D(p̂ ‖ u) < ∞, then P ∩ Q̄ ≠ ∅.
Proof: let Φ be the continuous function defined by
Φ : Q̄ → R, q ↦ D(p̂ ‖ q).
Since D(p̂ ‖ u) < ∞, the set S defined by
S = { q ∈ Q̄ : D(p̂ ‖ q) ≤ D(p̂ ‖ u) }
is closed and bounded. Thus Φ attains its minimum on S ⊆ Q̄, at a unique point p* because Φ is strictly convex (property of relative entropy).
Duality Theorem
It remains to show that p* is in P. Define
Ψ : R^n → R, λ ↦ D(p̂ ‖ q_{λ+λ*}),   with ∀x, q_λ(x) = (1/Z_λ) e^{λ·f(x)} and q_{λ*} = p*.
Ψ is continuously differentiable and attains its minimum at λ = 0. Thus, (∇Ψ)(0) = 0.
Ψ(λ) = constant + Σ_x p̂(x) [ log Σ_{x′} e^{(λ+λ*)·f(x′)} − (λ + λ*) · f(x) ]
∇Ψ(λ) = Σ_x p̂(x) [ Σ_{x′} e^{(λ+λ*)·f(x′)} f(x′) / Σ_{x′} e^{(λ+λ*)·f(x′)} − f(x) ]
       = E_{q_{λ+λ*}}[f] − E_p̂[f].
Setting λ = 0 thus gives E_{p*}[f] = E_p̂[f], that is, p* ∈ P.
Duality Theorem
Lemma 2: ∀(p, q) ∈ P × Q̄, D(p ‖ q) = D(p ‖ p*) + D(p* ‖ q).
Proof: ∀(p, q) ∈ P × Q̄, writing p*(x) = (1/Z_{λ*}) e^{λ*·f(x)} and q(x) = (1/Z_λ) e^{λ·f(x)},
D(p ‖ p*) + D(p* ‖ q)
= −H(p) − Σ_x p(x) log p*(x) − H(p*) − Σ_x p*(x) log q(x)
= −H(p) − Σ_x p(x) [−log Z_{λ*} + λ* · f(x)] + Σ_x p*(x) [−log Z_{λ*} + λ* · f(x)] − Σ_x p*(x) log q(x)
= −H(p) − Σ_x p*(x) log q(x)                       (E_p[f] = E_{p*}[f])
= −H(p) − Σ_x p*(x) [−log Z_λ + λ · f(x)]
= −H(p) − Σ_x p(x) [−log Z_λ + λ · f(x)]           (E_{p*}[f] = E_p[f])
= D(p ‖ q).
Note: ∀q ∈ Q̄, E_p[log q] = E_{p*}[log q].


Duality Theorem
Proof: by Lemma 1, there exists a unique p* verifying property 1. By Lemma 2, p* verifies property 2. Property 2 implies:
D(p̂ ‖ q) = D(p̂ ‖ p*) + D(p* ‖ q) ≥ D(p̂ ‖ p*), and
D(p ‖ u) = D(p ‖ p*) + D(p* ‖ u) ≥ D(p* ‖ u),
which give properties 3 and 4.
Each of the properties determines p* uniquely. For example, if p′ verifies property 1, then by Lemma 2,
∀(p, q) ∈ P × Q̄, D(p ‖ q) = D(p ‖ p′) + D(p′ ‖ q).
Thus, D(p* ‖ p*) = D(p* ‖ p′) + D(p′ ‖ p*) = 0, and p′ = p*. Idem if p′ verifies property 2.
Duality Theorem
If p′ verifies property 3, then D(p̂ ‖ p′) ≤ D(p̂ ‖ p*). Since D(p̂ ‖ p′) = D(p̂ ‖ p*) + D(p* ‖ p′), this implies D(p* ‖ p′) = 0, hence p′ = p*.
If p′ verifies property 4, then D(p′ ‖ u) ≤ D(p* ‖ u). Since D(p′ ‖ u) = D(p′ ‖ p*) + D(p* ‖ u), this implies D(p′ ‖ p*) = 0, hence p′ = p*.


Bregman Divergences
Definition: let F be a convex and differentiable function; the Bregman divergence based on F is defined as
∆_F(y, x) = F(y) − F(x) − (y − x) · ∇F(x).
Examples:
• Unnormalized relative entropy.
• Squared Euclidean distance.
[Figure: F(y) compared with the tangent value F(x) + (y − x) · ∇F(x) at x; ∆_F(y, x) is the gap between the two.]
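
A small sketch of the definition (illustrative vectors), checked on the two examples: F(x) = ‖x‖² gives the squared Euclidean distance, and F(x) = Σ_i x_i log x_i − x_i gives the unnormalized relative entropy:

```python
import numpy as np

def bregman(F, grad_F, y, x):
    """Delta_F(y, x) = F(y) - F(x) - (y - x) . grad F(x)."""
    return F(y) - F(x) - np.dot(y - x, grad_F(x))

# Squared Euclidean distance: F(x) = ||x||^2.
F_sq = lambda v: np.dot(v, v)
grad_sq = lambda v: 2.0 * v

# Unnormalized relative entropy: F(x) = sum_i x_i log x_i - x_i (x_i > 0).
F_ent = lambda v: np.sum(v * np.log(v) - v)
grad_ent = lambda v: np.log(v)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.3, 0.3, 0.4])
print(bregman(F_sq, grad_sq, y, x), np.sum((y - x) ** 2))   # equal
print(bregman(F_ent, grad_ent, y, x))                       # sum_i y_i log(y_i/x_i) - y_i + x_i
```
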
References
• Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996.
• J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39, 357-365, 1944.
• Imre Csiszar and Tusnady. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1, 205-237, 1984.
• Imre Csiszar. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3), pp. 1409-1413, 1989.
• J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5), pp. 1470-1480, 1972.
• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), pp. 380-393, April 1997.


References
• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS-01-109, School of Computer Science, CMU, 2001.
• E. Jaynes. Information theory and statistical mechanics. Physics Reviews, 106:620-630, 1957.
• E. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.
• O'Sullivan. Alternating minimization algorithms: from Blahut-Arimoto to expectation-maximization. In Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
• Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer, Speech and Language, 10:187-228, 1996.
