
Speech Recognition

Lecture 6: Statistical Language Models - Maximum Entropy Models

Mehryar Mohri
Courant Institute of Mathematical Sciences
mohri@cims.nyu.edu
This Lecture
Relative entropy
Maximum entropy models
Training algorithm
Duality theorem



Entropy
• Definition: the entropy of a random variable X with probability distribution p(x) = Pr[X = x] is
  H(X) = −E[log p(X)] = −Σ_x p(x) log p(x).
• Properties:
  • Measure of the uncertainty of p(x).
  • H(X) ≥ 0.
  • Maximal for the uniform distribution. For a finite support of size N, by Jensen's inequality:
    H(X) = E[log 1/p(X)] ≤ log E[1/p(X)] = log N.
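
The following minimal NumPy sketch (not from the original slides; the distributions are illustrative) computes the entropy of a finite distribution and checks the log N bound attained by the uniform distribution:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

N = 8
print(entropy(np.full(N, 1.0 / N)), np.log(N))    # uniform: attains log N
print(entropy([0.7, 0.1, 0.1, 0.1, 0, 0, 0, 0]))  # strictly below log N
```
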
Relative Entropy
• Definition: the relative entropy (or Kullback-Leibler divergence) of two distributions p and q is
  D(p ‖ q) = E_p[log (p(X)/q(X))] = Σ_x p(x) log (p(x)/q(x)),
  with the conventions 0 log (0/q) = 0 and p log (p/0) = ∞.
• Properties:
  • Asymmetric measure of deviation between two distributions. It is convex in p and q.
  • D(p ‖ q) ≥ 0 for all p and q.
  • D(p ‖ q) = 0 iff p = q.
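
A minimal sketch (not from the slides; toy distributions) implementing this definition with the 0 log 0 and p log(p/0) conventions, and illustrating the asymmetry:

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x)/q(x)), with 0 log(0/q) = 0 and p log(p/0) = inf."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue              # 0 log(0/q) = 0
        if qi == 0.0:
            return np.inf         # p log(p/0) = infinity
        total += pi * np.log(pi / qi)
    return total

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, q), kl_divergence(q, p))  # asymmetric; both values are >= 0
```
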
Jensen's Inequality
• Theorem: let X be a random variable and f a measurable convex function. Then,
  f(E[X]) ≤ E[f(X)].
• Proof:
  • For a distribution over a finite set, the property follows directly from the definition of convexity.
  • The general case is a consequence of the continuity of convex functions and the density of finite distributions.


Applications
• Non-negativity of relative entropy: by Jensen's inequality applied to the concave function log,
  −D(p ‖ q) = E_p[log (q(X)/p(X))]
            ≤ log E_p[q(X)/p(X)]
            = log Σ_x q(x) = 0.


Hoeffding's Bounds
• Theorem: let X_1, X_2, ..., X_m be a sequence of independent Bernoulli trials taking values in [0, 1]. Then, for all ε > 0, the following inequalities hold for X̄_m = (1/m) Σ_{i=1}^m X_i:
  Pr[X̄_m − E[X̄_m] ≥ ε] ≤ e^{−2mε²}.
  Pr[X̄_m − E[X̄_m] ≤ −ε] ≤ e^{−2mε²}.
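
A quick Monte Carlo check of the first bound (not from the slides; the Bernoulli parameter, sample size, and deviation ε are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps, trials = 100, 0.1, 20_000
p_true = 0.3                                       # Bernoulli parameter (illustrative)

means = rng.binomial(1, p_true, size=(trials, m)).mean(axis=1)   # repeated draws of X_bar_m
empirical = np.mean(means - p_true >= eps)         # observed frequency of the deviation
bound = np.exp(-2 * m * eps**2)                    # Hoeffding bound e^{-2 m eps^2}

print(empirical, bound)                            # the frequency stays below the bound
```
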


This Lecture
Relative entropy
Maximum entropy models
Training algorithm
Duality theorem



Problem
Data: sample drawn i.i.d. from a set X according to some distribution D,
x_1, ..., x_m ∈ X.
Problem: find the distribution p out of a set P that best estimates D.


Maximum Likelihood
Likelihood: probability of observing the sample under distribution p ∈ P, which, given the independence assumption, is
Pr[x_1, ..., x_m] = Π_{i=1}^m p(x_i).
Principle: select the distribution maximizing the sample probability,
p* = argmax_{p∈P} Π_{i=1}^m p(x_i),
or equivalently
p* = argmax_{p∈P} Σ_{i=1}^m log p(x_i).
Relative Entropy Formulation
Empirical distribution: the distribution p̂ that assigns to each point the frequency of its occurrence in the sample.
Lemma: p* has maximum likelihood l(p*) iff p* = argmin_{p∈P} D(p̂ ‖ p).
Proof: with N_0 the sample size and count(z_i) the number of occurrences of z_i,
D(p̂ ‖ p) = Σ_{z_i observed} p̂(z_i) log p̂(z_i) − Σ_{z_i observed} p̂(z_i) log p(z_i)
          = −H(p̂) − Σ_{z_i} (count(z_i)/N_0) log p(z_i)
          = −H(p̂) − (1/N_0) log Π_{z_i} p(z_i)^{count(z_i)}
          = −H(p̂) − (1/N_0) l(p).
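
A small numerical check of the lemma (not from the slides; the sample and candidate distributions are illustrative): maximizing the log-likelihood over candidates is the same as minimizing D(p̂ ‖ p).

```python
import numpy as np
from collections import Counter

sample = ["a", "b", "a", "c", "a", "b"]                # toy sample
support = ["a", "b", "c"]
m = len(sample)
counts = Counter(sample)
p_hat = np.array([counts[x] / m for x in support])     # empirical distribution

def log_likelihood(p):
    return sum(np.log(p[support.index(x)]) for x in sample)            # l(p)

def kl(p, q):
    return sum(pi * np.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

candidates = [p_hat, np.full(3, 1 / 3), np.array([0.6, 0.2, 0.2])]
for p in candidates:
    print(log_likelihood(p), kl(p_hat, p))
# The candidate with the largest log-likelihood has the smallest D(p_hat || p),
# as expected from D(p_hat || p) = -H(p_hat) - (1/N_0) l(p).
```
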
Maximum A Posteriori (MAP)
Principle: select the most likely hypothesis h ∈ H given the sample S, with some prior distribution Pr[h] over the hypotheses:
h* = argmax_{h∈H} Pr[h | S]
   = argmax_{h∈H} Pr[S | h] Pr[h] / Pr[S]
   = argmax_{h∈H} Pr[S | h] Pr[h].
Note: for a uniform prior, MAP coincides with maximum likelihood.
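
A toy MAP computation (not from the slides; hypotheses, prior, and sample are illustrative): three candidate coin biases with a non-uniform prior.

```python
import numpy as np

hypotheses = [0.3, 0.5, 0.7]                 # candidate coin biases h
prior = np.array([0.2, 0.6, 0.2])            # Pr[h]
sample = [1, 1, 0, 1, 1, 0, 1]               # observed flips S

def likelihood(h, S):
    return np.prod([h if x == 1 else 1.0 - h for x in S])   # Pr[S | h]

scores = np.array([likelihood(h, sample) for h in hypotheses])
h_map = hypotheses[int(np.argmax(scores * prior))]   # argmax Pr[S | h] Pr[h]
h_ml = hypotheses[int(np.argmax(scores))]            # argmax Pr[S | h]
print(h_map, h_ml)   # with a uniform prior, the two would coincide
```
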


Problem
Data: sample drawn i.i.d. from a set X according to some distribution D,
x_1, ..., x_m ∈ X.
Features: mappings associated to the elements of X,
f_1, ..., f_n : X → R.
Problem: how do we estimate the distribution D? Uniform distribution u over X?


Features
Examples:
• n-grams, distance-d n-grams.
• class-based n-grams, word triggers.
• sentence length.
• number and type of verbs.
• various grammatical information (e.g., agreement, POS tags).
• dialog-level information.
Maximum Entropy Principle
(E. T. Jaynes, 1957, 1983)
For large m, we can give a fairly good estimate of the expected values of the features:
E_D[f_j] ≈ (1/m) Σ_{i=1}^m f_j(x_i), j = 1, ..., n.
Find the distribution that is closest to the uniform distribution u and that preserves the expected values of the features.
Closeness is measured using relative entropy (or Kullback-Leibler divergence).


Maximum Entropy Formulation
Distributions: let P denote the set of distributions
P = { p : Σ_{x∈X} p(x) f_j(x) = (1/m) Σ_{i=1}^m f_j(x_i), j = 1, ..., n }.
Optimization problem: find the distribution p* verifying
p* = argmin_{p∈P} D(p ‖ u).


Relation with Entropy Maximization
Relationship with entropy:
D(p ‖ u) = Σ_{x∈X} p(x) log (p(x) / (1/|X|))
         = log |X| + Σ_{x∈X} p(x) log p(x)
         = log |X| − H(p).
Optimization problem:
minimize   Σ_{x∈X} p(x) log p(x)
subject to Σ_{x∈X} p(x) = 1,
           Σ_{x∈X} p(x) f_j(x) = (1/m) Σ_{i=1}^m f_j(x_i), j = 1, ..., n.
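
A small numerical illustration (not from the slides; the set X, the features, and the sample are toy choices) solving this constrained entropy-maximization problem directly with SciPy's SLSQP solver:

```python
import numpy as np
from scipy.optimize import minimize

X = np.arange(6)                                        # toy X = {0, ..., 5}
features = np.stack([(X % 2).astype(float), (X >= 3).astype(float)])  # f_1, f_2
sample = np.array([1, 3, 3, 4, 0, 5, 2, 3])             # i.i.d. sample from X
targets = features[:, sample].mean(axis=1)              # (1/m) sum_i f_j(x_i)

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)
    return float(np.sum(p * np.log(p)))                 # sum_x p(x) log p(x)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},          # sum_x p(x) = 1
    {"type": "eq", "fun": lambda p: features @ p - targets}, # E_p[f_j] = empirical
]
p0 = np.full(len(X), 1.0 / len(X))
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * len(X),
               constraints=constraints, method="SLSQP")
print(res.x)                                            # maximum entropy solution
print(features @ res.x, targets)                        # constraints (approx.) met
```
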


Maximum Likelihood Gibbs Distrib.
Gibbs distributions: set Q of distributions defined by
p(x) = (1/Z) exp(λ · f(x)) = (1/Z) exp(Σ_{j=1}^n λ_j f_j(x)),
with Z = Σ_x exp(λ · f(x)).
Maximum likelihood Gibbs distribution:
p* = argmax_{q∈Q̄} Σ_{i=1}^m log q(x_i) = argmin_{q∈Q̄} D(p̂ ‖ q),
where Q̄ is the closure of Q.
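
A minimal sketch of a member of the Gibbs family Q on a finite toy set (illustrative features and weights; the max-subtraction is only for numerical stability):

```python
import numpy as np

def gibbs_distribution(lam, features):
    """p_lambda(x) = exp(lam . f(x)) / Z on a finite set X.

    features has shape (n_features, |X|); column x holds f(x)."""
    scores = lam @ features                 # lam . f(x) for every x
    w = np.exp(scores - scores.max())       # stabilize before normalizing
    return w / w.sum()                      # divide by Z

X = np.arange(6)
features = np.stack([(X % 2).astype(float), (X >= 3).astype(float)])
lam = np.array([0.5, -1.0])
p_lam = gibbs_distribution(lam, features)
print(p_lam, p_lam.sum())                   # a member of the family Q
```
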
Duality Theorem
(Della Pietra et al., 1997)
Theorem: assume that D(p̂ ‖ u) < ∞. Then, there exists a unique probability distribution p* satisfying
1. p* ∈ P ∩ Q̄;
2. D(p ‖ q) = D(p ‖ p*) + D(p* ‖ q) for any p ∈ P and q ∈ Q̄ (Pythagorean equality);
3. p* = argmin_{q∈Q̄} D(p̂ ‖ q) (maximum likelihood);
4. p* = argmin_{p∈P} D(p ‖ u) (maximum entropy).
Each of these properties determines p* uniquely.
Regularization
Overfitting:
• Features with low counts;
• Very large feature weights.
Gaussian priors: penalize large feature weights (weight prior).
p̃(x) ← p_λ(x) Pr[λ]
l̃(λ) ← l(λ) − Σ_{j=1}^n λ_j²/(2σ_j²) − (1/2) Σ_{j=1}^n log(2πσ_j²),
with Pr[λ] = Π_{j=1}^n (1/√(2πσ_j²)) exp(−λ_j²/(2σ_j²)).
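
A minimal sketch of the penalized objective l̃(λ) on the toy Gibbs family used above (illustrative names and data; the constant term in σ_j is dropped since it does not depend on λ):

```python
import numpy as np

def penalized_log_likelihood(lam, features, sample_idx, sigma2):
    """l~(lambda) = l(lambda) - sum_j lambda_j^2 / (2 sigma_j^2); the constant
    -(1/2) sum_j log(2 pi sigma_j^2) is dropped as it does not depend on lambda."""
    scores = lam @ features
    log_Z = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()
    log_p = scores - log_Z                         # log p_lambda(x) for every x in X
    ll = log_p[sample_idx].sum()                   # l(lambda) over the sample
    penalty = np.sum(lam ** 2 / (2.0 * sigma2))    # Gaussian-prior penalty
    return ll - penalty

X = np.arange(6)
features = np.stack([(X % 2).astype(float), (X >= 3).astype(float)])
sample_idx = np.array([1, 3, 3, 4, 0, 5, 2, 3])    # sample given as indices into X
sigma2 = np.array([1.0, 1.0])                      # prior variances sigma_j^2
print(penalized_log_likelihood(np.array([0.5, -1.0]), features, sample_idx, sigma2))
```
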
This Lecture
Relative entropy
Maximum entropy models
Training algorithm
Duality theorem



Training Algorithm
Convex function, but no closed-form solution.
Generalized iterative scaling (GIS).
Improved iterative scaling (IIS).
Second-order convex optimization methods.



Generalized Iterative Scaling (GIS)
(Darroch and Ratcliff, 1972)
Requirement: add a feature function f_{n+1} such that
∀j, f_j ≥ 0, and ∀i ∈ [1, m], Σ_{j=1}^{n+1} f_j(x_i) = C.
Parameter update:
λ_j^{(0)} ← arbitrary
λ_j^{(t+1)} ← λ_j^{(t)} + (1/C) log (E_p̂[f_j] / E_{p^{(t)}}[f_j]), ∀j ∈ [1, n + 1].
Convergence: slow for large C.
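
A compact GIS sketch on a finite toy set (not from the slides; X, features, and sample are illustrative; a slack feature is added so every point's features sum to C):

```python
import numpy as np

def gis(features, p_hat_feats, n_iters=200):
    """Generalized iterative scaling on a finite set X (minimal sketch).

    features: array (n, |X|) of non-negative feature values, one column per point of X.
    p_hat_feats: empirical feature expectations E_p_hat[f_j]."""
    C = features.sum(axis=0).max()
    slack = C - features.sum(axis=0)              # slack feature f_{n+1}: columns now sum to C
    F = np.vstack([features, slack])
    targets = np.append(p_hat_feats, C - p_hat_feats.sum())
    lam = np.zeros(F.shape[0])
    for _ in range(n_iters):
        scores = lam @ F
        p = np.exp(scores - scores.max())
        p /= p.sum()                              # current Gibbs distribution p^(t)
        model_feats = F @ p                       # E_{p^(t)}[f_j]
        lam += np.log(targets / model_feats) / C  # GIS update
    scores = lam @ F
    p = np.exp(scores - scores.max())
    return lam, p / p.sum()

X = np.arange(6)
features = np.stack([(X % 2).astype(float), (X >= 3).astype(float)])
sample = np.array([1, 3, 3, 4, 0, 5, 2, 3])
p_hat_feats = features[:, sample].mean(axis=1)
lam, p = gis(features, p_hat_feats)
print(features @ p, p_hat_feats)                  # model expectations match the empirical ones
```
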


Improved Iterative Scaling (IIS)
(Della Pietra et al., 1997)
Requirements: non-negative feature functions f_j and D(p̂ ‖ u) < ∞.
Parameter update:
λ_j^{(0)} ← arbitrary
λ_j^{(t+1)} ← λ_j^{(t)} + δ_j^{(t)}, ∀j ∈ [1, n + 1],
with Σ_{x∈X} p^{(t)}(x) f_j(x) e^{δ_j f_#(x)} = E_p̂[f_j],
and ∀x ∈ X, f_#(x) = Σ_{j=1}^n f_j(x).
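
A corresponding IIS sketch on the same toy problem (illustrative; each δ_j is obtained by solving its one-dimensional equation with a bracketing root finder):

```python
import numpy as np
from scipy.optimize import brentq

def iis(features, p_hat_feats, n_iters=100):
    """Improved iterative scaling on a finite set X (minimal sketch, non-negative features)."""
    n, _ = features.shape
    f_sharp = features.sum(axis=0)                # f_#(x) = sum_j f_j(x)
    lam = np.zeros(n)
    for _ in range(n_iters):
        scores = lam @ features
        p = np.exp(scores - scores.max())
        p /= p.sum()                              # p^(t)
        for j in range(n):
            def g(d):
                # sum_x p(x) f_j(x) exp(d f_#(x)) - E_p_hat[f_j]
                return np.sum(p * features[j] * np.exp(d * f_sharp)) - p_hat_feats[j]
            lam[j] += brentq(g, -50.0, 50.0)      # solve for delta_j^(t)
    return lam

X = np.arange(6)
features = np.stack([(X % 2).astype(float), (X >= 3).astype(float)])
sample = np.array([1, 3, 3, 4, 0, 5, 2, 3])
p_hat_feats = features[:, sample].mean(axis=1)
lam = iis(features, p_hat_feats)
scores = lam @ features
p = np.exp(scores - scores.max()); p /= p.sum()
print(features @ p, p_hat_feats)                  # constraints satisfied at convergence
```
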


IIS - Convergence Proof
Ideas:
• Use of auxiliary functions (as for the EM algorithm).
• Two known inequalities for convex functions:
  • ∀x > 0, −log(x) ≥ 1 − x.
  • exp(Σ_x p(x) α(x)) ≤ Σ_x p(x) exp(α(x)), which follows from Jensen's inequality.


IIS - Convergence Proof
Gibbs distribution:
p_λ(x) = (1/Z_λ) e^{λ·f(x)}.
Log-likelihood:
l(p_λ) = Σ_{i=1}^m log p_λ(x_i) = Σ_{i=1}^m [−log Z_λ + λ · f(x_i)].
Auxiliary function A(δ, p):
l(p_{λ+δ}) − l(p_λ) ≥ A(δ, p_λ),
∀p, A(0, p) = 0, and ∇A(0, p) = 0 iff p ∈ P.


IIS - Convergence Proof
Lower bound on the log-likelihood variation:
l(p_{λ+δ}) − l(p_λ) = Σ_{i=1}^m δ · f(x_i) − m log (Z_{λ+δ}/Z_λ)
  = m δ · E_p̂[f] − m log ( Σ_x e^{λ·f(x)} e^{δ·f(x)} / Z_λ )
  = m δ · E_p̂[f] − m log Σ_x p_λ(x) e^{δ·f(x)}
  ≥ m δ · E_p̂[f] + m − m Σ_x p_λ(x) e^{δ·f(x)}                                   (log inequality)
  = m δ · E_p̂[f] + m − m Σ_x p_λ(x) exp( Σ_{j=1}^n (f_j(x)/f_#(x)) δ_j f_#(x) )
  ≥ m δ · E_p̂[f] + m − m Σ_x p_λ(x) Σ_{j=1}^n (f_j(x)/f_#(x)) e^{δ_j f_#(x)}     (Jensen's inequality)
  = m [ δ · E_p̂[f] + 1 − Σ_x Σ_{j=1}^n p_λ(x) (f_j(x)/f_#(x)) e^{δ_j f_#(x)} ] =: A(δ, p_λ).
Maximizing the lower bound:
∂A/∂δ_j = E_p̂[f_j] − Σ_x p_λ(x) f_j(x) e^{δ_j f_#(x)} = 0,
where the second term is E_{p_λ}[f_j e^{δ_j f_#}].
IIS - Convergence Proof
Let (p^{(t)}) be the sequence of distributions defined by the algorithm. By Tychonov's theorem, the set of distributions over X is compact. Thus, this sequence admits a subsequence (p^{(t_k)}) converging to some p′. By definition of Q̄, p′ is in Q̄.
It remains to show that p′ is in P. By the definition of the maximization of A, for any δ ≥ 0,
A(δ, p^{(t_k)}) ≤ A(δ^{(t_k)}, p^{(t_k)}) ≤ L(p^{(t_{k+1})}) − L(p^{(t_k)}),
since L(p^{(t)}) is monotonically increasing.


IIS - Convergence Proof
Taking the limit and using the continuity of A yields
∀δ ≥ 0, A(δ, p′) ≤ 0.
But A(0, p) = 0 for all p, in particular for p = p′. Thus, A(δ, p′) is maximal at δ = 0, ∇A(0, p′) = 0, and p′ ∈ P.
Thus, p′ ∈ P ∩ Q̄. By the duality theorem, p′ = p*.
The uniqueness of the convergence point implies in fact that the sequence (p^{(t)}) is converging. Otherwise, infinitely many p^{(t)}s would fall outside an open ball centered at p*, and no subsequence of that infinite set could converge to p*.
Improved Iterative Scaling (IIS)
Binary features (f_j(x) ∈ {0, 1}):
• f_#(x) is the total number of features on.
• The equation defining δ_j is then a polynomial equation in e^{δ_j}.
• Newton's method to find the zeros of a function.
Monte-Carlo methods for large X.
Coincides with GIS when f_# = C.


Comparison
IIS is typically faster than GIS, particularly when C is large.
Second-order convex optimization methods (but require computing the inverse of a large Hessian).
First-order methods (conjugate gradient)? (Malouf, 2002)


Regularization
Overfitting:
• Features with low counts;
• Very large feature weights.
Gaussian priors: penalize large feature weights (weight prior).
p̃(x) ← p_λ(x) Pr[λ]
l̃(λ) ← l(λ) − Σ_{j=1}^n λ_j²/(2σ_j²) − (1/2) Σ_{j=1}^n log(2πσ_j²),
with Pr[λ] = Π_{j=1}^n (1/√(2πσ_j²)) exp(−λ_j²/(2σ_j²)).
IIS with Gaussian Priors
Auxiliary function:
Ã(δ) ← A(δ) − Σ_{j=1}^n [ (λ_j + δ_j)²/(2σ_j²) − λ_j²/(2σ_j²) ]
∂Ã/∂δ_j ← ∂A/∂δ_j − (λ_j + δ_j)/σ_j².
Parameter update:
λ_j^{(0)} ← arbitrary
λ_j^{(t+1)} ← λ_j^{(t)} + δ_j^{(t)}, ∀j ∈ [1, n + 1],
with Σ_{x∈X} p^{(t)}(x) f_j(x) e^{δ_j f_#(x)} + (λ_j + δ_j)/σ_j² = E_p̂[f_j].
This Lecture
Relative entropy
Maximum entropy models
Training algorithm
Duality theorem



Duality Theorem
Lemma 1: if D(p̂ ‖ u) < ∞, then P ∩ Q̄ ≠ ∅.
Proof: let Φ be the continuous function defined by
Φ : Q̄ → R, q ↦ D(p̂ ‖ q).
Since D(p̂ ‖ u) < ∞, the set S defined by
S = { q ∈ Q̄ : D(p̂ ‖ q) ≤ D(p̂ ‖ u) }
is closed and bounded. Thus Φ attains its minimum on S ⊆ Q̄, at a unique point p* because Φ is strictly convex (property of relative entropy).
Duality Theorem
It remains to show that p* is in P. Define
Ψ : R^n → R, λ ↦ D(p̂ ‖ q_{λ+λ*}),   with ∀x, q_λ(x) = (1/Z_λ) e^{λ·f(x)} and q_{λ*} = p*.
Ψ is continuously differentiable and attains its minimum at λ = 0. Thus, (∇Ψ)(0) = 0.
Ψ(λ) = constant + Σ_x p̂(x) [ log Σ_{x′} e^{(λ+λ*)·f(x′)} − (λ + λ*) · f(x) ]
∇Ψ(λ) = Σ_x p̂(x) [ Σ_{x′} e^{(λ+λ*)·f(x′)} f(x′) / Σ_{x′} e^{(λ+λ*)·f(x′)} − f(x) ]
       = E_{q_{λ+λ*}}[f] − E_p̂[f].
Setting λ = 0 thus gives E_{p*}[f] = E_p̂[f], that is, p* ∈ P.
Duality Theorem
Lemma 2: ∀(p, q) ∈ P × Q̄, D(p ‖ q) = D(p ‖ p*) + D(p* ‖ q).
Proof: ∀(p, q) ∈ P × Q̄, writing p*(x) = (1/Z_{λ*}) e^{λ*·f(x)} and q(x) = (1/Z_λ) e^{λ·f(x)},
D(p ‖ p*) + D(p* ‖ q)
= −H(p) − Σ_x p(x) log p*(x) − H(p*) − Σ_x p*(x) log q(x)
= −H(p) − Σ_x p(x) [−log Z_{λ*} + λ* · f(x)] + Σ_x p*(x) [−log Z_{λ*} + λ* · f(x)] − Σ_x p*(x) log q(x)
= −H(p) − Σ_x p*(x) log q(x)                       (E_p[f] = E_{p*}[f])
= −H(p) − Σ_x p*(x) [−log Z_λ + λ · f(x)]
= −H(p) − Σ_x p(x) [−log Z_λ + λ · f(x)]           (E_{p*}[f] = E_p[f])
= D(p ‖ q).
Note: ∀q ∈ Q̄, E_p[log q] = E_{p*}[log q].


Duality Theorem
Proof: by Lemma 1, there exists a unique p* verifying property 1. By Lemma 2, p* verifies property 2. Property 2 implies:
D(p̂ ‖ q) = D(p̂ ‖ p*) + D(p* ‖ q) ≥ D(p̂ ‖ p*), and
D(p ‖ u) = D(p ‖ p*) + D(p* ‖ u) ≥ D(p* ‖ u),
which give properties 3 and 4.
Each of the properties determines p* uniquely. For example, if p′ verifies property 1, then by Lemma 2,
∀(p, q) ∈ P × Q̄, D(p ‖ q) = D(p ‖ p′) + D(p′ ‖ q).
Thus, D(p* ‖ p*) = D(p* ‖ p′) + D(p′ ‖ p*) = 0, and p′ = p*. Idem if p′ verifies property 2.
Duality Theorem
If p′ verifies property 3, then D(p̂ ‖ p′) ≤ D(p̂ ‖ p*). Since D(p̂ ‖ p′) = D(p̂ ‖ p*) + D(p* ‖ p′), this implies D(p* ‖ p′) = 0, hence p′ = p*.
If p′ verifies property 4, then D(p′ ‖ u) ≤ D(p* ‖ u). Since D(p′ ‖ u) = D(p′ ‖ p*) + D(p* ‖ u), this implies D(p′ ‖ p*) = 0, hence p′ = p*.


Bregman Divergences
Definition: let F be a convex and differentiable function; the Bregman divergence based on F is defined as
∆_F(y, x) = F(y) − F(x) − (y − x) · ∇F(x).
Examples:
• Unnormalized relative entropy.
• Squared Euclidean distance.
[Figure: F(y) compared with the tangent value F(x) + (y − x) · ∇F(x) at x; ∆_F(y, x) is the gap between the two.]
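
A small sketch of the definition (illustrative vectors), checked on the two examples: F(x) = ‖x‖² gives the squared Euclidean distance, and F(x) = Σ_i x_i log x_i − x_i gives the unnormalized relative entropy:

```python
import numpy as np

def bregman(F, grad_F, y, x):
    """Delta_F(y, x) = F(y) - F(x) - (y - x) . grad F(x)."""
    return F(y) - F(x) - np.dot(y - x, grad_F(x))

# Squared Euclidean distance: F(x) = ||x||^2.
F_sq = lambda v: np.dot(v, v)
grad_sq = lambda v: 2.0 * v

# Unnormalized relative entropy: F(x) = sum_i x_i log x_i - x_i (x_i > 0).
F_ent = lambda v: np.sum(v * np.log(v) - v)
grad_ent = lambda v: np.log(v)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.3, 0.3, 0.4])
print(bregman(F_sq, grad_sq, y, x), np.sum((y - x) ** 2))   # equal
print(bregman(F_ent, grad_ent, y, x))                       # sum_i y_i log(y_i/x_i) - y_i + x_i
```
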
References
• Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996.
• J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39, 357-365, 1944.
• Imre Csiszar and Tusnady. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1, 205-237, 1984.
• Imre Csiszar. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3), pp. 1409-1413, 1989.
• J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5), pp. 1470-1480, 1972.
• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), pp. 380-393, April 1997.


References
• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS-01-109, School of Computer Science, CMU, 2001.
• E. Jaynes. Information theory and statistical mechanics. Physics Reviews, 106:620-630, 1957.
• E. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.
• O'Sullivan. Alternating minimization algorithms: from Blahut-Arimoto to expectation-maximization. In Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
• Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer, Speech and Language, 10:187-228, 1996.
