
Bayesian Inference with Dirichlet Process
Mo Chen

1
Outline
• Preliminary
– Bayesian Inference
– Exponential Family
– Directed Graphical Model
– Gibbs Sampling
– Finite Mixture Model
• Dirichlet Process
– Dirichlet Distribution
– Dirichlet Process
– Infinite Mixture Model
• Representation of DP
– Chinese Restaurant Process
– Stick Breaking Construction
2

Glossary
• Observation: $X = \{x_n\}_{n=1}^{N}$
• Latent variable: $z$, $Z$
• Parameter: $\theta$
• Joint distribution: $p(x, z)$
• Conditional distribution: $p(x|z)$
• Marginal distribution:
$p(x) = \int p(x, z)\,dz = \int p(x|z)\,p(z)\,dz$

3
Glossary
• Parameterized distribution: $x_i \sim p(x|\theta)$
– with density function $p(x|\theta)$
• Likelihood: $p(X|\theta) = \prod_{n=1}^{N} p(x_n|\theta)$
• Prior: $p(\theta)$
• Posterior: $p(\theta|X)$

4
Inference: Frequentist
• (1) Fitting (training)
– ML: $\hat{\theta} = \arg\max_{\theta} p(X|\theta)$
– MAP: $\hat{\theta} = \arg\max_{\theta} p(\theta|X)$
• (2) Inference (predict): $p(x|\hat{\theta})$

5
Gaussian: Frequentist (ML)
• (1) Fitting (training): $\hat{\theta} = \arg\max_{\theta} p(X|\theta)$
$\max_{\mu, \Sigma} \sum_{n=1}^{N} \ln \mathcal{N}(x_n|\mu, \Sigma)$
$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n \qquad \hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu})(x_n - \hat{\mu})^T$
• (2) Inference (predict): $p(x|\hat{\theta}) = \mathcal{N}(x|\hat{\mu}, \hat{\Sigma})$

6
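A minimal NumPy sketch of the ML fit above; the function name and interface are illustrative, not from the slides:

```python
import numpy as np

def fit_gaussian_ml(X):
    """Maximum-likelihood estimates for a multivariate Gaussian.

    X: (N, d) array of observations.
    Returns the sample mean and the 1/N (ML, biased) covariance,
    matching the formulas on the slide.
    """
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    sigma_hat = centered.T @ centered / X.shape[0]  # 1/N, not 1/(N-1)
    return mu_hat, sigma_hat
```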
Gaussian: MAP Fitting
• Prior
p(µ, Λ)

• Posterior
p(µ, Λ|X) ∝ p(X|µ, Λ)p(µ, Λ)
• MAP
$\max_{\mu, \Lambda}\; \ln p(X|\mu, \Lambda) + \ln p(\mu, \Lambda)$

7
Inference: Bayesian
• Prior
p(θ)

• Posterior
$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)}$
» where $p(X) = \int p(X|\theta)\,p(\theta)\,d\theta$
• Inference
$p(x|X) = \int p(x|\theta)\,p(\theta|X)\,d\theta$

8
Gaussian: Bayesian

• Prior (conjugate)
$p(\mu, \Lambda) = \mathcal{N}(\mu|m^o, (\kappa^o \Lambda)^{-1})\,\mathcal{W}(\Lambda|(T^o)^{-1}, \nu^o)$
• Posterior
$p(\mu, \Lambda|X) = \mathcal{N}(\mu|m, (\kappa\Lambda)^{-1})\,\mathcal{W}(\Lambda|T^{-1}, \nu)$
$\kappa = \kappa^o + N \qquad \nu = \nu^o + N$
$m = \frac{\kappa^o m^o + N\bar{x}}{\kappa^o + N}$
$T = T^o + N S + \frac{\kappa^o N (\bar{x} - m^o)(\bar{x} - m^o)^T}{\kappa^o + N}$
9
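A sketch of this conjugate update in NumPy, assuming $S$ is the sample covariance (scatter matrix divided by $N$) as in the slide; names mirror the slide's hyperparameters and the function name is illustrative:

```python
import numpy as np

def normal_wishart_posterior(X, m0, kappa0, T0, nu0):
    """Posterior hyperparameters of the Normal-Wishart prior
    given Gaussian observations, following the slide's update rules."""
    N, d = X.shape
    x_bar = X.mean(axis=0)
    centered = X - x_bar
    S = centered.T @ centered / N               # sample covariance (1/N)
    kappa = kappa0 + N
    nu = nu0 + N
    m = (kappa0 * m0 + N * x_bar) / (kappa0 + N)
    diff = (x_bar - m0).reshape(-1, 1)
    T = T0 + N * S + kappa0 * N * (diff @ diff.T) / (kappa0 + N)
    return m, kappa, T, nu
```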

Gaussian: Bayesian

• Inference
$p(x|X) = \iint p(x|\mu, \Lambda)\,p(\mu, \Lambda|X)\,d\mu\,d\Lambda$
$= \int \Big[\int \mathcal{N}(x|\mu, \Lambda^{-1})\,\mathcal{N}(\mu|m, (\kappa\Lambda)^{-1})\,d\mu\Big]\,\mathcal{W}(\Lambda|T^{-1}, \nu)\,d\Lambda$
$= \mathcal{T}\Big(x\,\Big|\,m,\ \frac{(\kappa+1)\,T}{\kappa(\nu - d + 1)},\ \nu - d + 1\Big)$

10
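The predictive density is a multivariate Student's t. A small sketch of evaluating it, assuming scipy.stats.multivariate_t (SciPy >= 1.6) uses the same (location, shape matrix, degrees of freedom) parameterization as the $\mathcal{T}$ above:

```python
import numpy as np
from scipy.stats import multivariate_t

def posterior_predictive_pdf(x, m, kappa, T, nu):
    """Evaluate p(x|X) = T(x | m, (kappa+1) T / (kappa (nu-d+1)), nu-d+1)."""
    d = len(m)
    df = nu - d + 1
    shape = (kappa + 1) * T / (kappa * df)
    return multivariate_t(loc=m, shape=shape, df=df).pdf(x)
```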

Student’s t-Distribution

11
Student’s t-Distribution
Robustness to outliers: Gaussian vs. t-distribution.

12
Bayesian Inference
• Methodology: Integral

• Tractable: Analytical solution

• Intractable: Approximation
– Monte Carlo (Gibbs Sampling)
– Taylor Expansion (Laplace approximation)
– Variational methods (VB, EP)

13

The Exponential Family
• Density
$p(x|\theta) = h(x)\,g(\theta)\exp\{\theta^T \phi(x)\}$
• Conjugate prior
$p(\theta|\chi^o, \nu^o) = f(\chi^o, \nu^o)\,g(\theta)^{\nu^o}\exp\{\nu^o\,\theta^T \chi^o\}$
• Posterior
$p(\theta|X, \chi^o, \nu^o) = p(\theta|\chi, \nu) = f(\chi, \nu)\,g(\theta)^{\nu}\exp\{\nu\,\theta^T \chi\}$
» where
$\nu = \nu^o + N \qquad \chi = \frac{\nu^o \chi^o + \sum_{n=1}^{N} \phi(x_n)}{\nu^o + N}$
14
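The conjugate update only touches the hyperparameters $(\nu, \chi)$. A minimal sketch, assuming the sufficient statistics $\phi(x_n)$ have already been computed and stacked into an array; the function name is illustrative:

```python
import numpy as np

def conjugate_update(phi_X, nu0, chi0):
    """Exponential-family conjugate hyperparameter update from the slide.

    phi_X: (N, k) array whose n-th row is phi(x_n).
    nu0, chi0: prior hyperparameters (nu^o, chi^o).
    Returns the posterior hyperparameters (nu, chi).
    """
    N = phi_X.shape[0]
    nu = nu0 + N
    chi = (nu0 * np.asarray(chi0) + phi_X.sum(axis=0)) / (nu0 + N)
    return nu, chi
```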

The Exponential Family
• Marginalization
$p(X) = \int p(X|\theta)\,p(\theta|\chi^o, \nu^o)\,d\theta = \frac{f(\chi^o, \nu^o)}{f(\chi, \nu)}\prod_{n=1}^{N} h(x_n)$

15

Bayesian Networks
Directed Acyclic Graph (DAG)

$p(a, b, c) = p(c|a, b)\,p(b|a)\,p(a)$

$p(x_1, \cdots, x_m) = p(x_m|x_1, \cdots, x_{m-1})\cdots p(x_2|x_1)\,p(x_1)$

16
Bayesian Networks
$p(x_1, \cdots, x_7) = p(x_1)\,p(x_2)\,p(x_3)\,p(x_4|x_1, x_2, x_3)\,p(x_5|x_1, x_3)\,p(x_6|x_4)\,p(x_7|x_4, x_5)$

General Factorization
$p(x) = \prod_{i=1}^{m} p(x_i|\pi(x_i))$

17
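A toy sketch of evaluating such a DAG-factorized joint, given a conditional probability function per node; all names here are illustrative, not from the slides:

```python
def joint_prob(assignment, parents, cond_prob):
    """Evaluate p(x) = prod_i p(x_i | pi(x_i)) for one full assignment.

    assignment: dict node -> observed value
    parents:    dict node -> tuple of parent nodes (pi(x_i))
    cond_prob:  dict node -> function(value, parent_values) -> probability
    """
    p = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[node])
        p *= cond_prob[node](value, parent_values)
    return p
```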
Gibbs Sampling
• Sequentially sample each variable from its full conditional (one at a time):
$\hat{z}_i \sim p(z_i|Z_{\setminus i}, X)$
• Directed graphical model: the full conditional involves only the Markov blanket,
$\hat{x}_i \sim p(x_i|\pi(x_i), \lambda(x_i), \pi(\lambda(x_i)))$
where $\pi(x_i)$ are the parents and $\lambda(x_i)$ the children of $x_i$.

18
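A toy Gibbs sampler that alternates the two full conditionals of a standard bivariate Gaussian with correlation rho; this example is only an illustration of the scheme above, not from the slides:

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, n_samples, seed=0):
    """Gibbs sampling for a zero-mean, unit-variance bivariate Gaussian.

    Each full conditional p(z_i | z_{-i}) is univariate Gaussian, so one
    sweep consists of two one-dimensional draws.
    """
    rng = np.random.default_rng(seed)
    z1, z2 = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        z1 = rng.normal(rho * z2, np.sqrt(1.0 - rho**2))  # z1 | z2
        z2 = rng.normal(rho * z1, np.sqrt(1.0 - rho**2))  # z2 | z1
        samples[t] = (z1, z2)
    return samples
```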

Mixtures of Gaussians
Old Faithful data set: a single Gaussian fit vs. a mixture of two Gaussians.

19
Mixtures of Gaussians
Combine simple models into a complex model.
(Figure: individual components and mixing coefficients, K = 3.)

20
Mixture of Gaussians
• Latent indicator: $p(z|\pi) = \prod_{k=1}^{K} \pi_k^{z_k}$, i.e. $p(z_k = 1) = \pi_k$
• Conditional: $p(x|z) = \prod_{k=1}^{K} \mathcal{N}(x|\mu_k, \Sigma_k)^{z_k}$
• Marginal: $p(x) = \sum_z p(x, z) = \sum_z p(x|z)\,p(z) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x|\mu_k, \Sigma_k)$
• Complete-data likelihood:
$p(X, Z|\mu, \Sigma, \pi) = \prod_{n=1}^{N}\prod_{k=1}^{K}\big[\pi_k\,\mathcal{N}(x_n|\mu_k, \Sigma_k)\big]^{z_{nk}}$
(Graphical model: $\pi \to z_n \to x_n$, with $\mu$ and $\Sigma$ as parents of $x_n$; plate over $N$.)

21
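A sketch of evaluating the mixture log-likelihood $\sum_n \ln \sum_k \pi_k\,\mathcal{N}(x_n|\mu_k, \Sigma_k)$ in NumPy/SciPy; the helper name is illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def mog_log_likelihood(X, pi, mus, Sigmas):
    """Log-likelihood of a K-component Gaussian mixture.

    X: (N, d) data; pi: (K,) weights; mus: (K, d) means; Sigmas: (K, d, d).
    """
    K = len(pi)
    # log pi_k + log N(x_n | mu_k, Sigma_k), stacked into an (N, K) matrix
    log_terms = np.stack(
        [np.log(pi[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
         for k in range(K)], axis=1)
    return float(logsumexp(log_terms, axis=1).sum())
```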
Bayesian MoG
(Graphical model: plate over the $N$ i.i.d. observations; $\pi \to z_n \to x_n$, with $\mu$ and $\Lambda$ as parents of $x_n$.)
• Conjugate priors
– Dirichlet for $\pi$:
$p(\pi) = \mathcal{D}(\pi|\alpha^o)$
– Gaussian-Wishart for $\mu, \Lambda$:
$p(\mu, \Lambda) = \prod_{k=1}^{K} \mathcal{N}(\mu_k|m^o, (\kappa^o \Lambda_k)^{-1})\,\mathcal{W}(\Lambda_k|(T^o)^{-1}, \nu^o)$
• Joint distribution
$p(x, z, \pi, \mu, \Lambda) = p(x|z, \mu, \Lambda)\,p(z|\pi)\,p(\pi)\,p(\mu|\Lambda)\,p(\Lambda)$

22
Gibbs Sampling for MoG
• Sample the indicator variables:
$z_{nk} \sim \frac{1}{C_n}\,\pi_k\, p(x_n|\mu_k, \Sigma_k), \qquad C_n = \sum_{k=1}^{K} \pi_k\, p(x_n|\mu_k, \Sigma_k)$
• Sample the mixture weights:
$\pi \sim \mathcal{D}(\pi|N_1 + \alpha^o, \ldots, N_K + \alpha^o)$
• Sample the parameters of each Gaussian:
$\Lambda_k \sim \mathcal{W}(\Lambda|\nu_k, W_k), \qquad \mu_k \sim \mathcal{N}(\mu|m_k, (\kappa_k \Lambda_k)^{-1})$

23
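Putting the three conditional draws together, one possible Gibbs sampler for the Bayesian MoG of the previous slides; variable names follow the slides' notation, but this is an illustrative sketch rather than the author's code:

```python
import numpy as np
from scipy.stats import dirichlet, wishart, multivariate_normal

def gibbs_mog(X, K, alpha0, m0, kappa0, T0, nu0, n_iters=100, seed=0):
    """Gibbs sampling for a Bayesian mixture of Gaussians with a Dirichlet
    prior on pi and a Normal-Wishart prior on each (mu_k, Lambda_k)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    z = rng.integers(K, size=N)                       # random initial assignments
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]          # init means at data points
    Lambdas = np.stack([np.eye(d)] * K)

    for _ in range(n_iters):
        # 1. Sample indicators z_n from their normalized responsibilities.
        logp = np.stack(
            [np.log(pi[k]) +
             multivariate_normal(mus[k], np.linalg.inv(Lambdas[k])).logpdf(X)
             for k in range(K)], axis=1)
        probs = np.exp(logp - logp.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=p) for p in probs])

        # 2. Sample pi ~ Dirichlet(N_1 + alpha0, ..., N_K + alpha0).
        counts = np.bincount(z, minlength=K)
        pi = dirichlet(counts + alpha0).rvs(random_state=rng)[0]

        # 3. Sample (mu_k, Lambda_k) from their Normal-Wishart conditionals.
        for k in range(K):
            Xk = X[z == k]
            Nk = len(Xk)
            x_bar = Xk.mean(axis=0) if Nk > 0 else m0
            scatter = (Xk - x_bar).T @ (Xk - x_bar) if Nk > 0 else np.zeros((d, d))
            kappa_k = kappa0 + Nk
            nu_k = nu0 + Nk
            m_k = (kappa0 * m0 + Nk * x_bar) / kappa_k
            diff = (x_bar - m0).reshape(-1, 1)
            T_k = T0 + scatter + kappa0 * Nk * (diff @ diff.T) / kappa_k
            Lambdas[k] = wishart(df=nu_k, scale=np.linalg.inv(T_k)).rvs(random_state=rng)
            mus[k] = rng.multivariate_normal(m_k, np.linalg.inv(kappa_k * Lambdas[k]))

    return z, pi, mus, Lambdas
```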

The Multinomial Distribution

$p(z|\mu) = \mathcal{M}(z|\mu) = \prod_{k=1}^{K} \mu_k^{z_k}, \qquad \sum_{k=1}^{K} \mu_k = 1$

24

The Dirichlet Distribution

Conjugate prior for the multinomial distribution.

25
Dirichlet Posterior

$p(\mu|Z, \alpha) \propto p(Z|\mu)\,p(\mu|\alpha) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}$

$p(\mu|Z, \alpha) = \mathcal{D}(\mu|\alpha + m)$

26
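A small sketch of this posterior in SciPy, where $m_k$ counts how often category $k$ occurs among the observations; the integer label encoding of Z is an assumption made here for illustration:

```python
import numpy as np
from scipy.stats import dirichlet

def dirichlet_posterior(alpha, Z):
    """Return the posterior D(mu | alpha + m) from the slide.

    alpha: (K,) prior concentration; Z: integer labels in {0, ..., K-1}.
    """
    m = np.bincount(Z, minlength=len(alpha))
    return dirichlet(np.asarray(alpha) + m)

# Usage: posterior mean and one posterior draw of mu
post = dirichlet_posterior(alpha=[1.0, 1.0, 1.0], Z=np.array([0, 0, 1, 2, 2, 2]))
print(post.mean(), post.rvs())
```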
Dirichlet

27
