
Lecture Date: February 8th, 2010

Stat260: Bayesian Modeling and Inference

The Conjugate Prior for the Normal Distribution


Lecturer: Michael I. Jordan

Scribe: Teodor Mihai Moldovan

We will look at the Gaussian distribution from a Bayesian point of view. In the standard form, the likelihood has two parameters, the mean $\mu$ and the variance $\sigma^2$:

$$P(x_1, x_2, \ldots, x_n \mid \mu, \sigma^2) \propto \frac{1}{\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2 \right) \qquad (1)$$
Our aim is to find conjugate prior distributions for these parameters. We will investigate the hyper-parameter (prior parameter) update relations and the problem of predicting new data from old data: $P(x_{\mathrm{new}} \mid x_{\mathrm{old}})$.

1 Fixed variance ($\sigma^2$), random mean ($\mu$)

Keeping $\sigma^2$ fixed, the conjugate prior for $\mu$ is a Gaussian.

$$P(\mu \mid \mu_0, \sigma_0^2) \propto \frac{1}{\sigma_0} \exp\left( -\frac{1}{2\sigma_0^2} (\mu - \mu_0)^2 \right) \qquad (2)$$

(typically $\mu_0 = 0$ and $\sigma_0^2$ is large)

Remark 1. In practice, when little is known about $\mu$, it is common to set the location hyper-parameter $\mu_0$ to zero and the scale $\sigma_0^2$ to some large value.

1.1 Posterior for single measurement (n = 1)

We want to put together the prior (2) and the likelihood (1) to get the posterior $P(\mu \mid x)$. For now, assume we have only one measurement ($n = 1$). There are several ways to do this:

- We could multiply the two distributions directly and complete the square in the exponent.
- Alternatively, note that $\mu$ and $x$ have a joint Gaussian distribution, so the conditional $\mu \mid x$ is also Gaussian, with parameters given by known formulas:
Lemma 2. Assume $(z_1, z_2)$ is distributed according to a bivariate Gaussian. Then $z_1 \mid z_2$ is Gaussian distributed with parameters:

$$E(z_1 \mid z_2) = E(z_1) + \frac{\mathrm{Cov}(z_1, z_2)}{\mathrm{Var}(z_2)} \left( z_2 - E(z_2) \right) \qquad (3)$$

$$\mathrm{Var}(z_1 \mid z_2) = \mathrm{Var}(z_1) - \frac{\mathrm{Cov}^2(z_1, z_2)}{\mathrm{Var}(z_2)} \qquad (4)$$


Remark 3. These formulas are extremely useful, so you should memorize them. They are easily derived based on the notion of a Schur complement of a matrix.
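As a quick sanity check (not part of the original notes), formulas (3) and (4) can be verified numerically by conditioning a large bivariate Gaussian sample on a thin slice of $z_2$; all numbers below are made up for illustration.

    # Illustration only (not from the lecture notes): numerical check of Lemma 2.
    import numpy as np

    rng = np.random.default_rng(0)
    mean = np.array([1.0, -2.0])
    cov = np.array([[2.0, 0.8],
                    [0.8, 1.5]])
    z = rng.multivariate_normal(mean, cov, size=1_000_000)

    z2_0 = -1.0
    slab = z[np.abs(z[:, 1] - z2_0) < 0.01, 0]   # samples with z2 close to z2_0

    cond_mean = mean[0] + cov[0, 1] / cov[1, 1] * (z2_0 - mean[1])   # formula (3)
    cond_var = cov[0, 0] - cov[0, 1] ** 2 / cov[1, 1]                # formula (4)
    print(slab.mean(), cond_mean)   # both close to 1.53
    print(slab.var(), cond_var)     # both close to 1.57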
We apply this lemma with the correspondence $x \leftrightarrow z_2$, $\mu \leftrightarrow z_1$:

$$x = \mu + \sigma\varepsilon, \qquad \varepsilon \sim N(0, 1)$$
$$\mu = \mu_0 + \sigma_0\varepsilon', \qquad \varepsilon' \sim N(0, 1)$$

$$E(x) = \mu_0 \qquad (5)$$
$$\mathrm{Var}(x) = E(\mathrm{Var}(x \mid \mu)) + \mathrm{Var}(E(x \mid \mu)) = \sigma^2 + \sigma_0^2 \qquad (6)$$
$$\mathrm{Cov}(x, \mu) = E\left[ (x - \mu_0)(\mu - \mu_0) \right] = \sigma_0^2 \qquad (7)$$

Using equations (3) and (4):

$$E(\mu \mid x) = \mu_0 + \frac{\sigma_0^2}{\sigma^2 + \sigma_0^2}(x - \mu_0) = \frac{\sigma^2}{\sigma^2 + \sigma_0^2} \underbrace{\mu_0}_{\text{prior mean}} + \frac{\sigma_0^2}{\sigma^2 + \sigma_0^2} \underbrace{x}_{\text{MLE}} \qquad (8)$$

$$\mathrm{Var}(\mu \mid x) = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + \sigma_0^2} = \left( \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2} \right)^{-1} = (\tau_{\text{prior}} + \tau_{\text{data}})^{-1} \qquad (9)$$

Definition 4. $1/\sigma^2$ is usually called the precision and is denoted by $\tau$.

The posterior mean is a convex combination of the prior mean and the MLE. The posterior precision is, in this case, the sum of the prior precision and the data precision:

$$\tau_{\text{post}} = \tau_{\text{prior}} + \tau_{\text{data}}$$
We summarize our results so far:
Lemma 5. Assume $x \mid \mu \sim N(\mu, \sigma^2)$ and $\mu \sim N(\mu_0, \sigma_0^2)$. Then:

$$\mu \mid x \sim N\left( \frac{\sigma_0^2}{\sigma^2 + \sigma_0^2}\, x + \frac{\sigma^2}{\sigma^2 + \sigma_0^2}\, \mu_0,\ \left( \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2} \right)^{-1} \right)$$
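For a concrete feel for Lemma 5 (numbers chosen here for illustration, not from the lecture): with $\sigma^2 = 1$, $\mu_0 = 0$, $\sigma_0^2 = 4$ and a single observation $x = 2$, the posterior is $\mu \mid x \sim N\!\left( \tfrac{4}{5} \cdot 2 + \tfrac{1}{5} \cdot 0,\ \left( \tfrac{1}{4} + 1 \right)^{-1} \right) = N(1.6,\ 0.8)$, so one moderately informative observation already pulls the posterior mean most of the way from the prior mean toward the data.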

1.2 Posterior for multiple measurements (n ≥ 1)

Now look at the posterior update for multiple measurements. We could adapt our previous derivation, but that would be tedious since we would have to use the multivariate version of Lemma 2. Instead we will reduce the problem to the univariate case, with the sample mean $\bar{x} = \left( \sum_i x_i \right)/n$ as the new variable.

$$x_i \mid \mu \sim N(\mu, \sigma^2) \ \text{ i.i.d.} \quad \Longrightarrow \quad \bar{x} \mid \mu \sim N\!\left( \mu, \frac{\sigma^2}{n} \right) \qquad (10)$$

$$P(x_1, x_2, \ldots, x_n \mid \mu) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2 \right) \propto \exp\left( -\frac{1}{2\sigma^2} \left( \sum_i x_i^2 - 2\mu \sum_i x_i + n\mu^2 \right) \right)$$
$$\propto \exp\left( -\frac{n}{2\sigma^2} \left( -2\mu\bar{x} + \mu^2 \right) \right) \propto \exp\left( -\frac{n}{2\sigma^2} (\bar{x} - \mu)^2 \right) \propto P(\bar{x} \mid \mu) \qquad (11)$$


Then for the posterior probability, we get

$$P(\mu \mid x_1, x_2, \ldots, x_n) \propto P(x_1, x_2, \ldots, x_n \mid \mu)\, P(\mu) \propto P(\bar{x} \mid \mu)\, P(\mu) \propto P(\mu \mid \bar{x}) \qquad (12)$$

We can now plug $\bar{x}$ into our previous result and we get:
Lemma 6. Assume $x_i \mid \mu \sim N(\mu, \sigma^2)$ i.i.d. and $\mu \sim N(\mu_0, \sigma_0^2)$. Then:

$$\mu \mid x_1, x_2, \ldots, x_n \sim N\left( \frac{\sigma_0^2}{\frac{\sigma^2}{n} + \sigma_0^2}\, \bar{x} + \frac{\frac{\sigma^2}{n}}{\frac{\sigma^2}{n} + \sigma_0^2}\, \mu_0,\ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)^{-1} \right)$$
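A minimal sketch (not from the lecture, with hypothetical function and variable names) of the update in Lemma 6: the posterior mean and variance of $\mu$ for $n$ observations with known $\sigma^2$.

    # Illustration only (not from the lecture notes): posterior for mu, Lemma 6.
    import numpy as np

    def posterior_mu(x, sigma2, mu0, sigma0_2):
        """Posterior mean and variance of mu given data x (Lemma 6)."""
        x = np.asarray(x, dtype=float)
        n, xbar = x.size, x.mean()
        post_var = 1.0 / (n / sigma2 + 1.0 / sigma0_2)             # precisions add
        post_mean = post_var * (n * xbar / sigma2 + mu0 / sigma0_2)
        return post_mean, post_var

    # Usage: a vague prior (large sigma0^2) is quickly overwhelmed by the data.
    rng = np.random.default_rng(1)
    data = rng.normal(3.0, 1.0, size=50)
    print(posterior_mu(data, sigma2=1.0, mu0=0.0, sigma0_2=100.0))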

2 Random variance ($\sigma^2$), fixed mean ($\mu$)

2.1 Posterior

Assuming $\mu$ is fixed, the conjugate prior for $\sigma^2$ is an inverse Gamma distribution:

$$z \mid \alpha, \beta \sim \mathrm{IG}(\alpha, \beta) \qquad P(z \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, z^{-\alpha - 1} \exp\left( -\frac{\beta}{z} \right) \qquad (13)$$

For the posterior we get another inverse Gamma:

$$P(\sigma^2 \mid \alpha, \beta, x_1, \ldots, x_n) \propto (\sigma^2)^{-\left( \alpha + \frac{n}{2} \right) - 1} \exp\left( -\frac{\beta + \frac{1}{2}\sum_i (x_i - \mu)^2}{\sigma^2} \right) \propto (\sigma^2)^{-\alpha_{\text{post}} - 1} \exp\left( -\frac{\beta_{\text{post}}}{\sigma^2} \right) \qquad (14)$$

Lemma 7. If $x_i \mid \mu, \sigma^2 \sim N(\mu, \sigma^2)$ i.i.d. and $\sigma^2 \sim \mathrm{IG}(\alpha, \beta)$, then:

$$\sigma^2 \mid x_1, x_2, \ldots, x_n \sim \mathrm{IG}\left( \alpha + \frac{n}{2},\ \beta + \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 \right)$$
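For instance (illustrative numbers, not from the lecture): with $\alpha = 2$, $\beta = 1$, $\mu = 0$, $n = 3$ and $\sum_i x_i^2 = 6$, Lemma 7 gives $\sigma^2 \mid x \sim \mathrm{IG}\!\left( 2 + \tfrac{3}{2},\ 1 + \tfrac{6}{2} \right) = \mathrm{IG}(3.5,\ 4)$.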
If we re-parametrize in terms of precisions, the conjugate prior is a Gamma distribution:

$$\tau \mid \alpha, \beta \sim \mathrm{Ga}(\alpha, \beta) \qquad P(\tau \mid \alpha, \beta) = \frac{\beta^\alpha \tau^{\alpha - 1} \exp(-\beta\tau)}{\Gamma(\alpha)} \qquad (15)$$

And the posterior is:

$$P(\tau \mid \alpha, \beta, x_1, \ldots, x_n) \propto \tau^{\left( \alpha + \frac{n}{2} \right) - 1} \exp\left( -\left( \beta + \frac{1}{2} \sum_i (x_i - \mu)^2 \right)\tau \right) \qquad (16)$$

Lemma 8. If $x_i \mid \mu, \tau \sim N(\mu, \tau^{-1})$ i.i.d. and $\tau \sim \mathrm{Ga}(\alpha, \beta)$, then:

$$\tau \mid x_1, x_2, \ldots, x_n \sim \mathrm{Ga}\left( \alpha + \frac{n}{2},\ \beta + \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 \right)$$

Remark 9. Should we prefer working with variances or precisions? We should prefer both:

- Variances add when we marginalize.
- Precisions add when we condition.
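A small sketch (not from the lecture; the function name is made up) of the hyper-parameter update in Lemma 8. The same $(\alpha_{\text{post}}, \beta_{\text{post}})$ pair also parametrizes the inverse-Gamma posterior for $\sigma^2$ in Lemma 7.

    # Illustration only (not from the lecture notes): Gamma posterior for tau, Lemma 8.
    import numpy as np

    def posterior_tau(x, mu, alpha, beta):
        """Gamma posterior hyper-parameters for the precision tau (Lemma 8)."""
        x = np.asarray(x, dtype=float)
        alpha_post = alpha + x.size / 2.0
        beta_post = beta + 0.5 * np.sum((x - mu) ** 2)
        return alpha_post, beta_post

    rng = np.random.default_rng(2)
    data = rng.normal(0.0, 2.0, size=100)          # true sigma = 2, so tau = 0.25
    a, b = posterior_tau(data, mu=0.0, alpha=1.0, beta=1.0)
    print(a / b)                                    # posterior mean of tau, near 0.25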


2.2 Prediction

We might want to compute the probability of getting some new data given old data. This can be done by marginalizing out parameters:

$$P(x_{\mathrm{new}} \mid x, \mu, \alpha, \beta) = \int P(x_{\mathrm{new}} \mid x, \mu, \alpha, \beta, \tau)\, P(\tau \mid x, \alpha, \beta)\, d\tau$$
$$= \int P(x_{\mathrm{new}} \mid \mu, \tau)\, P(\tau \mid x, \alpha, \beta)\, d\tau$$
$$= \int P(x_{\mathrm{new}} \mid \mu, \tau)\, P(\tau \mid \alpha_{\text{post}}, \beta_{\text{post}})\, d\tau \qquad (17)$$
This integral smears the Gaussian into a heavier-tailed distribution, which will turn out to be a Student's t-distribution:
$$\tau \mid \alpha, \beta \sim \mathrm{Ga}(\alpha, \beta) \qquad x \mid \mu, \tau \sim N(\mu, \tau^{-1})$$

$$P(x \mid \mu, \alpha, \beta) = \int \frac{\beta^\alpha \tau^{\alpha - 1} e^{-\beta\tau}}{\Gamma(\alpha)} \left( \frac{\tau}{2\pi} \right)^{\frac{1}{2}} \exp\left( -\frac{\tau}{2}(x - \mu)^2 \right) d\tau$$
$$= \frac{\beta^\alpha}{\Gamma(\alpha)\, (2\pi)^{\frac{1}{2}}} \int \tau^{\left( \alpha + \frac{1}{2} \right) - 1} e^{-\left( \beta + \frac{1}{2}(x - \mu)^2 \right)\tau}\, d\tau \qquad \text{(Gamma integral; use the memorized normalizing constant)}$$
$$= \frac{\beta^\alpha}{\Gamma(\alpha)\, (2\pi)^{\frac{1}{2}}} \cdot \frac{\Gamma\!\left( \alpha + \frac{1}{2} \right)}{\left( \beta + \frac{1}{2}(x - \mu)^2 \right)^{\alpha + \frac{1}{2}}}$$
$$= \frac{\Gamma\!\left( \alpha + \frac{1}{2} \right)}{\Gamma(\alpha)\, (2\pi\beta)^{\frac{1}{2}}} \left( 1 + \frac{1}{2\beta}(x - \mu)^2 \right)^{-\left( \alpha + \frac{1}{2} \right)} \qquad (18)$$

Remark 10. The student-t density has three parameters: $\alpha$, $\beta$, $\mu$, and is symmetric around $\mu$. When $\alpha$ is an integer or a half-integer we get simplifications using the formulas $\Gamma(k + 1) = k\,\Gamma(k)$ and $\Gamma(1/2) = \sqrt{\pi}$.
The following is another useful parametrization for the Student's t-distribution, with $p = 2\alpha$ and $\lambda = \alpha/\beta$:

$$P(x \mid \mu, p, \lambda) = \frac{\Gamma\!\left( \frac{p+1}{2} \right)}{\Gamma\!\left( \frac{p}{2} \right)} \left( \frac{\lambda}{p\pi} \right)^{\frac{1}{2}} \left( 1 + \frac{\lambda}{p}(x - \mu)^2 \right)^{-\frac{p+1}{2}} \qquad (19)$$

with two interesting special cases:

- If $p = 1$ we get a Cauchy distribution.
- If $p \to \infty$ we get a Gaussian distribution.
Remark 11. We might want to sample from a Student's t-distribution. We would sample $\tau_i \sim \mathrm{Ga}(\alpha, \beta)$, then sample $x_i \sim N(\mu, \tau_i^{-1})$, collect the $x_i$ and repeat.
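A minimal sketch of the compound sampling scheme in Remark 11 (function and parameter names are mine, not the lecture's):

    # Illustration only (not from the lecture notes): student-t by compounding.
    import numpy as np

    def sample_t_compound(mu, alpha, beta, size, rng):
        """Draw tau ~ Ga(alpha, beta), then x ~ N(mu, 1/tau), for each sample."""
        tau = rng.gamma(shape=alpha, scale=1.0 / beta, size=size)   # rate parameter beta
        return rng.normal(loc=mu, scale=1.0 / np.sqrt(tau))

    rng = np.random.default_rng(3)
    draws = sample_t_compound(mu=0.0, alpha=2.0, beta=2.0, size=100_000, rng=rng)
    # With p = 2*alpha = 4 degrees of freedom and lambda = alpha/beta = 1,
    # the variance is p/(p - 2) = 2; the empirical variance should be close to that.
    print(draws.var())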


3 Both variance ($\sigma^2$) and mean ($\mu$) are random

Now, we want to put a prior on $\mu$ and $\sigma^2$ together. We could simply multiply the prior densities we obtained in the previous two sections, implicitly assuming $\mu$ and $\sigma^2$ are independent. Unfortunately, if we did that, we would not get a conjugate prior. One way to see this is that if we believe that our data is generated according to the graphical model in Figure 1, we find that, conditioned on $x$, the two parameters $\mu$ and $\sigma^2$ are, in fact, dependent, and this should be expressed by a conjugate prior.

[Figure 1: $\mu$ and $\sigma^2$ are dependent conditioned on $x$.]


We will use the following prior distribution which, as we will show, is conjugate to the Gaussian likelihood:
xi | , N (, )

i.i.d.

| N (0 , n0 )
Ga(, )

3.1 Posterior

First look at $\mu \mid x, \tau$. This is the simpler part, as we can use Lemma 6 with known variance $\tau^{-1}$ and prior variance $(n_0\tau)^{-1}$:

$$\mu \mid x, \tau \sim N\left( \frac{n}{n + n_0}\, \bar{x} + \frac{n_0}{n + n_0}\, \mu_0,\ \left( (n + n_0)\tau \right)^{-1} \right) \qquad (20)$$

Next, look at $\tau \mid x$. We get this by expressing the joint density $P(\mu, \tau \mid x)$ and marginalizing out $\mu$:

$$P(\mu, \tau \mid x) \propto P(\tau)\, P(\mu \mid \tau)\, P(x \mid \mu, \tau)$$
$$\propto \tau^{\alpha - 1} e^{-\beta\tau}\ \tau^{\frac{1}{2}} \exp\left( -\frac{n_0\tau}{2}(\mu - \mu_0)^2 \right)\ \tau^{\frac{n}{2}} \exp\left( -\frac{\tau}{2} \sum_i (x_i - \mu)^2 \right) \qquad (21)$$

(trick: $x_i - \mu = (x_i - \bar{x}) + (\bar{x} - \mu)$)

$$\propto \tau^{\alpha + \frac{n}{2} - 1} \exp\left( -\left( \beta + \frac{1}{2} \sum_i (x_i - \bar{x})^2 \right)\tau \right)\ \tau^{\frac{1}{2}} \exp\left( -\frac{\tau}{2}\left( n_0(\mu - \mu_0)^2 + n(\bar{x} - \mu)^2 \right) \right) \qquad (22)$$

As we integrate out $\mu$ we get the normalization constant

$$\tau^{-\frac{1}{2}} \exp\left( -\frac{n n_0}{2(n + n_0)} (\bar{x} - \mu_0)^2\, \tau \right),$$

which leads to a Gamma posterior for $\tau$:

$$P(\tau \mid x) \propto \tau^{\alpha + \frac{n}{2} - 1} \exp\left( -\left( \beta + \frac{1}{2} \sum_i (x_i - \bar{x})^2 + \frac{n n_0}{2(n + n_0)} (\bar{x} - \mu_0)^2 \right)\tau \right) \qquad (23)$$

To summarize:


Lemma 12. If we assume:

$$x_i \mid \mu, \tau \sim N(\mu, \tau^{-1}) \ \text{ i.i.d.}$$
$$\mu \mid \tau \sim N\!\left( \mu_0, (n_0\tau)^{-1} \right)$$
$$\tau \sim \mathrm{Ga}(\alpha, \beta)$$

then the posterior is:

$$\mu \mid \tau, x \sim N\left( \frac{n}{n + n_0}\, \bar{x} + \frac{n_0}{n + n_0}\, \mu_0,\ \left( (n + n_0)\tau \right)^{-1} \right)$$

$$\tau \mid x \sim \mathrm{Ga}\left( \alpha + \frac{n}{2},\ \beta + \frac{1}{2} \sum_{i=1}^n (x_i - \bar{x})^2 + \frac{n n_0}{2(n + n_0)} (\bar{x} - \mu_0)^2 \right)$$
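A small sketch (not from the lecture; the helper name is mine) of the joint hyper-parameter update in Lemma 12, mapping $(\mu_0, n_0, \alpha, \beta)$ to their posterior values:

    # Illustration only (not from the lecture notes): Normal-Gamma update, Lemma 12.
    import numpy as np

    def normal_gamma_update(x, mu0, n0, alpha, beta):
        """Posterior hyper-parameters of the Normal-Gamma prior (Lemma 12)."""
        x = np.asarray(x, dtype=float)
        n, xbar = x.size, x.mean()
        mu_post = (n * xbar + n0 * mu0) / (n + n0)
        n_post = n + n0
        alpha_post = alpha + n / 2.0
        beta_post = (beta + 0.5 * np.sum((x - xbar) ** 2)
                     + n * n0 * (xbar - mu0) ** 2 / (2.0 * (n + n0)))
        return mu_post, n_post, alpha_post, beta_post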

3.2 Prediction

$$P(x_{\mathrm{new}} \mid x) = \int\!\!\int \underbrace{\text{Gamma}}_{\tau \mid x}\ \underbrace{\text{Gaussian}}_{\mu \mid \tau, x}\ \underbrace{\text{Gaussian}}_{x_{\mathrm{new}} \mid \mu, \tau}\ d\mu\, d\tau = \int \underbrace{\text{Gamma}}_{\tau \mid x}\ \underbrace{\text{Gaussian}}_{x_{\mathrm{new}} \mid \tau, x}\ d\tau = \underbrace{\text{student-t}}_{x_{\mathrm{new}} \mid x}$$
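A Monte Carlo version of this computation (a sketch under my own naming, reusing the hypothetical normal_gamma_update from the previous snippet): sample $\tau \mid x$, then $\mu \mid \tau, x$, then $x_{\mathrm{new}} \mid \mu, \tau$.

    # Illustration only (not from the lecture notes): ancestral sampling of x_new | x.
    import numpy as np

    def sample_predictive(x, mu0, n0, alpha, beta, size, rng):
        """Draw from P(x_new | x) via the posterior of Lemma 12."""
        mu_post, n_post, a_post, b_post = normal_gamma_update(x, mu0, n0, alpha, beta)
        tau = rng.gamma(shape=a_post, scale=1.0 / b_post, size=size)  # tau | x
        mu = rng.normal(mu_post, 1.0 / np.sqrt(n_post * tau))         # mu | tau, x
        return rng.normal(mu, 1.0 / np.sqrt(tau))                     # x_new | mu, tau

    rng = np.random.default_rng(4)
    data = rng.normal(1.0, 0.5, size=20)
    x_new = sample_predictive(data, mu0=0.0, n0=1.0, alpha=1.0, beta=1.0,
                              size=100_000, rng=rng)
    print(x_new.mean(), x_new.std())   # heavier-tailed than a plain Gaussian fit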
