
Bayesian Estimation

• Bayesian estimators differ from all classical estimators studied so far in that they treat the parameters as random variables instead of unknown constants.
• As such, the parameters also have a PDF, which needs to be taken into account when seeking an estimator.
• The PDF of the parameters can be used to incorporate any prior knowledge we may have about their values.
Bayesian Estimation

• For example, we might know that the normalized frequency f0 of an observed sinusoid cannot be greater than 0.1. This is ensured by choosing

  p(f0) = { 10, if 0 ≤ f0 ≤ 0.1
          {  0, otherwise

as the prior PDF in the Bayesian framework.


• Differentiable PDFs are usually easier to work with, so we could approximate the uniform PDF with, e.g., the Rayleigh PDF (both are sketched in code below the figure).
[Figure: two candidate priors plotted against the normalized frequency f0. Left: the uniform density (height 10 on [0, 0.1]). Right: a Rayleigh density with σ = 0.035.]
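As a minimal illustration (not part of the original slides), the two priors can be written down directly; the Rayleigh scale σ = 0.035 is the value quoted in the figure above:

```python
import numpy as np

def uniform_prior(f0):
    """Uniform prior on [0, 0.1]: density 10 inside the interval, 0 outside."""
    f0 = np.asarray(f0, dtype=float)
    return np.where((f0 >= 0.0) & (f0 <= 0.1), 10.0, 0.0)

def rayleigh_prior(f0, sigma=0.035):
    """Rayleigh density: a smooth, differentiable stand-in for the uniform prior."""
    f0 = np.asarray(f0, dtype=float)
    return np.where(f0 >= 0.0, (f0 / sigma**2) * np.exp(-f0**2 / (2 * sigma**2)), 0.0)

f0 = np.linspace(0.0, 1.0, 1001)
df = f0[1] - f0[0]
# Both densities integrate to (approximately) one over [0, 1].
print(np.sum(uniform_prior(f0)) * df, np.sum(rayleigh_prior(f0)) * df)
```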
Prior and Posterior estimates

• One of the key properties of the Bayesian approach is that it can be used also for small data records, and the estimate can be improved sequentially as new data arrives.
• For example, consider tossing a coin and estimating the probability of a head, µ.
• As we saw earlier, the ML estimate is the number of observed heads divided by the total number of tosses:

  µ̂ = #heads / #tosses.

• However, if we cannot afford to make more than, say, three experiments, we may end up seeing three heads and no tails. We are then forced to infer that µ̂ML = 1, i.e., that the coin always lands heads.
Prior and Posterior estimates

• The Bayesian approach can circumvent this problem, because the prior regularizes the likelihood and avoids overfitting to the small amount of data.
• The pictures below illustrate this. The one on the top is the likelihood function

  p(x | µ) = µ^#heads (1 − µ)^#tails

with #heads = 3 and #tails = 0. The maximum of the function is at unity.
• The second curve is the prior density p(µ) of our choice. It was selected to reflect our assumption that the coin is probably quite fair.
Prior and Posterior estimates

• The third curve is the posterior density p(µ | x) after observing the samples, which can be evaluated using the Bayes formula

  p(µ | x) = p(x | µ) · p(µ) / p(x) = (likelihood · prior) / p(x)

• Thus, the third curve is the product of the first two (with normalization), and one Bayesian alternative is to use its maximum as the estimate; a small numerical sketch of this product-and-normalize step follows.
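The sketch below (added here for illustration) evaluates the posterior on a grid for the coin example, assuming the Gaussian prior with mean 0.5 and σ = 0.1 that is used later in the slides:

```python
import numpy as np

mu = np.linspace(0.0, 1.0, 2001)                 # grid over the parameter µ
d_mu = mu[1] - mu[0]
heads, tails = 3, 0                              # three tosses, all heads
sigma = 0.1                                      # prior std (value used later in the slides)

likelihood = mu**heads * (1 - mu)**tails         # p(x | µ)
prior = np.exp(-(mu - 0.5)**2 / (2 * sigma**2))  # Gaussian prior, up to a constant

posterior = likelihood * prior                   # numerator of the Bayes formula
posterior /= posterior.sum() * d_mu              # normalize so it integrates to one

print("likelihood peaks at", mu[np.argmax(likelihood)])  # 1.0, the ML estimate
print("posterior peaks at", mu[np.argmax(posterior)])    # ≈ 0.55, pulled toward 0.5
```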
Prior and Posterior estimates

[Figure: three curves over µ ∈ [0, 1]. Top: the likelihood function p(x | µ) after three tosses resulting in a head (maximum at µ = 1). Middle: the prior density p(µ) before observing any data (centered at µ = 0.5). Bottom: the posterior density p(µ | x) after observing 3 heads (maximum near µ ≈ 0.55).]
Cost Functions

• Bayesian estimators are defined by a minimization problem

  θ̂ = arg min_θ̂ ∬ C(θ − θ̂) p(x, θ) dx dθ

which seeks the value of θ̂ that minimizes the average cost.
Cost Functions

• The cost function C(x) is typically one of the following (sketched in code after the list):
  1. Quadratic: C(x) = x²
  2. Absolute: C(x) = |x|
  3. Hit-or-miss: C(x) = 0 if |x| < δ, and C(x) = 1 if |x| > δ
• Additional cost functions include Huber's robust loss and the ε-insensitive loss.
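A minimal sketch (added here for illustration) of the three cost functions; δ is a free parameter of the hit-or-miss cost:

```python
import numpy as np

def quadratic_cost(e):
    """Quadratic cost C(e) = e**2, which leads to the MMSE estimator (posterior mean)."""
    return np.asarray(e) ** 2

def absolute_cost(e):
    """Absolute cost C(e) = |e|, which leads to the posterior median."""
    return np.abs(e)

def hit_or_miss_cost(e, delta=0.05):
    """Hit-or-miss cost: 0 inside a ±delta band around zero, 1 outside; leads to the MAP estimator."""
    return (np.abs(e) > delta).astype(float)

errors = np.array([-0.2, -0.01, 0.0, 0.03, 0.4])
print(quadratic_cost(errors), absolute_cost(errors), hit_or_miss_cost(errors))
```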
Cost Functions

• These three cost functions are favoured because we can find the minimum-cost solution in closed form. We will introduce the solutions next.
• Functions 1 and 3 are slightly easier to use than 2, so we'll concentrate on those.
• Regardless of the cost function, the above double integral can be evaluated and minimized using the rule for joint probabilities:

  p(x, θ) = p(θ | x) p(x).
Cost Functions

• This results in

  ∬ C(θ − θ̂) p(θ | x) p(x) dx dθ = ∫ [ ∫ C(θ − θ̂) p(θ | x) dθ ] p(x) dx,

where the bracketed inner integral is denoted (∗).

• Because p(x) is always nonnegative, it suffices to minimize the inner integral (∗) for each x:¹

  θ̂ = arg min_θ̂ ∫ C(θ − θ̂) p(θ | x) dθ

¹ Note that there is a slight shift in the paradigm. The double integral results in the theoretical estimate that requires knowledge of p(x). When minimizing only the inner integral, we get the optimum for a particular realization, not for all possible realizations.
1. Quadratic Cost Solution (or the MMSE
estimator)

• If we select the quadratic cost, then the Bayesian estimator is defined by

  arg min_θ̂ ∫ (θ − θ̂)² p(θ | x) dθ

• Simple differentiation gives:

  ∂/∂θ̂ ∫ (θ − θ̂)² p(θ | x) dθ = ∫ ∂/∂θ̂ [(θ − θ̂)²] p(θ | x) dθ
                                = ∫ −2(θ − θ̂) p(θ | x) dθ
1. Quadratic Cost Solution (or the MMSE
estimator)

• Setting this equal to zero gives

  ∫ −2(θ − θ̂) p(θ | x) dθ = 0
  ⇔ 2θ̂ ∫ p(θ | x) dθ = 2 ∫ θ p(θ | x) dθ
  ⇔ θ̂ ∫ p(θ | x) dθ = ∫ θ p(θ | x) dθ      (and ∫ p(θ | x) dθ = 1)
  ⇔ θ̂ = ∫ θ p(θ | x) dθ
1. Quadratic Cost Solution (or the MMSE
estimator)

• Thus, we have the minimum:

  θ̂MMSE = ∫ θ p(θ | x) dθ = E(θ | x),

i.e., the mean of the posterior PDF p(θ | x).²

• This is called the minimum mean square error estimator (MMSE estimator), because it minimizes the average squared error.

² The prior PDF, p(θ), refers to the parameter distribution before any observations are made. The posterior PDF, p(θ | x), refers to the parameter distribution after observing the data.
2. Absolute Cost Solution

• If we choose the absolute value as the cost function, we have to minimize

  arg min_θ̂ ∫ |θ − θ̂| p(θ | x) dθ

• This can be shown to be equivalent to the following condition

  ∫_{−∞}^{θ̂} p(θ | x) dθ = ∫_{θ̂}^{∞} p(θ | x) dθ
2. Absolute Cost Solution

• In other words, the estimate is the value which divides the probability mass into equal halves:

  ∫_{−∞}^{θ̂} p(θ | x) dθ = 1/2

• Thus, we have arrived at the definition of the median of the posterior PDF.
3. Hit-or-miss Cost Solution (or the MAP
estimator)

• For the hit-or-miss case, we also need to minimize the inner integral:

  θ̂ = arg min_θ̂ ∫ C(θ − θ̂) p(θ | x) dθ

with

  C(x) = 0 if |x| < δ, and C(x) = 1 if |x| > δ
3. Hit-or-miss Cost Solution (or the MAP
estimator)

• The integral becomes

  ∫ C(θ − θ̂) p(θ | x) dθ = ∫_{−∞}^{θ̂−δ} 1 · p(θ | x) dθ + ∫_{θ̂+δ}^{∞} 1 · p(θ | x) dθ

or in a simplified form

  ∫ C(θ − θ̂) p(θ | x) dθ = 1 − ∫_{θ̂−δ}^{θ̂+δ} p(θ | x) dθ
3. Hit-or-miss Cost Solution (or the MAP
estimator)

• This is minimized by maximizing

  ∫_{θ̂−δ}^{θ̂+δ} p(θ | x) dθ

• For small δ and smooth p(θ | x), the maximum of the integral occurs at the maximum of p(θ | x).
• Therefore, the estimator is the mode (the location of the highest value) of the posterior PDF, hence the name Maximum a Posteriori (MAP) estimator.
3. Hit-or-miss Cost Solution (or the MAP
estimator)

• Note that the MAP estimator

  θ̂MAP = arg max_θ p(θ | x)

is calculated, using Bayes' rule, as:

  θ̂MAP = arg max_θ p(x | θ) p(θ) / p(x)
3. Hit-or-miss Cost Solution (or the MAP
estimator)

• Since p(x) does not depend on θ, it is equivalent to maximize only the numerator:

  θ̂MAP = arg max_θ p(x | θ) p(θ)

• Incidentally, this is close to the ML estimator:

  θ̂ML = arg max_θ p(x | θ)

The only difference is the inclusion of the prior PDF.


Summary

• To summarize, the three most widely used Bayesian estimators are the following (a numerical sketch is given after the list):
  1. The MMSE estimator: θ̂MMSE = E(θ | x)
  2. The median: the θ̂ satisfying ∫_{−∞}^{θ̂} p(θ | x) dθ = 1/2
  3. The MAP estimator: θ̂MAP = arg max_θ p(x | θ) p(θ)
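A minimal numerical sketch (not from the original slides) of all three estimators for the coin example, using a gridded posterior with the Gaussian prior (mean 0.5, σ = 0.1) from the example that follows:

```python
import numpy as np

mu = np.linspace(0.0, 1.0, 20001)
d_mu = mu[1] - mu[0]
heads, tails, sigma = 3, 0, 0.1

# Unnormalized posterior = likelihood * Gaussian prior centered at 0.5.
post = mu**heads * (1 - mu)**tails * np.exp(-(mu - 0.5)**2 / (2 * sigma**2))
post /= post.sum() * d_mu                        # normalize to integrate to one

cdf = np.cumsum(post) * d_mu                     # crude CDF on the grid

mmse = np.sum(mu * post) * d_mu                  # posterior mean  -> MMSE estimate
median = mu[np.searchsorted(cdf, 0.5)]           # posterior median
map_est = mu[np.argmax(post)]                    # posterior mode  -> MAP estimate

print(f"MMSE   {mmse:.3f}")
print(f"Median {median:.3f}")
print(f"MAP    {map_est:.3f}")                   # ≈ 0.554, matching the example below
```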
Example

• Consider the case of tossing a coin three times, resulting in three heads.
• In the example, we used the Gaussian prior

  p(µ) = 1/√(2πσ²) · exp(−(µ − 0.5)² / (2σ²)).

• Now µ̂MAP becomes

  µ̂MAP = arg max_µ p(x | µ) p(µ)
        = arg max_µ [ µ^#heads (1 − µ)^#tails · 1/√(2πσ²) · exp(−(µ − 0.5)² / (2σ²)) ]
Example

• Let's simplify the arithmetic by setting #heads = 3 and #tails = 0:

  µ̂MAP = arg max_µ [ µ³ · 1/√(2πσ²) · exp(−(µ − 0.5)² / (2σ²)) ]

• Equivalently, we can maximize its logarithm:

  arg max_µ [ 3 ln µ − ln √(2πσ²) − (µ − 0.5)² / (2σ²) ]
Example

• Now,

  ∂/∂µ ln[p(x | µ) p(µ)] = 3/µ − (µ − 0.5)/σ² = 0,

when

  µ² − 0.5µ − 3σ² = 0.

This happens when

  µ = ( 0.5 ± √(0.25 − 4 · 1 · (−3σ²)) ) / 2 = 0.25 ± √(0.25 + 12σ²) / 2.
Example

• If we substitute the value used in the example, σ = 0.1, we get

  µ̂MAP = 0.25 + √0.37 / 2 ≈ 0.554.

• Thus, we have found the analytical solution for the maximum of the posterior curve on slide 5.
Vector Parameter Case for MMSE

• In the vector parameter case, the MMSE estimator is

  θ̂MMSE = E(θ | x)

or, more explicitly,

  θ̂MMSE = [ ∫ θ1 p(θ | x) dθ,  ∫ θ2 p(θ | x) dθ,  …,  ∫ θp p(θ | x) dθ ]ᵀ
Vector Parameter Case for MMSE

• In the linear model case, there exists a straightforward solution. If the observed data can be modeled as

  x = Hθ + w,

where θ ∼ N(µθ, Cθ) and w ∼ N(0, Cw), then

  E(θ | x) = µθ + Cθ Hᵀ (H Cθ Hᵀ + Cw)⁻¹ (x − Hµθ)

Vector Parameter Case for MMSE

• It is possible to derive an alternative form resembling the LS estimator (exercise):

  E(θ | x) = µθ + (Cθ⁻¹ + Hᵀ Cw⁻¹ H)⁻¹ Hᵀ Cw⁻¹ (x − Hµθ).

• Note that this becomes the LS estimator if µθ = 0, Cθ = I and Cw = σw² I (a numerical check of the two forms follows below).
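As a sanity check (added here, not part of the original slides), a small sketch that evaluates both forms of the linear-model MMSE estimator on random data and verifies that they agree numerically; the dimensions and covariances are arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3

H = rng.standard_normal((N, p))
mu_theta = rng.standard_normal(p)
C_theta = np.diag(rng.uniform(0.5, 2.0, p))      # prior covariance of theta
C_w = 0.25 * np.eye(N)                           # noise covariance

theta = rng.multivariate_normal(mu_theta, C_theta)
x = H @ theta + rng.multivariate_normal(np.zeros(N), C_w)

# Form 1: E(theta | x) = mu + C_th H^T (H C_th H^T + C_w)^{-1} (x - H mu)
form1 = mu_theta + C_theta @ H.T @ np.linalg.solve(H @ C_theta @ H.T + C_w,
                                                   x - H @ mu_theta)

# Form 2: E(theta | x) = mu + (C_th^{-1} + H^T C_w^{-1} H)^{-1} H^T C_w^{-1} (x - H mu)
Cw_inv = np.linalg.inv(C_w)
form2 = mu_theta + np.linalg.solve(np.linalg.inv(C_theta) + H.T @ Cw_inv @ H,
                                   H.T @ Cw_inv @ (x - H @ mu_theta))

print(np.allclose(form1, form2))                 # True: the two forms coincide
```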
Vector Parameter Case for the MAP

• The MAP estimator can also be extended to vector parameters:

  θ̂MAP = arg max_θ p(θ | x)

or, using Bayes' rule,

  θ̂MAP = arg max_θ p(x | θ) p(θ)

• Note that, in general, this is different from p scalar MAPs. A scalar MAP would maximize each parameter θi individually, whereas the vector MAP seeks the global maximum over the whole parameter space.
Example: MMSE Estimation of Sinusoidal
Parameters

• Consider the data model

  x[n] = a cos 2πf0 n + b sin 2πf0 n + w[n],   n = 0, 1, . . . , N − 1

or in vector form

  x = Hθ + w,

where

  H = [ 1, 0;
        cos 2πf0, sin 2πf0;
        cos 4πf0, sin 4πf0;
        ⋮
        cos 2(N − 1)πf0, sin 2(N − 1)πf0 ]    and    θ = [a, b]ᵀ
Example: MMSE Estimation of Sinusoidal
Parameters

• We depart from the classical model by assuming that a and b are random variables with prior PDF θ ∼ N(0, σθ² I). Also, w is assumed Gaussian with variance σw² (i.e., w ∼ N(0, σw² I)) and independent of θ.
• Using the second version of the formula for the linear model (on slide 28), we get the MMSE estimator:

  E(θ | x) = µθ + (Cθ⁻¹ + Hᵀ Cw⁻¹ H)⁻¹ Hᵀ Cw⁻¹ (x − Hµθ)
Example: MMSE Estimation of Sinusoidal
Parameters

or, in our case,³

  E(θ | x) = ( (1/σθ²) I + Hᵀ (1/σw²) I H )⁻¹ Hᵀ (1/σw²) I x
           = ( (1/σθ²) I + (1/σw²) Hᵀ H )⁻¹ (1/σw²) Hᵀ x

³ Note the correspondence with ridge regression. It holds that ridge regression is equivalent to the Bayesian estimator with a Gaussian prior for the coefficients. It also holds that the LASSO is equivalent to the Bayesian estimator with a Laplacian prior.
Example: MMSE Estimation of Sinusoidal
Parameters

• In earlier examples we have seen that the columns of H are nearly orthogonal (exactly orthogonal if f0 = k/N):

  Hᵀ H ≈ (N/2) I

• Thus,

  E(θ | x) ≈ ( (1/σθ²) I + (N/(2σw²)) I )⁻¹ (1/σw²) Hᵀ x
           = [ (1/σw²) / (1/σθ² + N/(2σw²)) ] Hᵀ x.
Example: MMSE Estimation of Sinusoidal
Parameters

• In all, the MMSE estimates become

  âMMSE = [ 1 / (1 + (2σw²/N)/σθ²) ] · (2/N) Σ_{n=0}^{N−1} x[n] cos 2πf0 n

  b̂MMSE = [ 1 / (1 + (2σw²/N)/σθ²) ] · (2/N) Σ_{n=0}^{N−1} x[n] sin 2πf0 n
Example: MMSE Estimation of Sinusoidal
Parameters

• For comparison, recall that the classical MVU estimator is

  âMVU = (2/N) Σ_{n=0}^{N−1} x[n] cos 2πf0 n

  b̂MVU = (2/N) Σ_{n=0}^{N−1} x[n] sin 2πf0 n
Example: MMSE Estimation of Sinusoidal
Parameters

• The difference can be interpreted as a weighting between the prior knowledge and the data.
• If the prior knowledge is unreliable (σθ² large), then 1 / (1 + (2σw²/N)/σθ²) ≈ 1 and the two estimators are almost equal.
• If the data is unreliable (σw² large), then the coefficient 1 / (1 + (2σw²/N)/σθ²) is small, making the estimate close to the mean of the prior PDF.
Example: MMSE Estimation of Sinusoidal
Parameters

• An example run is illustrated below (a simulation sketch is given after this description). In this case, N = 100, f0 = 15/N, σθ² = 0.48566, and σw² = 4.1173. Altogether M = 500 tests were performed.
• Since the prior PDF has a small variance, the estimator gains a lot from using it. This is seen as a significant difference between the MSEs of the two estimators.
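A minimal Monte Carlo sketch of this experiment (my reconstruction, not the course's BayesSinusoid.m; the parameter values are those quoted above):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 100, 500
f0 = 15 / N
sigma_theta2, sigma_w2 = 0.48566, 4.1173

n = np.arange(N)
H = np.column_stack([np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)])
shrink = 1.0 / (1.0 + (2 * sigma_w2 / N) / sigma_theta2)     # Bayesian shrinkage factor

err_mvu, err_mmse = [], []
for _ in range(M):
    theta = rng.normal(0.0, np.sqrt(sigma_theta2), size=2)      # random (a, b) from the prior
    x = H @ theta + rng.normal(0.0, np.sqrt(sigma_w2), size=N)  # noisy sinusoid

    theta_mvu = (2.0 / N) * (H.T @ x)        # classical estimator
    theta_mmse = shrink * theta_mvu          # approximate MMSE estimator (orthogonal columns)
    err_mvu.append((theta_mvu - theta) ** 2)
    err_mmse.append((theta_mmse - theta) ** 2)

print("MSE classical:", np.mean(err_mvu, axis=0))   # roughly 2*sigma_w2/N per coefficient
print("MSE Bayesian: ", np.mean(err_mmse, axis=0))  # smaller, since the prior is tight here
```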
Example: MMSE Estimation of Sinusoidal
Parameters

[Figure: histograms of the estimates over the 500 runs. Top row: classical estimator of a (MSE = 0.072474) and of b (MSE = 0.092735). Bottom row: Bayesian estimator of a (MSE = 0.061919) and of b (MSE = 0.076355).]
Example: MMSE Estimation of Sinusoidal
Parameters

• If the prior has a higher variance, the Bayesian approach does not perform that much better. In the pictures below, σθ² = 2.1937 and σw² = 1.9078. The difference in performance between the two approaches is negligible.
Example: MMSE Estimation of Sinusoidal
Parameters

[Figure: histograms of the estimates for the higher-variance prior. Top row: classical estimator of a (MSE = 0.040066) and of b (MSE = 0.034727). Bottom row: Bayesian estimator of a (MSE = 0.03951) and of b (MSE = 0.034477).]
Example: MMSE Estimation of Sinusoidal
Parameters

• The program code is available at http://www.cs.tut.fi/courses/SGN-2606/BayesSinusoid.m
Example: MAP Estimator

• Assume that

  p(x[n] | θ) = θ exp(−θ x[n]) if x[n] > 0, and 0 if x[n] < 0,

with the x[n] conditionally IID, and the prior of θ:

  p(θ) = λ exp(−λθ) if θ > 0, and 0 if θ < 0

• Now, θ is the unknown RV and λ is known.


Example: MAP Estimator

• Then the MAP estimator is found by maximizing p(θ | x), or equivalently p(x | θ) p(θ).
• Because both PDFs have an exponential form, it is easier to maximize the logarithm instead:

  θ̂ = arg max_θ ( ln p(x | θ) + ln p(θ) ).
Example: MAP Estimator

• Now,

  ln p(x | θ) + ln p(θ) = ln [ ∏_{n=0}^{N−1} θ exp(−θ x[n]) ] + ln[λ exp(−λθ)]
                        = ln [ θ^N exp(−θ Σ_{n=0}^{N−1} x[n]) ] + ln[λ exp(−λθ)]
                        = N ln θ − Nθx̄ + ln λ − λθ

• Differentiation produces

  d/dθ [ ln p(x | θ) + ln p(θ) ] = N/θ − Nx̄ − λ
Example: MAP Estimator

• Setting it equal to zero produces the MAP estimator (checked numerically below):

  θ̂ = 1 / (x̄ + λ/N)
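A quick sketch (added for illustration) that compares this closed-form MAP estimate with a direct numerical maximization of the log-posterior on simulated exponential data; λ and the true θ are arbitrary example values, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true, lam, N = 2.0, 1.5, 200               # example values (not from the slides)
x = rng.exponential(scale=1.0 / theta_true, size=N)   # x[n] ~ p(x[n] | theta)

xbar = x.mean()
theta_map = 1.0 / (xbar + lam / N)               # closed-form MAP estimator

# Brute-force check: maximize N*ln(theta) - N*theta*xbar - lam*theta on a grid.
grid = np.linspace(0.01, 10.0, 100000)
log_post = N * np.log(grid) - N * grid * xbar - lam * grid
theta_grid = grid[np.argmax(log_post)]

print(theta_map, theta_grid)                     # the two values should agree closely
```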
Example: Deconvolution

• Consider the situation where a signal s[n] passes through a channel with impulse response h[n] and is further corrupted by noise w[n]:

  x[n] = h[n] ∗ s[n] + w[n]
       = Σ_{k=0}^{K} h[k] s[n − k] + w[n],   n = 0, 1, . . . , N − 1
Example: Deconvolution

• Since convolution commutes, we can write this as

  x[n] = Σ_{k=0}^{ns−1} h[n − k] s[k] + w[n]

• In matrix form this is expressed by x = Hs + w, where H is the N × ns convolution matrix

  H = [ h[0]      0         ···   0         ;
        h[1]      h[0]      ···   0         ;
        ⋮         ⋮                ⋮         ;
        h[N − 1]  h[N − 2]  ···   h[N − ns] ]

with x = [x[0], …, x[N − 1]]ᵀ, s = [s[0], …, s[ns − 1]]ᵀ and w = [w[0], …, w[N − 1]]ᵀ.
Example: Deconvolution

• Thus, we have again the linear model

  x = Hs + w,

where the unknown parameter θ is the original signal s.
• The noise is assumed Gaussian: w[n] ∼ N(0, σ²).
• A reasonable assumption for the signal is that s ∼ N(0, Cs) with [Cs]ij = rss[i − j], where rss is the autocorrelation function of s.
• According to slide 28, the MMSE estimator is (a small sketch follows below)

  E(s | x) = µs + Cs Hᵀ (H Cs Hᵀ + Cw)⁻¹ (x − Hµs)
           = Cs Hᵀ (H Cs Hᵀ + σ² I)⁻¹ x
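A minimal sketch (my illustration; the channel, autocorrelation model, and noise level are made-up example values) of this MMSE deconvolution, building the convolution matrix H and applying the formula above with ns = N:

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(3)
N = 200
h = np.array([1.0, 0.7, 0.3])                    # example channel impulse response
sigma2 = 0.1                                     # noise variance (example value)

# Signal prior: zero-mean Gaussian with exponentially decaying autocorrelation.
rss = 0.9 ** np.abs(np.arange(N))
Cs = toeplitz(rss)                               # [Cs]_ij = rss[i - j]

# Convolution matrix H (lower-triangular Toeplitz), so that x = H s + w.
H = toeplitz(np.concatenate([h, np.zeros(N - len(h))]), np.zeros(N))

s = rng.multivariate_normal(np.zeros(N), Cs)     # draw a signal from the prior
x = H @ s + rng.normal(0.0, np.sqrt(sigma2), N)  # observed: filtered and noisy

# MMSE estimate: s_hat = Cs H^T (H Cs H^T + sigma2 I)^{-1} x
s_hat = Cs @ H.T @ np.linalg.solve(H @ Cs @ H.T + sigma2 * np.eye(N), x)

print("MSE of observation vs. signal:", np.mean((x - s) ** 2))
print("MSE of MMSE estimate:         ", np.mean((s_hat - s) ** 2))
```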
Example: Deconvolution

• In general, the form of the estimator varies a lot between different cases. However, as a special case:
• When H = I, the channel is the identity and only noise is present. In this case

  ŝ = Cs (Cs + σ² I)⁻¹ x

This estimator is called the Wiener filter. For example, in the single-data-point case,

  ŝ[0] = rss[0] / (rss[0] + σ²) · x[0]

Thus, the noise variance acts as a parameter describing the reliability of the data relative to the prior.
