
Bayesian Estimation

• Bayesian estimators differ from all classical estimators studied so far in that they treat the parameters as random variables instead of unknown constants.
• As such, the parameters also have a PDF, which needs to be taken into account when seeking an estimator.
• The PDF of the parameters can be used to incorporate any prior knowledge we may have about their values.
Bayesian Estimation

• For example, we might know that the normalized frequency f0 of an observed sinusoid cannot be greater than 0.1. This is ensured by choosing

  p(f0) = { 10, if 0 ≤ f0 ≤ 0.1
          {  0, otherwise

as the prior PDF in the Bayesian framework.


• Differentiable PDFs are usually easier to work with, so we could approximate the uniform PDF with, e.g., the Rayleigh PDF (both are sketched in code below the figure).
[Figure: two candidate priors plotted against the normalized frequency f0. Left: the uniform density (height 10 on [0, 0.1]). Right: a Rayleigh density with σ = 0.035.]
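As a minimal illustration (not part of the original slides), the two priors can be written down directly; the Rayleigh scale σ = 0.035 is the value quoted in the figure above:

```python
import numpy as np

def uniform_prior(f0):
    """Uniform prior on [0, 0.1]: density 10 inside the interval, 0 outside."""
    f0 = np.asarray(f0, dtype=float)
    return np.where((f0 >= 0.0) & (f0 <= 0.1), 10.0, 0.0)

def rayleigh_prior(f0, sigma=0.035):
    """Rayleigh density: a smooth, differentiable stand-in for the uniform prior."""
    f0 = np.asarray(f0, dtype=float)
    return np.where(f0 >= 0.0, (f0 / sigma**2) * np.exp(-f0**2 / (2 * sigma**2)), 0.0)

f0 = np.linspace(0.0, 1.0, 1001)
df = f0[1] - f0[0]
# Both densities integrate to (approximately) one over [0, 1].
print(np.sum(uniform_prior(f0)) * df, np.sum(rayleigh_prior(f0)) * df)
```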
Prior and Posterior estimates

• One of the key properties of the Bayesian approach is that it can be used also for small data records, and the estimate can be improved sequentially as new data arrives.
• For example, consider tossing a coin and estimating the probability of a head, µ.
• As we saw earlier, the ML estimate is the number of observed heads divided by the total number of tosses:

  µ̂ = #heads / #tosses.

• However, if we cannot afford to make more than, say, three experiments, we may end up seeing three heads and no tails. We are then forced to infer that µ̂ML = 1, i.e., that the coin always lands heads.
Prior and Posterior estimates

• The Bayesian approach can circumvent this problem, because the prior regularizes the likelihood and avoids overfitting to the small amount of data.
• The pictures below illustrate this. The one on the top is the likelihood function

  p(x | µ) = µ^#heads (1 − µ)^#tails

with #heads = 3 and #tails = 0. The maximum of the function is at unity.
• The second curve is the prior density p(µ) of our choice. It was selected to reflect our assumption that the coin is probably quite fair.
Prior and Posterior estimates

• The third curve is the posterior density p(µ | x) after observing the samples, which can be evaluated using the Bayes formula

  p(µ | x) = p(x | µ) · p(µ) / p(x) = (likelihood · prior) / p(x)

• Thus, the third curve is the product of the first two (with normalization), and one Bayesian alternative is to use its maximum as the estimate; a small numerical sketch of this product-and-normalize step follows.
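The sketch below (added here for illustration) evaluates the posterior on a grid for the coin example, assuming the Gaussian prior with mean 0.5 and σ = 0.1 that is used later in the slides:

```python
import numpy as np

mu = np.linspace(0.0, 1.0, 2001)                 # grid over the parameter µ
d_mu = mu[1] - mu[0]
heads, tails = 3, 0                              # three tosses, all heads
sigma = 0.1                                      # prior std (value used later in the slides)

likelihood = mu**heads * (1 - mu)**tails         # p(x | µ)
prior = np.exp(-(mu - 0.5)**2 / (2 * sigma**2))  # Gaussian prior, up to a constant

posterior = likelihood * prior                   # numerator of the Bayes formula
posterior /= posterior.sum() * d_mu              # normalize so it integrates to one

print("likelihood peaks at", mu[np.argmax(likelihood)])  # 1.0, the ML estimate
print("posterior peaks at", mu[np.argmax(posterior)])    # ≈ 0.55, pulled toward 0.5
```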
Prior and Posterior estimates

[Figure: three curves over µ ∈ [0, 1]. Top: the likelihood function p(x | µ) after three tosses resulting in a head (maximum at µ = 1). Middle: the prior density p(µ) before observing any data (centered at µ = 0.5). Bottom: the posterior density p(µ | x) after observing 3 heads (maximum near µ ≈ 0.55).]
Cost Functions

• Bayesian estimators are defined by a minimization problem

  θ̂ = arg min_θ̂ ∬ C(θ − θ̂) p(x, θ) dx dθ

which seeks the value of θ̂ that minimizes the average cost.
Cost Functions

• The cost function C(x) is typically one of the following (sketched in code after the list):
  1. Quadratic: C(x) = x²
  2. Absolute: C(x) = |x|
  3. Hit-or-miss: C(x) = 0 if |x| < δ, and C(x) = 1 if |x| > δ
• Additional cost functions include Huber's robust loss and the ε-insensitive loss.
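A minimal sketch (added here for illustration) of the three cost functions; δ is a free parameter of the hit-or-miss cost:

```python
import numpy as np

def quadratic_cost(e):
    """Quadratic cost C(e) = e**2, which leads to the MMSE estimator (posterior mean)."""
    return np.asarray(e) ** 2

def absolute_cost(e):
    """Absolute cost C(e) = |e|, which leads to the posterior median."""
    return np.abs(e)

def hit_or_miss_cost(e, delta=0.05):
    """Hit-or-miss cost: 0 inside a ±delta band around zero, 1 outside; leads to the MAP estimator."""
    return (np.abs(e) > delta).astype(float)

errors = np.array([-0.2, -0.01, 0.0, 0.03, 0.4])
print(quadratic_cost(errors), absolute_cost(errors), hit_or_miss_cost(errors))
```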
Cost Functions

• These three cost functions are favoured because we can find the minimum-cost solution in closed form. We will introduce the solutions next.
• Functions 1 and 3 are slightly easier to use than 2, so we'll concentrate on those.
• Regardless of the cost function, the above double integral can be evaluated and minimized using the rule for joint probabilities:

  p(x, θ) = p(θ | x) p(x).
Cost Functions

• This results in

  ∬ C(θ − θ̂) p(θ | x) p(x) dx dθ = ∫ [ ∫ C(θ − θ̂) p(θ | x) dθ ] p(x) dx,

where the bracketed inner integral is denoted (∗).

• Because p(x) is always nonnegative, it suffices to minimize the inner integral (∗) for each x:¹

  θ̂ = arg min_θ̂ ∫ C(θ − θ̂) p(θ | x) dθ

¹ Note that there is a slight shift in the paradigm. The double integral results in the theoretical estimate that requires knowledge of p(x). When minimizing only the inner integral, we get the optimum for a particular realization, not for all possible realizations.
1. Quadratic Cost Solution (or the MMSE
estimator)

• If we select the quadratic cost, then the Bayesian estimator is defined by

  arg min_θ̂ ∫ (θ − θ̂)² p(θ | x) dθ

• Simple differentiation gives:

  ∂/∂θ̂ ∫ (θ − θ̂)² p(θ | x) dθ = ∫ ∂/∂θ̂ [(θ − θ̂)²] p(θ | x) dθ
                                = ∫ −2(θ − θ̂) p(θ | x) dθ
1. Quadratic Cost Solution (or the MMSE
estimator)

• Setting this equal to zero gives

  ∫ −2(θ − θ̂) p(θ | x) dθ = 0
  ⇔ 2θ̂ ∫ p(θ | x) dθ = 2 ∫ θ p(θ | x) dθ
  ⇔ θ̂ ∫ p(θ | x) dθ = ∫ θ p(θ | x) dθ      (and ∫ p(θ | x) dθ = 1)
  ⇔ θ̂ = ∫ θ p(θ | x) dθ
1. Quadratic Cost Solution (or the MMSE
estimator)

• Thus, we have the minimum:

  θ̂MMSE = ∫ θ p(θ | x) dθ = E(θ | x),

i.e., the mean of the posterior PDF p(θ | x).²

• This is called the minimum mean square error estimator (MMSE estimator), because it minimizes the average squared error.

² The prior PDF, p(θ), refers to the parameter distribution before any observations are made. The posterior PDF, p(θ | x), refers to the parameter distribution after observing the data.
2. Absolute Cost Solution

• If we choose the absolute value as the cost function, we have to minimize

  arg min_θ̂ ∫ |θ − θ̂| p(θ | x) dθ

• This can be shown to be equivalent to the following condition

  ∫_{−∞}^{θ̂} p(θ | x) dθ = ∫_{θ̂}^{∞} p(θ | x) dθ
2. Absolute Cost Solution

• In other words, the estimate is the value which divides the probability mass into equal halves:

  ∫_{−∞}^{θ̂} p(θ | x) dθ = 1/2

• Thus, we have arrived at the definition of the median of the posterior PDF.
3. Hit-or-miss Cost Solution (or the MAP
estimator)

• For the hit-or-miss case, we also need to minimize the inner integral:

  θ̂ = arg min_θ̂ ∫ C(θ − θ̂) p(θ | x) dθ

with

  C(x) = 0 if |x| < δ, and C(x) = 1 if |x| > δ
3. Hit-or-miss Cost Solution (or the MAP
estimator)

• The integral becomes

  ∫ C(θ − θ̂) p(θ | x) dθ = ∫_{−∞}^{θ̂−δ} 1 · p(θ | x) dθ + ∫_{θ̂+δ}^{∞} 1 · p(θ | x) dθ

or in a simplified form

  ∫ C(θ − θ̂) p(θ | x) dθ = 1 − ∫_{θ̂−δ}^{θ̂+δ} p(θ | x) dθ
3. Hit-or-miss Cost Solution (or the MAP
estimator)

• This is minimized by maximizing

  ∫_{θ̂−δ}^{θ̂+δ} p(θ | x) dθ

• For small δ and smooth p(θ | x), the maximum of the integral occurs at the maximum of p(θ | x).
• Therefore, the estimator is the mode (the location of the highest value) of the posterior PDF, hence the name Maximum a Posteriori (MAP) estimator.
3. Hit-or-miss Cost Solution (or the MAP
estimator)

• Note that the MAP estimator

  θ̂MAP = arg max_θ p(θ | x)

is calculated, using Bayes' rule, as:

  θ̂MAP = arg max_θ p(x | θ) p(θ) / p(x)
3. Hit-or-miss Cost Solution (or the MAP
estimator)

• Since p(x) does not depend on θ, it is equivalent to maximize only the numerator:

  θ̂MAP = arg max_θ p(x | θ) p(θ)

• Incidentally, this is close to the ML estimator:

  θ̂ML = arg max_θ p(x | θ)

The only difference is the inclusion of the prior PDF.


Summary

• To summarize, the three most widely used Bayesian estimators are the following (a numerical sketch is given after the list):
  1. The MMSE estimator: θ̂MMSE = E(θ | x)
  2. The median: the θ̂ satisfying ∫_{−∞}^{θ̂} p(θ | x) dθ = 1/2
  3. The MAP estimator: θ̂MAP = arg max_θ p(x | θ) p(θ)
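A minimal numerical sketch (not from the original slides) of all three estimators for the coin example, using a gridded posterior with the Gaussian prior (mean 0.5, σ = 0.1) from the example that follows:

```python
import numpy as np

mu = np.linspace(0.0, 1.0, 20001)
d_mu = mu[1] - mu[0]
heads, tails, sigma = 3, 0, 0.1

# Unnormalized posterior = likelihood * Gaussian prior centered at 0.5.
post = mu**heads * (1 - mu)**tails * np.exp(-(mu - 0.5)**2 / (2 * sigma**2))
post /= post.sum() * d_mu                        # normalize to integrate to one

cdf = np.cumsum(post) * d_mu                     # crude CDF on the grid

mmse = np.sum(mu * post) * d_mu                  # posterior mean  -> MMSE estimate
median = mu[np.searchsorted(cdf, 0.5)]           # posterior median
map_est = mu[np.argmax(post)]                    # posterior mode  -> MAP estimate

print(f"MMSE   {mmse:.3f}")
print(f"Median {median:.3f}")
print(f"MAP    {map_est:.3f}")                   # ≈ 0.554, matching the example below
```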
Example

• Consider the case of tossing a coin three times, resulting in three heads.
• In the example, we used the Gaussian prior

  p(µ) = 1/√(2πσ²) · exp(−(µ − 0.5)² / (2σ²)).

• Now µ̂MAP becomes

  µ̂MAP = arg max_µ p(x | µ) p(µ)
        = arg max_µ [ µ^#heads (1 − µ)^#tails · 1/√(2πσ²) · exp(−(µ − 0.5)² / (2σ²)) ]
Example

• Let's simplify the arithmetic by setting #heads = 3 and #tails = 0:

  µ̂MAP = arg max_µ [ µ³ · 1/√(2πσ²) · exp(−(µ − 0.5)² / (2σ²)) ]

• Equivalently, we can maximize its logarithm:

  arg max_µ [ 3 ln µ − ln √(2πσ²) − (µ − 0.5)² / (2σ²) ]
Example

• Now,

  ∂/∂µ ln[p(x | µ) p(µ)] = 3/µ − (µ − 0.5)/σ² = 0,

when

  µ² − 0.5µ − 3σ² = 0.

This happens when

  µ = ( 0.5 ± √(0.25 − 4 · 1 · (−3σ²)) ) / 2 = 0.25 ± √(0.25 + 12σ²) / 2.
Example

• If we substitute the value used in the example, σ = 0.1, we get

  µ̂MAP = 0.25 + √0.37 / 2 ≈ 0.554.

• Thus, we have found the analytical solution for the maximum of the posterior curve on slide 5.
Vector Parameter Case for MMSE

• In the vector parameter case, the MMSE estimator is

  θ̂MMSE = E(θ | x)

or, more explicitly,

  θ̂MMSE = [ ∫ θ1 p(θ | x) dθ,  ∫ θ2 p(θ | x) dθ,  …,  ∫ θp p(θ | x) dθ ]ᵀ
Vector Parameter Case for MMSE

• In the linear model case, there exists a straightforward solution. If the observed data can be modeled as

  x = Hθ + w,

where θ ∼ N(µθ, Cθ) and w ∼ N(0, Cw), then

  E(θ | x) = µθ + Cθ Hᵀ (H Cθ Hᵀ + Cw)⁻¹ (x − Hµθ)

Vector Parameter Case for MMSE

• It is possible to derive an alternative form resembling the LS estimator (exercise):

  E(θ | x) = µθ + (Cθ⁻¹ + Hᵀ Cw⁻¹ H)⁻¹ Hᵀ Cw⁻¹ (x − Hµθ).

• Note that this becomes the LS estimator if µθ = 0, Cθ = I and Cw = σw² I (a numerical check of the two forms follows below).
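As a sanity check (added here, not part of the original slides), a small sketch that evaluates both forms of the linear-model MMSE estimator on random data and verifies that they agree numerically; the dimensions and covariances are arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3

H = rng.standard_normal((N, p))
mu_theta = rng.standard_normal(p)
C_theta = np.diag(rng.uniform(0.5, 2.0, p))      # prior covariance of theta
C_w = 0.25 * np.eye(N)                           # noise covariance

theta = rng.multivariate_normal(mu_theta, C_theta)
x = H @ theta + rng.multivariate_normal(np.zeros(N), C_w)

# Form 1: E(theta | x) = mu + C_th H^T (H C_th H^T + C_w)^{-1} (x - H mu)
form1 = mu_theta + C_theta @ H.T @ np.linalg.solve(H @ C_theta @ H.T + C_w,
                                                   x - H @ mu_theta)

# Form 2: E(theta | x) = mu + (C_th^{-1} + H^T C_w^{-1} H)^{-1} H^T C_w^{-1} (x - H mu)
Cw_inv = np.linalg.inv(C_w)
form2 = mu_theta + np.linalg.solve(np.linalg.inv(C_theta) + H.T @ Cw_inv @ H,
                                   H.T @ Cw_inv @ (x - H @ mu_theta))

print(np.allclose(form1, form2))                 # True: the two forms coincide
```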
Vector Parameter Case for the MAP

• The MAP estimator can also be extended to vector parameters:

  θ̂MAP = arg max_θ p(θ | x)

or, using Bayes' rule,

  θ̂MAP = arg max_θ p(x | θ) p(θ)

• Note that, in general, this is different from p scalar MAPs. A scalar MAP would maximize each parameter θi individually, whereas the vector MAP seeks the global maximum over the whole parameter space.
Example: MMSE Estimation of Sinusoidal
Parameters

• Consider the data model

  x[n] = a cos 2πf0 n + b sin 2πf0 n + w[n],   n = 0, 1, . . . , N − 1

or in vector form

  x = Hθ + w,

where

  H = [ 1, 0;
        cos 2πf0, sin 2πf0;
        cos 4πf0, sin 4πf0;
        ⋮
        cos 2(N − 1)πf0, sin 2(N − 1)πf0 ]    and    θ = [a, b]ᵀ
Example: MMSE Estimation of Sinusoidal
Parameters

• We depart from the classical model by assuming that a and b are random variables with prior PDF θ ∼ N(0, σθ² I). Also, w is assumed Gaussian with variance σw² (i.e., w ∼ N(0, σw² I)) and independent of θ.
• Using the second version of the formula for the linear model (on slide 28), we get the MMSE estimator:

  E(θ | x) = µθ + (Cθ⁻¹ + Hᵀ Cw⁻¹ H)⁻¹ Hᵀ Cw⁻¹ (x − Hµθ)
Example: MMSE Estimation of Sinusoidal
Parameters

or, in our case,³

  E(θ | x) = ( (1/σθ²) I + Hᵀ (1/σw²) I H )⁻¹ Hᵀ (1/σw²) I x
           = ( (1/σθ²) I + (1/σw²) Hᵀ H )⁻¹ (1/σw²) Hᵀ x

³ Note the correspondence with ridge regression. It holds that ridge regression is equivalent to the Bayesian estimator with a Gaussian prior for the coefficients. It also holds that the LASSO is equivalent to the Bayesian estimator with a Laplacian prior.
Example: MMSE Estimation of Sinusoidal
Parameters

• In earlier examples we have seen that the columns of H are nearly orthogonal (exactly orthogonal if f0 = k/N):

  Hᵀ H ≈ (N/2) I

• Thus,

  E(θ | x) ≈ ( (1/σθ²) I + (N/(2σw²)) I )⁻¹ (1/σw²) Hᵀ x
           = [ (1/σw²) / (1/σθ² + N/(2σw²)) ] Hᵀ x.
Example: MMSE Estimation of Sinusoidal
Parameters

• In all, the MMSE estimates become

  âMMSE = [ 1 / (1 + (2σw²/N)/σθ²) ] · (2/N) Σ_{n=0}^{N−1} x[n] cos 2πf0 n

  b̂MMSE = [ 1 / (1 + (2σw²/N)/σθ²) ] · (2/N) Σ_{n=0}^{N−1} x[n] sin 2πf0 n
Example: MMSE Estimation of Sinusoidal
Parameters

• For comparison, recall that the classical MVU estimator is

  âMVU = (2/N) Σ_{n=0}^{N−1} x[n] cos 2πf0 n

  b̂MVU = (2/N) Σ_{n=0}^{N−1} x[n] sin 2πf0 n
Example: MMSE Estimation of Sinusoidal
Parameters

• The difference can be interpreted as a weighting between the prior knowledge and the data.
• If the prior knowledge is unreliable (σθ² large), then 1 / (1 + (2σw²/N)/σθ²) ≈ 1 and the two estimators are almost equal.
• If the data is unreliable (σw² large), then the coefficient 1 / (1 + (2σw²/N)/σθ²) is small, making the estimate close to the mean of the prior PDF.
Example: MMSE Estimation of Sinusoidal
Parameters

• An example run is illustrated below (a simulation sketch is given after this description). In this case, N = 100, f0 = 15/N, σθ² = 0.48566, and σw² = 4.1173. Altogether M = 500 tests were performed.
• Since the prior PDF has a small variance, the estimator gains a lot from using it. This is seen as a significant difference between the MSEs of the two estimators.
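A minimal Monte Carlo sketch of this experiment (my reconstruction, not the course's BayesSinusoid.m; the parameter values are those quoted above):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 100, 500
f0 = 15 / N
sigma_theta2, sigma_w2 = 0.48566, 4.1173

n = np.arange(N)
H = np.column_stack([np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)])
shrink = 1.0 / (1.0 + (2 * sigma_w2 / N) / sigma_theta2)     # Bayesian shrinkage factor

err_mvu, err_mmse = [], []
for _ in range(M):
    theta = rng.normal(0.0, np.sqrt(sigma_theta2), size=2)      # random (a, b) from the prior
    x = H @ theta + rng.normal(0.0, np.sqrt(sigma_w2), size=N)  # noisy sinusoid

    theta_mvu = (2.0 / N) * (H.T @ x)        # classical estimator
    theta_mmse = shrink * theta_mvu          # approximate MMSE estimator (orthogonal columns)
    err_mvu.append((theta_mvu - theta) ** 2)
    err_mmse.append((theta_mmse - theta) ** 2)

print("MSE classical:", np.mean(err_mvu, axis=0))   # roughly 2*sigma_w2/N per coefficient
print("MSE Bayesian: ", np.mean(err_mmse, axis=0))  # smaller, since the prior is tight here
```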
Example: MMSE Estimation of Sinusoidal
Parameters

[Figure: histograms of the estimates over the 500 runs. Top row: classical estimator of a (MSE = 0.072474) and of b (MSE = 0.092735). Bottom row: Bayesian estimator of a (MSE = 0.061919) and of b (MSE = 0.076355).]
Example: MMSE Estimation of Sinusoidal
Parameters

• If the prior has a higher variance, the Bayesian approach does not perform that much better. In the pictures below, σθ² = 2.1937 and σw² = 1.9078. The difference in performance between the two approaches is negligible.
Example: MMSE Estimation of Sinusoidal
Parameters

[Figure: histograms of the estimates for the higher-variance prior. Top row: classical estimator of a (MSE = 0.040066) and of b (MSE = 0.034727). Bottom row: Bayesian estimator of a (MSE = 0.03951) and of b (MSE = 0.034477).]
Example: MMSE Estimation of Sinusoidal
Parameters

• The program code is available at http://www.cs.tut.fi/courses/SGN-2606/BayesSinusoid.m
Example: MAP Estimator

• Assume that

  p(x[n] | θ) = θ exp(−θ x[n]) if x[n] > 0, and 0 if x[n] < 0,

with the x[n] conditionally IID, and the prior of θ:

  p(θ) = λ exp(−λθ) if θ > 0, and 0 if θ < 0

• Now, θ is the unknown RV and λ is known.


Example: MAP Estimator

• Then the MAP estimator is found by maximizing p(θ | x), or equivalently p(x | θ) p(θ).
• Because both PDFs have an exponential form, it is easier to maximize the logarithm instead:

  θ̂ = arg max_θ ( ln p(x | θ) + ln p(θ) ).
Example: MAP Estimator

• Now,

  ln p(x | θ) + ln p(θ) = ln [ ∏_{n=0}^{N−1} θ exp(−θ x[n]) ] + ln[λ exp(−λθ)]
                        = ln [ θ^N exp(−θ Σ_{n=0}^{N−1} x[n]) ] + ln[λ exp(−λθ)]
                        = N ln θ − Nθx̄ + ln λ − λθ

• Differentiation produces

  d/dθ [ ln p(x | θ) + ln p(θ) ] = N/θ − Nx̄ − λ
Example: MAP Estimator

• Setting it equal to zero produces the MAP estimator (checked numerically below):

  θ̂ = 1 / (x̄ + λ/N)
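A quick sketch (added for illustration) that compares this closed-form MAP estimate with a direct numerical maximization of the log-posterior on simulated exponential data; λ and the true θ are arbitrary example values, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true, lam, N = 2.0, 1.5, 200               # example values (not from the slides)
x = rng.exponential(scale=1.0 / theta_true, size=N)   # x[n] ~ p(x[n] | theta)

xbar = x.mean()
theta_map = 1.0 / (xbar + lam / N)               # closed-form MAP estimator

# Brute-force check: maximize N*ln(theta) - N*theta*xbar - lam*theta on a grid.
grid = np.linspace(0.01, 10.0, 100000)
log_post = N * np.log(grid) - N * grid * xbar - lam * grid
theta_grid = grid[np.argmax(log_post)]

print(theta_map, theta_grid)                     # the two values should agree closely
```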
Example: Deconvolution

• Consider the situation where a signal s[n] passes through a channel with impulse response h[n] and is further corrupted by noise w[n]:

  x[n] = h[n] ∗ s[n] + w[n]
       = Σ_{k=0}^{K} h[k] s[n − k] + w[n],   n = 0, 1, . . . , N − 1
Example: Deconvolution

• Since convolution commutes, we can write this as

  x[n] = Σ_{k=0}^{ns−1} h[n − k] s[k] + w[n]

• In matrix form this is expressed by x = Hs + w, where H is the N × ns convolution matrix

  H = [ h[0]      0         ···   0         ;
        h[1]      h[0]      ···   0         ;
        ⋮         ⋮                ⋮         ;
        h[N − 1]  h[N − 2]  ···   h[N − ns] ]

with x = [x[0], …, x[N − 1]]ᵀ, s = [s[0], …, s[ns − 1]]ᵀ and w = [w[0], …, w[N − 1]]ᵀ.
Example: Deconvolution

• Thus, we have again the linear model

  x = Hs + w,

where the unknown parameter θ is the original signal s.
• The noise is assumed Gaussian: w[n] ∼ N(0, σ²).
• A reasonable assumption for the signal is that s ∼ N(0, Cs) with [Cs]ij = rss[i − j], where rss is the autocorrelation function of s.
• According to slide 28, the MMSE estimator is (a small sketch follows below)

  E(s | x) = µs + Cs Hᵀ (H Cs Hᵀ + Cw)⁻¹ (x − Hµs)
           = Cs Hᵀ (H Cs Hᵀ + σ² I)⁻¹ x
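A minimal sketch (my illustration; the channel, autocorrelation model, and noise level are made-up example values) of this MMSE deconvolution, building the convolution matrix H and applying the formula above with ns = N:

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(3)
N = 200
h = np.array([1.0, 0.7, 0.3])                    # example channel impulse response
sigma2 = 0.1                                     # noise variance (example value)

# Signal prior: zero-mean Gaussian with exponentially decaying autocorrelation.
rss = 0.9 ** np.abs(np.arange(N))
Cs = toeplitz(rss)                               # [Cs]_ij = rss[i - j]

# Convolution matrix H (lower-triangular Toeplitz), so that x = H s + w.
H = toeplitz(np.concatenate([h, np.zeros(N - len(h))]), np.zeros(N))

s = rng.multivariate_normal(np.zeros(N), Cs)     # draw a signal from the prior
x = H @ s + rng.normal(0.0, np.sqrt(sigma2), N)  # observed: filtered and noisy

# MMSE estimate: s_hat = Cs H^T (H Cs H^T + sigma2 I)^{-1} x
s_hat = Cs @ H.T @ np.linalg.solve(H @ Cs @ H.T + sigma2 * np.eye(N), x)

print("MSE of observation vs. signal:", np.mean((x - s) ** 2))
print("MSE of MMSE estimate:         ", np.mean((s_hat - s) ** 2))
```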
Example: Deconvolution

• In general, the form of the estimator varies a lot between different cases. However, as a special case:
• When H = I, the channel is the identity and only noise is present. In this case

  ŝ = Cs (Cs + σ² I)⁻¹ x

This estimator is called the Wiener filter. For example, in the single-data-point case,

  ŝ[0] = rss[0] / (rss[0] + σ²) · x[0]

Thus, the noise variance acts as a parameter describing the reliability of the data relative to the prior.
