
Chapter 2 (Part 2):
Bayesian Decision Theory
(Sections 2.3-2.5)

Classifiers, Discriminant Functions and Decision Surfaces
Error Bounds

Bayesian Decision Theory

• Generalization of the preceding ideas:

• Use of more than one feature
• Use of more than two states of nature
• Allowing actions other than deciding on the state of nature
• Allowing actions other than classification primarily allows the possibility
of rejection: refusing to make a decision in close or doubtful cases!
• Introducing a loss function that is more general than the probability
of error
• The loss function states how costly each action taken is



The Normal Density
• Univariate density
• A continuous density
• Many processes are asymptotically Gaussian
• Handwritten characters and speech sounds can be viewed as an ideal
prototype corrupted by a random process (central limit theorem)

1 ( "
1 x−µ %
2+
2
N(x; µ, σ ) = exp *− $ '-
2π σ *) 2 # σ & -,
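As a quick sanity check of this formula, a minimal sketch in Python/NumPy (the function name and the example values are mine, not from the slides):

```python
import numpy as np

def univariate_normal(x, mu, sigma):
    """Evaluate N(x; mu, sigma^2) exactly as written above."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# The standard normal density at its mean is 1 / sqrt(2*pi) ~ 0.3989
print(univariate_normal(0.0, mu=0.0, sigma=1.0))
```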

The Normal Density
• Multivariate density: the multivariate normal density in d
dimensions is:

N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu) \right]

x = (x1, x2, …, xd)^t feature vector
µ = (µ1, µ2, …, µd)^t mean vector
Σ = d×d covariance matrix
|Σ| and Σ^-1 are its determinant and inverse, respectively
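A minimal sketch of evaluating this density with NumPy (the function name and 2D example are mine; scipy.stats.multivariate_normal computes the same quantity):

```python
import numpy as np

def multivariate_normal_pdf(x, mu, Sigma):
    """Evaluate the d-dimensional normal density N(x; mu, Sigma)."""
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)      # (x - mu)^t Sigma^{-1} (x - mu)
    norm_const = (2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(multivariate_normal_pdf(np.array([0.5, -0.5]), mu, Sigma))
```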
Discriminant Functions
for the Normal Density

• Minimum error-rate classification can be achieved by the
discriminant function

• gi(x) = ln p(x | ωi) + ln P(ωi)

• Case of multivariate normal


g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)
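The formula above translates directly into code. A minimal sketch (the function names and the decision helper are mine, not from the text):

```python
import numpy as np

def g_i(x, mu_i, Sigma_i, prior_i):
    """General Gaussian discriminant:
    -1/2 (x-mu)^t Sigma^{-1} (x-mu) - d/2 ln 2pi - 1/2 ln|Sigma| + ln P(omega_i)."""
    d = mu_i.shape[0]
    diff = x - mu_i
    quad = diff @ np.linalg.solve(Sigma_i, diff)
    _, logdet = np.linalg.slogdet(Sigma_i)
    return -0.5 * quad - 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet + np.log(prior_i)

def decide(x, means, covs, priors):
    """Assign x to the class with the largest discriminant value."""
    return int(np.argmax([g_i(x, m, S, p) for m, S, p in zip(means, covs, priors)]))
```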




Discriminant Functions
for the Normal Density…

• Case Σi = σ²·I (I stands for the identity matrix)

• σij = 0 (the features are statistically independent) and σii = σ² is the same for all features

g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)

• In this case the quadratic term reduces to ‖x − µi‖² / (2σ²), and the terms (d/2) ln 2π and (1/2) ln |Σi| are the same for every class, so they can be dropped



Discriminant Functions
for the Normal Density…
• Disregarding the term x^t x (which is the same for all classes), we get a linear discriminant function

g_i(x) = w_i^t x + w_{i0}

where:

w_i = \frac{\mu_i}{\sigma^2}; \quad w_{i0} = -\frac{1}{2\sigma^2} \mu_i^t \mu_i + \ln P(\omega_i)

(w_{i0} is called the threshold, or bias, for the ith category!)
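A minimal sketch of this linear machine for the case Σi = σ²I (the function name, example means and priors are mine, for illustration only):

```python
import numpy as np

def classify_sigma2I(x, means, priors, sigma2):
    """Linear discriminants with w_i = mu_i / sigma^2 and
    w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(omega_i)."""
    scores = []
    for mu_i, p_i in zip(means, priors):
        w_i = mu_i / sigma2
        w_i0 = -(mu_i @ mu_i) / (2.0 * sigma2) + np.log(p_i)
        scores.append(w_i @ x + w_i0)
    return int(np.argmax(scores))

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
print(classify_sigma2I(np.array([1.0, 1.0]), means, priors=[0.5, 0.5], sigma2=1.0))  # -> 0
```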


Discriminant Functions
for the Normal Density…
• A classifier that uses linear discriminant functions is called
“a linear machine”

• The decision surfaces for a linear machine are hyperplanes
defined by gi(x) = gj(x)

Discriminant Functions
for the Normal Density…
• The hyperplane separating Ri and Rj passes through the point

x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln\frac{P(\omega_i)}{P(\omega_j)} \, (\mu_i - \mu_j)

and is always orthogonal to the line linking the means!

\text{if } P(\omega_i) = P(\omega_j) \text{ then } x_0 = \frac{1}{2} (\mu_i + \mu_j)

and, with equal priors, the classifier reduces to a minimum-distance rule:

g_i(x) = -\|x - \mu_i\|^2
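A small numeric check of x0 and the orthogonality claim (the 2D means, priors, and σ² below are made-up values for illustration):

```python
import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 1.0])
p_i, p_j = 0.7, 0.3
sigma2 = 1.0

w = mu_i - mu_j                         # normal vector of the separating hyperplane
x0 = 0.5 * (mu_i + mu_j) - (sigma2 / (w @ w)) * np.log(p_i / p_j) * w

# The hyperplane is {x : w^t (x - x0) = 0}; since its normal w is mu_i - mu_j,
# it is orthogonal to the line linking the two means by construction.
print("x0 =", x0)
print("g_i(x0) == g_j(x0)?",
      np.isclose(-np.sum((x0 - mu_i) ** 2) / (2 * sigma2) + np.log(p_i),
                 -np.sum((x0 - mu_j) ** 2) / (2 * sigma2) + np.log(p_j)))
```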
Discriminant Functions
for the Normal Density…
• Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!)

g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma| + \ln P(\omega_i)

• Expand the quadratic form and disregard the term x^t Σ^{-1} x (it is the same for every class); the discriminant is again linear:

g_i(x) = w_i^t x + w_{i0}

where:

w_i = \Sigma^{-1} \mu_i; \quad w_{i0} = -\frac{1}{2} \mu_i^t \Sigma^{-1} \mu_i + \ln P(\omega_i)
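A minimal sketch for this shared-covariance case (the function name and example data are mine):

```python
import numpy as np

def linear_params_shared_cov(means, priors, Sigma):
    """Return (w_i, w_i0) with w_i = Sigma^{-1} mu_i and
    w_i0 = -1/2 mu_i^t Sigma^{-1} mu_i + ln P(omega_i)."""
    params = []
    for mu_i, p_i in zip(means, priors):
        w_i = np.linalg.solve(Sigma, mu_i)
        w_i0 = -0.5 * (mu_i @ w_i) + np.log(p_i)
        params.append((w_i, w_i0))
    return params

Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
params = linear_params_shared_cov(means, [0.5, 0.5], Sigma)
x = np.array([1.5, 1.0])
print(int(np.argmax([w @ x + w0 for w, w0 in params])))
```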




Discriminant Functions for
the Normal Density…

x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\ln\left[ P(\omega_i) / P(\omega_j) \right]}{(\mu_i - \mu_j)^t \Sigma^{-1} (\mu_i - \mu_j)} \, (\mu_i - \mu_j)

• Comments about this hyperplane:

– It passes through x0
– It is NOT, in general, orthogonal to the line linking the means
– What happens when P(ωi) = P(ωj)? Then x0 is the midpoint between the two means
– If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean
Discriminant Functions for
the Normal Density…

g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)

• When P(ωi) is the same for each of the c classes, the prior term can be dropped and the classifier assigns x to the class of the nearest mean

• Case I (Σi = σ²I): Euclidean distance classifier

g_i(x) = -\|x - \mu_i\|^2

• Case II (Σi = Σ): Mahalanobis distance classifier

g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma^{-1} (x - \mu_i)
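A minimal sketch of these two minimum-distance classifiers (the function names are mine):

```python
import numpy as np

def euclidean_classify(x, means):
    """Case I: assign x to the nearest mean in Euclidean distance."""
    return int(np.argmin([np.sum((x - mu) ** 2) for mu in means]))

def mahalanobis_classify(x, means, Sigma):
    """Case II: assign x to the nearest mean in Mahalanobis distance (shared Sigma)."""
    dists = [(x - mu) @ np.linalg.solve(Sigma, x - mu) for mu in means]
    return int(np.argmin(dists))
```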
Mahalanobis Distance

Covariance matrix:

\Sigma = \begin{bmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{bmatrix}

Points: A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)

[Figure: scatter plot of points A, B and C with the elongated equal-density contours implied by Σ]

Squared distances:
Euclid²(A, B) = 0.5
Euclid²(A, C) = 2
Mahal²(A, B) = 5
Mahal²(A, C) = 4

• Under the Euclidean metric A is closer to B, but under the Mahalanobis metric induced by this Σ, A is closer to C
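The numbers above can be verified with a few lines of NumPy (the helper names are mine):

```python
import numpy as np

Sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])
A, B, C = np.array([0.5, 0.5]), np.array([0.0, 1.0]), np.array([1.5, 1.5])

def sq_euclid(p, q):
    return float(np.sum((p - q) ** 2))

def sq_mahal(p, q, Sigma):
    d = p - q
    return float(d @ np.linalg.solve(Sigma, d))

print(sq_euclid(A, B), sq_euclid(A, C))              # 0.5  2.0
print(sq_mahal(A, B, Sigma), sq_mahal(A, C, Sigma))  # 5.0  4.0 (up to rounding)
```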
Discriminant Functions for
the Normal Density…
• Case Σi = arbitrary
• The covariance matrices are different for each category, and the discriminant is quadratic:

g_i(x) = x^t W_i x + w_i^t x + w_{i0}

where:

W_i = -\frac{1}{2} \Sigma_i^{-1}

w_i = \Sigma_i^{-1} \mu_i

w_{i0} = -\frac{1}{2} \mu_i^t \Sigma_i^{-1} \mu_i - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)
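A minimal sketch of the quadratic (arbitrary-Σi) case (the function names are mine):

```python
import numpy as np

def quadratic_params(mu_i, Sigma_i, prior_i):
    """W_i = -1/2 Sigma_i^{-1}, w_i = Sigma_i^{-1} mu_i,
    w_i0 = -1/2 mu_i^t Sigma_i^{-1} mu_i - 1/2 ln|Sigma_i| + ln P(omega_i)."""
    Sigma_inv = np.linalg.inv(Sigma_i)
    W_i = -0.5 * Sigma_inv
    w_i = Sigma_inv @ mu_i
    _, logdet = np.linalg.slogdet(Sigma_i)
    w_i0 = -0.5 * (mu_i @ w_i) - 0.5 * logdet + np.log(prior_i)
    return W_i, w_i, w_i0

def g_quadratic(x, W_i, w_i, w_i0):
    return x @ W_i @ x + w_i @ x + w_i0
```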

Discriminant Functions for
the Normal Density…

[Figure: with arbitrary Σi, the decision regions can be disconnected even when P(ω1) = P(ω2)]
Discriminant Functions for
the Normal Density

[Figures: decision regions for two-dimensional Gaussian data]
Error Probabilities and Integrals

• 2-class problem: there are two types of errors (x falls in R2 while the true state of nature is ω1, or x falls in R1 while the true state of nature is ω2):

P(error) = P(x \in R_2, \omega_1) + P(x \in R_1, \omega_2) = \int_{R_2} p(x|\omega_1) P(\omega_1)\, dx + \int_{R_1} p(x|\omega_2) P(\omega_2)\, dx

• Multi-class problem
– It is simpler to compute the probability of being correct (there are more ways to be
wrong than to be right):

P(correct) = \sum_{i=1}^{c} \int_{R_i} p(x|\omega_i) P(\omega_i)\, dx
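A minimal Monte Carlo sketch of the 2-class error probability (the 1D Gaussians, priors, and sample size are made-up values, chosen so the exact answer is known):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem: p(x|w1) = N(0, 1), p(x|w2) = N(2, 1), equal priors.
n = 200_000
labels = rng.integers(0, 2, size=n)                   # true state of nature
x = rng.normal(loc=np.where(labels == 0, 0.0, 2.0))   # draw x from its class-conditional

# With equal priors and equal variances the Bayes boundary is the midpoint x = 1.
decide = (x > 1.0).astype(int)
print("estimated P(error):", np.mean(decide != labels))  # ~0.159; exact value is Phi(-1)
```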
Error Bounds for Normal Densities

• The Bayes rule guarantees the lowest error rate, but it does not tell us what that error actually is
• In the 2-category case the error can be bounded analytically, giving an upper bound on the error
• However, the exact calculation of the error for the general Gaussian case (case 3) is extremely difficult
Chernoff Bound

• Inequality: for a, b ≥ 0 and 0 ≤ β ≤ 1,

\min[a, b] \le a^{\beta} b^{1-\beta}

• To derive a bound for the error, apply this inequality to the Bayes error integral:

P(error) = \int \min\left[ P(\omega_1) p(x|\omega_1),\; P(\omega_2) p(x|\omega_2) \right] dx \le P(\omega_1)^{\beta} P(\omega_2)^{1-\beta} \int p(x|\omega_1)^{\beta}\, p(x|\omega_2)^{1-\beta}\, dx

• Assuming the conditional probabilities are normal,

\int p(x|\omega_1)^{\beta}\, p(x|\omega_2)^{1-\beta}\, dx = e^{-k(\beta)}

where

k(\beta) = \frac{\beta(1-\beta)}{2} (\mu_2 - \mu_1)^t \left[ \beta\Sigma_1 + (1-\beta)\Sigma_2 \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln\frac{\left| \beta\Sigma_1 + (1-\beta)\Sigma_2 \right|}{|\Sigma_1|^{\beta}\, |\Sigma_2|^{1-\beta}}

• The Chernoff bound on P(error) is found by determining the value of β that minimizes exp(−k(β))

• This is a 1D optimization regardless of the dimensionality of the features
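A minimal sketch of this 1D optimization with SciPy (the example means, covariances, and priors are invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def k(beta, mu1, mu2, S1, S2):
    """k(beta) for two Gaussian class-conditional densities, as given above."""
    dm = mu2 - mu1
    S = beta * S1 + (1.0 - beta) * S2
    quad = 0.5 * beta * (1.0 - beta) * (dm @ np.linalg.solve(S, dm))
    logdet = lambda M: np.linalg.slogdet(M)[1]
    return quad + 0.5 * (logdet(S) - beta * logdet(S1) - (1.0 - beta) * logdet(S2))

def chernoff_bound(p1, p2, mu1, mu2, S1, S2):
    """Minimize P1^beta * P2^(1-beta) * exp(-k(beta)) over beta in [0, 1]."""
    obj = lambda b: p1 ** b * p2 ** (1.0 - b) * np.exp(-k(b, mu1, mu2, S1, S2))
    res = minimize_scalar(obj, bounds=(0.0, 1.0), method="bounded")
    return res.fun, res.x   # the bound and the minimizing beta

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
S1 = np.eye(2)
S2 = np.array([[2.0, 0.5], [0.5, 1.0]])
print(chernoff_bound(0.5, 0.5, mu1, mu2, S1, S2))
```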


Bhattacharyya Bound
• Bhattacharyya bound: assume β = 1/2
• Computationally simpler, but a slightly less tight bound:

P(error) \le \sqrt{P(\omega_1) P(\omega_2)}\; e^{-k(1/2)}

where

k(1/2) = \frac{1}{8} (\mu_2 - \mu_1)^t \left[ \frac{\Sigma_1 + \Sigma_2}{2} \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln\frac{\left| (\Sigma_1 + \Sigma_2)/2 \right|}{\sqrt{|\Sigma_1|\, |\Sigma_2|}}

• When the two covariance matrices are equal, the log term vanishes and k(1/2) equals one eighth of the squared Mahalanobis distance between the two means
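A companion sketch for the Bhattacharyya bound (self-contained, with the same invented example values as above):

```python
import numpy as np

def bhattacharyya_bound(p1, p2, mu1, mu2, S1, S2):
    """P(error) <= sqrt(P1 * P2) * exp(-k(1/2)), with k(1/2) as given above."""
    dm = mu2 - mu1
    S = 0.5 * (S1 + S2)
    quad = 0.125 * (dm @ np.linalg.solve(S, dm))
    logdet = lambda M: np.linalg.slogdet(M)[1]
    k_half = quad + 0.5 * (logdet(S) - 0.5 * logdet(S1) - 0.5 * logdet(S2))
    return np.sqrt(p1 * p2) * np.exp(-k_half)

print(bhattacharyya_bound(0.5, 0.5,
                          np.array([0.0, 0.0]), np.array([3.0, 1.0]),
                          np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])))
```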
Error Bounds for Gaussian Distributions
(example: 2-category, 2D data)

Best Chernoff error bound: 0.008190

Bhattacharyya error bound: 0.008191

True error (by numerical integration): 0.0021
Pattern Classification

All materials in these slides were taken from
Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000,
with the permission of the authors and the publisher.

