
Chapter 2 (Part 2):
Bayesian Decision Theory
(Sections 2.3-2.5)

Classifiers, Discriminant Functions and Decision Surfaces
Error Bounds

Bayesian Decision Theory

• Generalization of the preceding ideas:

• Use of more than one feature
• Use of more than two states of nature
• Allowing actions other than deciding on the state of nature
• Allowing actions other than classification primarily allows the possibility
of rejection: refusing to make a decision in close or doubtful cases!
• Introducing a loss function that is more general than the probability
of error
• The loss function states how costly each action taken is



The Normal Density
• Univariate density
• A continuous density
• Many processes are asymptotically Gaussian
• Handwritten characters and speech sounds can be viewed as an ideal
prototype corrupted by a random process (central limit theorem)

1 ( "
1 x−µ %
2+
2
N(x; µ, σ ) = exp *− $ '-
2π σ *) 2 # σ & -,
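As a quick sanity check of this formula, a minimal sketch in Python/NumPy (the function name and the example values are mine, not from the slides):

```python
import numpy as np

def univariate_normal(x, mu, sigma):
    """Evaluate N(x; mu, sigma^2) exactly as written above."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# The standard normal density at its mean is 1 / sqrt(2*pi) ~ 0.3989
print(univariate_normal(0.0, mu=0.0, sigma=1.0))
```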

The Normal Density
• Multivariate density: the multivariate normal density in d
dimensions is:

N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu) \right]

x = (x1, x2, …, xd)^t feature vector
µ = (µ1, µ2, …, µd)^t mean vector
Σ = d×d covariance matrix
|Σ| and Σ^-1 are its determinant and inverse, respectively
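A minimal sketch of evaluating this density with NumPy (the function name and 2D example are mine; scipy.stats.multivariate_normal computes the same quantity):

```python
import numpy as np

def multivariate_normal_pdf(x, mu, Sigma):
    """Evaluate the d-dimensional normal density N(x; mu, Sigma)."""
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)      # (x - mu)^t Sigma^{-1} (x - mu)
    norm_const = (2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(multivariate_normal_pdf(np.array([0.5, -0.5]), mu, Sigma))
```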
Discriminant Functions
for the Normal Density

• Minimum error-rate classification can be achieved by the
discriminant function

• gi(x) = ln p(x | ωi) + ln P(ωi)

• Case of multivariate normal


g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)
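The formula above translates directly into code. A minimal sketch (the function names and the decision helper are mine, not from the text):

```python
import numpy as np

def g_i(x, mu_i, Sigma_i, prior_i):
    """General Gaussian discriminant:
    -1/2 (x-mu)^t Sigma^{-1} (x-mu) - d/2 ln 2pi - 1/2 ln|Sigma| + ln P(omega_i)."""
    d = mu_i.shape[0]
    diff = x - mu_i
    quad = diff @ np.linalg.solve(Sigma_i, diff)
    _, logdet = np.linalg.slogdet(Sigma_i)
    return -0.5 * quad - 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet + np.log(prior_i)

def decide(x, means, covs, priors):
    """Assign x to the class with the largest discriminant value."""
    return int(np.argmax([g_i(x, m, S, p) for m, S, p in zip(means, covs, priors)]))
```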




Discriminant Functions
for the Normal Density…

• Case Σi = σ²·I (I stands for the identity matrix)

• σij = 0 (the features are statistically independent) and σii = σ² is the same for all features

g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)

• In this case the quadratic term reduces to ‖x − µi‖² / (2σ²), and the terms (d/2) ln 2π and (1/2) ln |Σi| are the same for every class, so they can be dropped



Discriminant Functions
for the Normal Density…
• Disregarding the term x^t x (which is the same for all classes), we get a linear discriminant function

g_i(x) = w_i^t x + w_{i0}

where:

w_i = \frac{\mu_i}{\sigma^2}; \quad w_{i0} = -\frac{1}{2\sigma^2} \mu_i^t \mu_i + \ln P(\omega_i)

(w_{i0} is called the threshold, or bias, for the ith category!)
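A minimal sketch of this linear machine for the case Σi = σ²I (the function name, example means and priors are mine, for illustration only):

```python
import numpy as np

def classify_sigma2I(x, means, priors, sigma2):
    """Linear discriminants with w_i = mu_i / sigma^2 and
    w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(omega_i)."""
    scores = []
    for mu_i, p_i in zip(means, priors):
        w_i = mu_i / sigma2
        w_i0 = -(mu_i @ mu_i) / (2.0 * sigma2) + np.log(p_i)
        scores.append(w_i @ x + w_i0)
    return int(np.argmax(scores))

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
print(classify_sigma2I(np.array([1.0, 1.0]), means, priors=[0.5, 0.5], sigma2=1.0))  # -> 0
```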


Discriminant Functions
for the Normal Density…
• A classifier that uses linear discriminant functions is called
“a linear machine”

• The decision surfaces for a linear machine are hyperplanes
defined by gi(x) = gj(x)

Discriminant Functions
for the Normal Density…
• The hyperplane separating Ri and Rj passes through the point

x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln\frac{P(\omega_i)}{P(\omega_j)} \, (\mu_i - \mu_j)

and is always orthogonal to the line linking the means!

\text{if } P(\omega_i) = P(\omega_j) \text{ then } x_0 = \frac{1}{2} (\mu_i + \mu_j)

and, with equal priors, the classifier reduces to a minimum-distance rule:

g_i(x) = -\|x - \mu_i\|^2
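A small numeric check of x0 and the orthogonality claim (the 2D means, priors, and σ² below are made-up values for illustration):

```python
import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 1.0])
p_i, p_j = 0.7, 0.3
sigma2 = 1.0

w = mu_i - mu_j                         # normal vector of the separating hyperplane
x0 = 0.5 * (mu_i + mu_j) - (sigma2 / (w @ w)) * np.log(p_i / p_j) * w

# The hyperplane is {x : w^t (x - x0) = 0}; since its normal w is mu_i - mu_j,
# it is orthogonal to the line linking the two means by construction.
print("x0 =", x0)
print("g_i(x0) == g_j(x0)?",
      np.isclose(-np.sum((x0 - mu_i) ** 2) / (2 * sigma2) + np.log(p_i),
                 -np.sum((x0 - mu_j) ** 2) / (2 * sigma2) + np.log(p_j)))
```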
Discriminant Functions
for the Normal Density…
• Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!)

g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma| + \ln P(\omega_i)

• Expand the quadratic form and disregard the term x^t Σ^{-1} x (it is the same for every class); the discriminant is again linear:

g_i(x) = w_i^t x + w_{i0}

where:

w_i = \Sigma^{-1} \mu_i; \quad w_{i0} = -\frac{1}{2} \mu_i^t \Sigma^{-1} \mu_i + \ln P(\omega_i)
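A minimal sketch for this shared-covariance case (the function name and example data are mine):

```python
import numpy as np

def linear_params_shared_cov(means, priors, Sigma):
    """Return (w_i, w_i0) with w_i = Sigma^{-1} mu_i and
    w_i0 = -1/2 mu_i^t Sigma^{-1} mu_i + ln P(omega_i)."""
    params = []
    for mu_i, p_i in zip(means, priors):
        w_i = np.linalg.solve(Sigma, mu_i)
        w_i0 = -0.5 * (mu_i @ w_i) + np.log(p_i)
        params.append((w_i, w_i0))
    return params

Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
params = linear_params_shared_cov(means, [0.5, 0.5], Sigma)
x = np.array([1.5, 1.0])
print(int(np.argmax([w @ x + w0 for w, w0 in params])))
```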




Discriminant Functions for
the Normal Density…

x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\ln\left[ P(\omega_i) / P(\omega_j) \right]}{(\mu_i - \mu_j)^t \Sigma^{-1} (\mu_i - \mu_j)} \, (\mu_i - \mu_j)

• Comments about this hyperplane:

– It passes through x0
– It is NOT, in general, orthogonal to the line linking the means
– What happens when P(ωi) = P(ωj)? Then x0 is the midpoint between the two means
– If P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean
Discriminant Functions for
the Normal Density…

g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)

• When P(ωi) is the same for each of the c classes, the prior term can be dropped and the classifier assigns x to the class of the nearest mean

• Case I (Σi = σ²I): Euclidean distance classifier

g_i(x) = -\|x - \mu_i\|^2

• Case II (Σi = Σ): Mahalanobis distance classifier

g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma^{-1} (x - \mu_i)
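A minimal sketch of these two minimum-distance classifiers (the function names are mine):

```python
import numpy as np

def euclidean_classify(x, means):
    """Case I: assign x to the nearest mean in Euclidean distance."""
    return int(np.argmin([np.sum((x - mu) ** 2) for mu in means]))

def mahalanobis_classify(x, means, Sigma):
    """Case II: assign x to the nearest mean in Mahalanobis distance (shared Sigma)."""
    dists = [(x - mu) @ np.linalg.solve(Sigma, x - mu) for mu in means]
    return int(np.argmin(dists))
```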
Mahalanobis Distance

Covariance matrix:

\Sigma = \begin{bmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{bmatrix}

Points: A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)

[Figure: scatter plot of points A, B and C with the elongated equal-density contours implied by Σ]

Squared distances:
Euclid²(A, B) = 0.5
Euclid²(A, C) = 2
Mahal²(A, B) = 5
Mahal²(A, C) = 4

• Under the Euclidean metric A is closer to B, but under the Mahalanobis metric induced by this Σ, A is closer to C
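The numbers above can be verified with a few lines of NumPy (the helper names are mine):

```python
import numpy as np

Sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])
A, B, C = np.array([0.5, 0.5]), np.array([0.0, 1.0]), np.array([1.5, 1.5])

def sq_euclid(p, q):
    return float(np.sum((p - q) ** 2))

def sq_mahal(p, q, Sigma):
    d = p - q
    return float(d @ np.linalg.solve(Sigma, d))

print(sq_euclid(A, B), sq_euclid(A, C))              # 0.5  2.0
print(sq_mahal(A, B, Sigma), sq_mahal(A, C, Sigma))  # 5.0  4.0 (up to rounding)
```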
Discriminant Functions for
the Normal Density…
• Case Σi = arbitrary
• The covariance matrices are different for each category, and the discriminant is quadratic:

g_i(x) = x^t W_i x + w_i^t x + w_{i0}

where:

W_i = -\frac{1}{2} \Sigma_i^{-1}

w_i = \Sigma_i^{-1} \mu_i

w_{i0} = -\frac{1}{2} \mu_i^t \Sigma_i^{-1} \mu_i - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)
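A minimal sketch of the quadratic (arbitrary-Σi) case (the function names are mine):

```python
import numpy as np

def quadratic_params(mu_i, Sigma_i, prior_i):
    """W_i = -1/2 Sigma_i^{-1}, w_i = Sigma_i^{-1} mu_i,
    w_i0 = -1/2 mu_i^t Sigma_i^{-1} mu_i - 1/2 ln|Sigma_i| + ln P(omega_i)."""
    Sigma_inv = np.linalg.inv(Sigma_i)
    W_i = -0.5 * Sigma_inv
    w_i = Sigma_inv @ mu_i
    _, logdet = np.linalg.slogdet(Sigma_i)
    w_i0 = -0.5 * (mu_i @ w_i) - 0.5 * logdet + np.log(prior_i)
    return W_i, w_i, w_i0

def g_quadratic(x, W_i, w_i, w_i0):
    return x @ W_i @ x + w_i @ x + w_i0
```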

Discriminant Functions for
the Normal Density…

[Figure: with arbitrary Σi, the decision regions can be disconnected even when P(ω1) = P(ω2)]
Discriminant Functions for
the Normal Density

[Figures: decision regions for two-dimensional Gaussian data]
Error Probabilities and Integrals

• 2-class problem: there are two types of errors (x falls in R2 while the true state of nature is ω1, or x falls in R1 while the true state of nature is ω2):

P(error) = P(x \in R_2, \omega_1) + P(x \in R_1, \omega_2) = \int_{R_2} p(x|\omega_1) P(\omega_1)\, dx + \int_{R_1} p(x|\omega_2) P(\omega_2)\, dx

• Multi-class problem
– It is simpler to compute the probability of being correct (there are more ways to be
wrong than to be right):

P(correct) = \sum_{i=1}^{c} \int_{R_i} p(x|\omega_i) P(\omega_i)\, dx
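A minimal Monte Carlo sketch of the 2-class error probability (the 1D Gaussians, priors, and sample size are made-up values, chosen so the exact answer is known):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem: p(x|w1) = N(0, 1), p(x|w2) = N(2, 1), equal priors.
n = 200_000
labels = rng.integers(0, 2, size=n)                   # true state of nature
x = rng.normal(loc=np.where(labels == 0, 0.0, 2.0))   # draw x from its class-conditional

# With equal priors and equal variances the Bayes boundary is the midpoint x = 1.
decide = (x > 1.0).astype(int)
print("estimated P(error):", np.mean(decide != labels))  # ~0.159; exact value is Phi(-1)
```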
Error Bounds for Normal Densities

• The Bayes rule guarantees the lowest error rate, but it does not tell us what that error actually is
• In the 2-category case the error can be bounded analytically, giving an upper bound on the error
• However, the exact calculation of the error for the general Gaussian case (case 3) is extremely difficult
Chernoff Bound

• Inequality: for a, b ≥ 0 and 0 ≤ β ≤ 1,

\min[a, b] \le a^{\beta} b^{1-\beta}

• To derive a bound for the error, apply this inequality to the Bayes error integral:

P(error) = \int \min\left[ P(\omega_1) p(x|\omega_1),\; P(\omega_2) p(x|\omega_2) \right] dx \le P(\omega_1)^{\beta} P(\omega_2)^{1-\beta} \int p(x|\omega_1)^{\beta}\, p(x|\omega_2)^{1-\beta}\, dx

• Assuming the conditional probabilities are normal,

\int p(x|\omega_1)^{\beta}\, p(x|\omega_2)^{1-\beta}\, dx = e^{-k(\beta)}

where

k(\beta) = \frac{\beta(1-\beta)}{2} (\mu_2 - \mu_1)^t \left[ \beta\Sigma_1 + (1-\beta)\Sigma_2 \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln\frac{\left| \beta\Sigma_1 + (1-\beta)\Sigma_2 \right|}{|\Sigma_1|^{\beta}\, |\Sigma_2|^{1-\beta}}

• The Chernoff bound on P(error) is found by determining the value of β that minimizes exp(−k(β))

• This is a 1D optimization regardless of the dimensionality of the features
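A minimal sketch of this 1D optimization with SciPy (the example means, covariances, and priors are invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def k(beta, mu1, mu2, S1, S2):
    """k(beta) for two Gaussian class-conditional densities, as given above."""
    dm = mu2 - mu1
    S = beta * S1 + (1.0 - beta) * S2
    quad = 0.5 * beta * (1.0 - beta) * (dm @ np.linalg.solve(S, dm))
    logdet = lambda M: np.linalg.slogdet(M)[1]
    return quad + 0.5 * (logdet(S) - beta * logdet(S1) - (1.0 - beta) * logdet(S2))

def chernoff_bound(p1, p2, mu1, mu2, S1, S2):
    """Minimize P1^beta * P2^(1-beta) * exp(-k(beta)) over beta in [0, 1]."""
    obj = lambda b: p1 ** b * p2 ** (1.0 - b) * np.exp(-k(b, mu1, mu2, S1, S2))
    res = minimize_scalar(obj, bounds=(0.0, 1.0), method="bounded")
    return res.fun, res.x   # the bound and the minimizing beta

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
S1 = np.eye(2)
S2 = np.array([[2.0, 0.5], [0.5, 1.0]])
print(chernoff_bound(0.5, 0.5, mu1, mu2, S1, S2))
```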


Bhattacharyya Bound
• Bhattacharyya bound: assume β = 1/2
• Computationally simpler, but a slightly less tight bound:

P(error) \le \sqrt{P(\omega_1) P(\omega_2)}\; e^{-k(1/2)}

where

k(1/2) = \frac{1}{8} (\mu_2 - \mu_1)^t \left[ \frac{\Sigma_1 + \Sigma_2}{2} \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln\frac{\left| (\Sigma_1 + \Sigma_2)/2 \right|}{\sqrt{|\Sigma_1|\, |\Sigma_2|}}

• When the two covariance matrices are equal, the log term vanishes and k(1/2) equals one eighth of the squared Mahalanobis distance between the two means
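A companion sketch for the Bhattacharyya bound (self-contained, with the same invented example values as above):

```python
import numpy as np

def bhattacharyya_bound(p1, p2, mu1, mu2, S1, S2):
    """P(error) <= sqrt(P1 * P2) * exp(-k(1/2)), with k(1/2) as given above."""
    dm = mu2 - mu1
    S = 0.5 * (S1 + S2)
    quad = 0.125 * (dm @ np.linalg.solve(S, dm))
    logdet = lambda M: np.linalg.slogdet(M)[1]
    k_half = quad + 0.5 * (logdet(S) - 0.5 * logdet(S1) - 0.5 * logdet(S2))
    return np.sqrt(p1 * p2) * np.exp(-k_half)

print(bhattacharyya_bound(0.5, 0.5,
                          np.array([0.0, 0.0]), np.array([3.0, 1.0]),
                          np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])))
```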
Error Bounds for Gaussian Distributions
(example: 2-category, 2D data)

Best Chernoff error bound: 0.008190

Bhattacharyya error bound: 0.008191

True error (by numerical integration): 0.0021
Pattern Classification

All materials in these slides were taken from
Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000,
with the permission of the authors and the publisher.

