
Lectures 5 & 6: Classifiers

Hilary Term 2007 A. Zisserman

Bayesian Decision Theory


Bayes decision rule
Loss functions
Likelihood ratio test

Classifiers and Decision Surfaces


Discriminant function
Normal distributions

Linear Classifiers
The Perceptron
Logistic Regression

Decision Theory
Suppose we wish to make measurements on a medical image and
classify it as showing evidence of cancer or not

[Figure: image → image processing → measurement x → decision rule → C1 (cancer) or C2 (no cancer)]

and we want to base this decision on the learnt joint distribution

p(x, Ci ) = p(x|Ci)p(Ci )
How do we make the best decision?
Classification
Assign input vector x to one of two or more classes Ck

Any decision rule divides input space into decision regions separated
by decision boundaries

Example: two class decision depending on a 2D vector measurement

Also, would like a confidence measure (how sure are we that the
input belongs to the chosen category?)
Decision Boundary for average error
Consider a two class decision depending on a scalar variable x, with the decision boundary at x̂0.

[Figure: the joint densities p(x, C1) and p(x, C2) plotted against x; the boundary x̂0 divides the axis into regions R1 (decide C1) and R2 (decide C2)]

p(error) = ∫ p(error, x) dx
         = ∫_R1 p(x, C2) dx + ∫_R2 p(x, C1) dx
The number of misclassifications is minimized if the decision boundary is placed at x̂0, the point where the two joint densities are equal.

Bayes Decision rule


Assign x to the class Ci for which p(x, Ci) is largest

since p(x, Ci) = p(Ci|x) p(x) this is equivalent to


Assign x to the class Ci for which p( Ci | x ) is largest

Bayes error
A classifier is a mapping from a vector x
to class labels {C1, C2}
[Figure: the joint densities p(x, C1) and p(x, C2) against x, with the decision boundary x̂0 and regions R1, R2]

p(error) = ∫ p(error, x) dx
         = ∫_R1 p(x, C2) dx + ∫_R2 p(x, C1) dx
         = ∫_R1 p(C2|x) p(x) dx + ∫_R2 p(C1|x) p(x) dx

The Bayes error is the probability of misclassification
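As an illustration (not from the lecture), the Bayes error can be estimated numerically for two 1D Gaussian class-conditionals with equal priors; the parameters below are arbitrary choices for the sketch.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1D example (parameters chosen arbitrarily):
# p(x|C1) = N(0, 1), p(x|C2) = N(2, 1), equal priors P(C1) = P(C2) = 0.5
x = np.linspace(-6.0, 8.0, 10001)
dx = x[1] - x[0]
joint1 = 0.5 * norm.pdf(x, loc=0.0, scale=1.0)   # p(x, C1) = p(x|C1) P(C1)
joint2 = 0.5 * norm.pdf(x, loc=2.0, scale=1.0)   # p(x, C2) = p(x|C2) P(C2)

# The Bayes rule assigns x to the class with the larger joint density, so the
# Bayes error integrates the smaller of the two joint densities over all x.
bayes_error = np.sum(np.minimum(joint1, joint2)) * dx
print(f"Bayes error ~ {bayes_error:.4f}")   # about 0.159 for these parameters
```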


Example: Iris recognition

How Iris Recognition Works, John Daugman, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 1, January 2004
Posteriors
Assign x to the class Ci for which p( Ci | x ) is largest

[Figure: left, the class densities p(x|C1) and p(x|C2); right, the posterior probabilities p(C1|x) and p(C2|x), which sum to 1]

p(C1|x) + p(C2|x) = 1, so p(C2|x) = 1 − p(C1|x),
i.e. choose class i if p(Ci|x) > 0.5

Reject option
avoid making decisions if unsure

[Figure: the posterior probabilities p(C1|x) and p(C2|x) against x, with a reject region where neither posterior exceeds the threshold]

reject if the posterior probability p(Ci|x) < θ for every class, where θ is a chosen threshold
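A minimal sketch (not from the lecture) of the reject rule; the threshold value is an arbitrary assumption.

```python
def classify_with_reject(posteriors, theta=0.9):
    """posteriors: dict mapping class label -> p(C|x). Returns a label or 'reject'.
    theta is an assumed confidence threshold, not a value from the lecture."""
    label, p = max(posteriors.items(), key=lambda kv: kv[1])
    return label if p >= theta else "reject"

print(classify_with_reject({"C1": 0.95, "C2": 0.05}))  # -> C1
print(classify_with_reject({"C1": 0.55, "C2": 0.45}))  # -> reject
```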


Example: skin detection in video

Objective: label skin pixels (as a means to detect humans)


Two stages:
1. Training: learn likelihood for pixel colour, given skin and non-skin
pixels
2. Testing: classify a new image into skin regions

[Figure: a training image, the training skin pixel mask, and the resulting masked skin pixels]

Choice of colour space

- chromaticity color space: r=R/(R+G+B), g=G/(R+G+B)


- invariant to scaling of R, G, B, and only 2D, which is convenient for visualisation (see the conversion sketch below)
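A minimal sketch (the function name is assumed, not from the lecture) of converting RGB pixels to the chromaticity coordinates defined above:

```python
import numpy as np

def rgb_to_chromaticity(image):
    """image: H x W x 3 array of R, G, B values.
    Returns an H x W x 2 array of (r, g) chromaticity coordinates."""
    image = image.astype(np.float64)
    total = image.sum(axis=2, keepdims=True)
    total[total == 0] = 1.0          # avoid division by zero for black pixels
    r = image[..., 0:1] / total      # r = R / (R + G + B)
    g = image[..., 1:2] / total      # g = G / (R + G + B)
    return np.concatenate([r, g], axis=2)
```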

[Figure: the chromaticity colour space, with r = R/(R+G+B) on the horizontal axis and g = G/(R+G+B) on the vertical axis]
[Figure: the skin pixels plotted in chromaticity space (r, g)]

Represent likelihood as Normal Distribution



N(x | μ, Σ) = 1 / ( (2π)^{n/2} |Σ|^{1/2} )  exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
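A hedged sketch (function and variable names assumed) of fitting such a Gaussian to the chromaticity vectors of one class and evaluating the likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(samples):
    """samples: N x d array (e.g. d = 2 chromaticity vectors for one class).
    Returns the maximum-likelihood mean and covariance."""
    mu = samples.mean(axis=0)
    sigma = np.cov(samples, rowvar=False)
    return mu, sigma

# Usage sketch: skin_pixels is an assumed N x 2 array of (r, g) values.
# mu_s, sigma_s = fit_gaussian(skin_pixels)
# likelihood = multivariate_normal(mu_s, sigma_s).pdf(x)   # p(x | skin)
```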

[Figure: the Gaussian fitted to background pixels, p(x|background), and the Gaussian fitted to skin pixels, p(x|skin), each shown as a density over chromaticity space (r, g)]


[Figure: contours of the two Gaussians p(x|skin) and p(x|background) in chromaticity space, and a 3D view of the two Gaussians in which the vertical axis is the likelihood]

Posterior probability of skin given pixel colour


Posterior probability of skin is defined by Bayes' rule:

P(skin | x) = p(x | skin) P(skin) / p(x)

where

p(x) = p(x | skin) P(skin) + p(x | background) P(background)

i.e. the marginal pdf of x

Assume equal prior probabilities, i.e. the probability of a pixel being skin is 0.5.

NB: the posterior depends on both the foreground and background likelihoods, i.e. it involves both distributions.
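A minimal sketch (variable names assumed) of this posterior computation under equal priors, reusing the two fitted Gaussians:

```python
from scipy.stats import multivariate_normal

def skin_posterior(x, mu_s, sigma_s, mu_b, sigma_b, prior_skin=0.5):
    """x: (..., 2) array of chromaticity vectors. Returns P(skin|x) via Bayes' rule."""
    lik_skin = multivariate_normal(mu_s, sigma_s).pdf(x)          # p(x|skin)
    lik_bg = multivariate_normal(mu_b, sigma_b).pdf(x)            # p(x|background)
    evidence = lik_skin * prior_skin + lik_bg * (1 - prior_skin)  # p(x)
    return lik_skin * prior_skin / evidence

# Classify a pixel as skin if skin_posterior(...) > 0.5
```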
Assess performance on training image
[Figure: the input training image, the likelihood maps P(x|background) and P(x|skin), and the resulting posterior map P(skin|x)]

posterior depends on likelihoods (Gaussians) of both classes


[Figure: the posterior map P(skin|x) and the binary classification P(skin|x) > 0.5]
Test data

[Figure: a test frame with the corresponding p(x|background), p(x|skin), p(skin|x), and p(skin|x) > 0.5 maps]
Test performance on other frames

Receiver Operator Characteristic (ROC) Curve


In many algorithms there is a threshold that affects performance
[Figure: ROC curve, with false positives (0 to 1) on the horizontal axis and true positives (0 to 1) on the vertical axis; the curve is traced out as the threshold decreases, with an arrow indicating the direction of worse performance]

e.g.


true positive: skin pixel classified as skin
false positive: background pixel classified as skin
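A hedged sketch (not from the lecture) of tracing an ROC curve by sweeping the threshold on the posterior over ground-truth labels:

```python
import numpy as np

def roc_curve(posteriors, labels, n_thresholds=100):
    """posteriors: array of P(skin|x) per pixel; labels: 1 for skin, 0 for background.
    Returns arrays of (false positive rate, true positive rate), one per threshold."""
    thresholds = np.linspace(1.0, 0.0, n_thresholds)   # decreasing threshold
    fpr, tpr = [], []
    for t in thresholds:
        predicted = posteriors >= t
        tp = np.sum(predicted & (labels == 1))          # skin classified as skin
        fp = np.sum(predicted & (labels == 0))          # background classified as skin
        tpr.append(tp / max(np.sum(labels == 1), 1))
        fpr.append(fp / max(np.sum(labels == 0), 1))
    return np.array(fpr), np.array(tpr)
```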
Loss function revisited
Consider again the cancer diagnosis example. The consequences for an
incorrect classification vary for the following cases:
False positive: does not have cancer, but is classified as having it
→ distress, plus unnecessary further investigation

False negative: does have cancer, but is classified as not having it
→ no treatment, premature death

The two other cases are true positive and true negative.

Because the consequences of a false negative far outweigh the others, rather than simply minimizing the number of mistakes, a loss function is minimized instead.

Risk:   R(Ci | x) = Σ_j Lij p(Cj | x)
Loss matrix Lij (rows: classification, columns: truth):

                        truth
                   cancer    normal
  classify cancer     0         1
  classify normal   1000        0

  (top row: true +ve, false +ve; bottom row: false −ve, true −ve)

Bayes Risk
The class conditional risk of an action is

R(ai | x) = Σ_j L(ai | Cj) p(Cj | x)

where x is the measurement, ai the action taken, and L(ai | Cj) the loss incurred if action i is taken and the true state is j

Bayes decision rule: select the action for which R(ai | x) is minimum

Minimize the Bayes risk:   a* = arg min_{ai} R(ai | x)

This decision minimizes the expected loss
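A minimal sketch (loss values taken from the cancer example above, everything else assumed) of selecting the minimum-risk action from the posteriors:

```python
import numpy as np

# Loss matrix L[i, j]: loss for taking action i when the true class is j
# (rows/columns ordered as [cancer, normal], as in the example above).
L = np.array([[0.0, 1.0],
              [1000.0, 0.0]])

def min_risk_action(posteriors):
    """posteriors: array [p(cancer|x), p(normal|x)]. Returns the index of the best action."""
    risks = L @ posteriors          # R(a_i|x) = sum_j L[i, j] * p(C_j|x)
    return int(np.argmin(risks))

# Even a small posterior probability of cancer triggers the 'cancer' action:
print(min_risk_action(np.array([0.01, 0.99])))   # -> 0 (classify as cancer)
```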

Likelihood ratio
Two category classification with loss function

Conditional risk

R(a1|x) = L11p(C1|x) + L12p(C2|x)


R(a2|x) = L21p(C1|x) + L22p(C2|x)

Thus for minimum risk, decide C1 if

L11 p(C1|x) + L12 p(C2|x) < L21 p(C1|x) + L22 p(C2|x)

p(C2|x)(L12 − L22) < p(C1|x)(L21 − L11)

p(x|C2) p(C2)(L12 − L22) < p(x|C1) p(C1)(L21 − L11)     (using Bayes' rule)

Assuming L21 − L11 > 0, then decide C1 if

p(x|C1) / p(x|C2)  >  [ p(C2)(L12 − L22) ] / [ p(C1)(L21 − L11) ]
i.e. likelihood ratio exceeds a threshold that is independent of x
Discriminant functions
A two category classifier can often be written in the
form
g(x) > 0 : assign x to C1
g(x) < 0 : assign x to C2

where g(x) is a discriminant function, and g(x) = 0 is a discriminant surface.

[Figure: regions C1 and C2 in the plane separated by the curve g(x) = 0]

In 2D, g(x) = 0 is a set of curves.

Example
In the minimum average error classifier, the assignment rule is: decide C1 if the posterior p(C1|x) > p(C2|x).

The equivalent discriminant function is

g(x) = p(C1|x) − p(C2|x)


or

g(x) = ln [ p(C1|x) / p(C2|x) ]
Note, these two functions are not equal, but the decision boundaries are
the same.

Developing this further


g(x) = ln [ p(C1|x) / p(C2|x) ]
     = ln [ p(x|C1) / p(x|C2) ] + ln [ p(C1) / p(C2) ]

Decision surfaces for Normal distributions


Suppose that the likelihoods are Normal:

p(x|C1) ~ N(μ1, Σ1),   p(x|C2) ~ N(μ2, Σ2)

Then

g(x) = ln [ p(x|C1) / p(x|C2) ] + ln [ p(C1) / p(C2) ]
     = ln p(x|C1) − ln p(x|C2) + ln [ p(C1) / p(C2) ]
     = −½ (x − μ1)ᵀ Σ1⁻¹ (x − μ1) + ½ (x − μ2)ᵀ Σ2⁻¹ (x − μ2) + c0

where c0 = ln [ p(C1) / p(C2) ] − ½ ln|Σ1| + ½ ln|Σ2|.
Case 1: Σi = σ²I

g(x) ∝ −(x − μ1)ᵀ(x − μ1) + (x − μ2)ᵀ(x − μ2) + c,   where c = 2σ²c0

(rescaling g by 2σ² does not change the decision surface g(x) = 0)

Example in 2D

μ1 = (0, 0)ᵀ,   μ2 = (1, 0)ᵀ,   Σi = I

g(x) = −(x² + y²) + (x − 1)² + y² + c
     = −2x + c + 1

The boundary g(x) = 0 is the line x = (c + 1)/2; if the priors are equal then c = 0.

[Figure: the two means at (0, 0) and (1, 0) with the vertical decision line g(x) = 0 between them]

In nD the discriminant surface is a hyperplane:

(μ2 − μ1)ᵀ x = c′
Case 2: Σi = Σ (covariance matrices are equal)

The discriminant surface

g(x) = −½ (x − μ1)ᵀ Σ⁻¹ (x − μ1) + ½ (x − μ2)ᵀ Σ⁻¹ (x − μ2) + c0

is also a hyperplane. Why?

Case 3: Σi arbitrary

The discriminant surface

g(x) = −½ (x − μ1)ᵀ Σ1⁻¹ (x − μ1) + ½ (x − μ2)ᵀ Σ2⁻¹ (x − μ2) + c0

is a conic (2D) or quadric (nD).

e.g. in 3D the surface can be a hyperboloid, i.e. it need not be closed
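A hedged sketch (not from the lecture) of evaluating this Gaussian discriminant directly from fitted means, covariances, and priors, following the expressions above:

```python
import numpy as np

def gaussian_discriminant(x, mu1, sigma1, mu2, sigma2, prior1=0.5, prior2=0.5):
    """Returns g(x); g(x) > 0 assigns x to C1, g(x) < 0 to C2."""
    d1, d2 = x - mu1, x - mu2
    q1 = d1 @ np.linalg.solve(sigma1, d1)        # (x - mu1)^T Sigma1^{-1} (x - mu1)
    q2 = d2 @ np.linalg.solve(sigma2, d2)
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    c0 = np.log(prior1 / prior2) - 0.5 * logdet1 + 0.5 * logdet2
    return -0.5 * q1 + 0.5 * q2 + c0
```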


Discriminative Methods
So far, we have carried out the following steps in order to
compute a discriminant surface:
1. Measure feature vectors (e.g. in 2D for skin colour) for each class
from training data
2. Learn likelihood pdfs for each class (and priors)
3. Represent likelihoods by fitting Gaussians
4. Compute the posteriors p(Ci | x )
5. Compute the discriminant surface (from the likelihood Gaussians)
6. In 2D the curve is a conic
[Figure: 2D feature space (x1, x2) with two classes of points and a conic discriminant curve]

Why not fit the discriminant curve to the data directly?

Linear classifiers

A linear discriminant has the form


g(x) = wᵀx + w0

[Figure: the line g(x) = 0 separating two classes in the (x1, x2) plane]

in 2D a linear discriminant is a line, in nD it is a hyperplane

w is the normal to the plane, and w0 the bias; w is known as the weight vector
Linear separability

[Figure: example point sets that are linearly separable and not linearly separable]

Learning separating hyperplanes

Given linearly separable data xi labelled into two categories yi = {0,1} ,


find a weight vector w such that the discriminant function

g(xi) = wᵀxi + w0

separates the categories for i = 1, …, n

how can we find this separating hyperplane ?

The Perceptron Algorithm


Initialize w = 0
Cycle through the data points { xi, yi }
  if xi is misclassified then w ← w − sign(g(xi)) xi
Until all the data is correctly classified
For example in 2D

[Figure: the weight vector w before and after an update caused by a misclassified point xi]

NB: after convergence, w = Σ_i αi xi
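A minimal sketch (not from the lecture) of the perceptron loop, with labels yi ∈ {0, 1} as above and the bias w0 absorbed by augmenting each point with a 1:

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """X: n x d array of points, y: labels in {0, 1}.
    Returns w of length d+1 (last entry is the bias w0), assuming separable data."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])   # augment with 1 for the bias
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xa, y):
            predicted = 1 if w @ xi > 0 else 0       # C1 if g(x) > 0
            if predicted != yi:
                w = w + xi if yi == 1 else w - xi    # move g(xi) toward the correct sign
                mistakes += 1
        if mistakes == 0:                            # converged: all points classified
            break
    return w
```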

Perceptron example

[Figure: 2D training points from two classes and the separating line found by the perceptron]

if the data is linearly separable, then the algorithm will converge


convergence can be slow
separating line close to training data
we would prefer a larger margin for generalization
wider margin classifier

[Figure: the same data separated by a wider margin classifier, whose line lies further from the training points of both classes]

how to achieve a wider margin automatically in high dimensions ?

Logistic Regression
ideally we would like to fit a discriminant function using regression
methods similar to those developed for ML and MAP parameter estimation
but there is not the equivalent of model + noise here, since we wish to map all the spread-out features in the same class to one label

the solution is to transform the output range so that (−∞, +∞) → (0, 1)
The logistic function or sigmoid function
σ(z) = 1 / (1 + e^{−z})

[Figure: the sigmoid σ(z) for z from −20 to 20, rising from 0 to 1 and passing through 0.5 at z = 0]

g(x) = wᵀx + w0

Notation: write the equation more compactly as g(x) = wᵀx by appending a 1 to x (so the bias w0 becomes the last component of w), e.g. in 2D

g(x) = ( w1  w2  w0 ) ( x1, x2, 1 )ᵀ

In logistic regression, fit a sigmoid function

σ(wᵀx) = 1 / (1 + e^{−wᵀx})

to the data { xi, yi } by minimizing the classification errors yi − σ(wᵀxi)

[Figure: the fitted sigmoid plotted over the 1D data, with the two classes at y = 0 and y = 1]
Maximum Likelihood Estimation
Assume

p(y = 1 | x; w) = σ(wᵀx)

p(y = 0 | x; w) = 1 − σ(wᵀx)

write this more compactly as

p(y | x; w) = σ(wᵀx)^y ( 1 − σ(wᵀx) )^{1−y}

Then the likelihood (assuming independence) is


p(y | x; w) = ∏_{i=1}^{n} σ(wᵀxi)^{yi} ( 1 − σ(wᵀxi) )^{1−yi}

and the negative log likelihood is

L(w) = − Σ_{i=1}^{n} [ yi log σ(wᵀxi) + (1 − yi) log( 1 − σ(wᵀxi) ) ]

Minimize L(w) using gradient descent [exercise]

∂L(w)/∂wj = Σ_i ( σ(wᵀxi) − yi ) (xi)j
which gives the update rule

w ← w − ( σ(wᵀxi) − yi ) xi

Note
this is similar, but not identical, to the perceptron update rule.
there is a unique solution for w
in n-dimensions it is only necessary to learn n+1 parameters. Compare
this with learning normal distributions where learning involves 2n
parameters for the means and n(n+1)/2 for a common covariance matrix
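A hedged sketch (not from the lecture) of batch gradient descent on the negative log likelihood above; the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """X: n x d array (append a column of ones beforehand to include the bias w0).
    y: labels in {0, 1}. Returns the weight vector w."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)   # dL/dw = sum_i (sigma(w.x_i) - y_i) x_i
        w -= lr * grad
    return w

# p(y = 1 | x; w) for a new (augmented) point x:
# p = sigmoid(w @ x)
```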
Application: hand written digit recognition

Feature vectors: each image is 28 x 28 pixels; rearrange it as a 784-vector

Training: learn a set of two-class linear classifiers using logistic regression, e.g. 1 against the rest, or (0-4) vs (5-9) etc (a one-vs-rest sketch follows below)

An alternative is to learn a multi-class classifier, e.g. using k-nearest neighbours
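A minimal one-vs-rest sketch, assuming the `fit_logistic` and `sigmoid` helpers from the previous sketch and leaving data loading aside:

```python
import numpy as np

def train_one_vs_rest(X, labels, n_classes=10):
    """X: n x 785 array of flattened 28x28 images (augmented with a 1).
    labels: digit labels 0..9. Returns one weight vector per digit."""
    return [fit_logistic(X, (labels == k).astype(float)) for k in range(n_classes)]

def classify_digit(x, weights):
    """Pick the digit whose classifier gives the largest posterior sigma(w.x)."""
    scores = [sigmoid(w @ x) for w in weights]
    return int(np.argmax(scores))
```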

Example

[Figure: hand-drawn digits 1 2 3 4 5 6 7 8 9 0 and the classification assigned to each by the learnt classifiers]
Comparison of discriminant and generative approaches

Discriminant
+ don't have to learn parameters which aren't used (e.g. covariance)
+ easy to learn
- no confidence measure
- have to retrain if dimension of feature vectors changed
Generative
+ have confidence measure
+ can use reject option
+ easy to add independent measurements
p(Ck | xA, xB) ∝ p(xA, xB | Ck) p(Ck)
             ∝ p(xA | Ck) p(xB | Ck) p(Ck)
             ∝ p(Ck | xA) p(Ck | xB) / p(Ck)
- expensive to train (because many parameters)

Perceptrons (1969)
Recent progress in Machine Learning
Perceptron (non-examinable)
g(x) = wᵀx where w = Σ_{i=1}^{n} αi xi, so

g(x) = Σ_i αi xiᵀ x

Generalize to

g(x) = Σ_i αi φ(xi)ᵀ φ(x)

where φ(x) is a map from x to a higher dimension.

For example, for x = (x1, x2)ᵀ

φ(x) = ( x1², x2², √2 x1 x2 )ᵀ

Example

φ(x1, x2) = ( x1², x2², √2 x1 x2 )ᵀ

[Figure: the 2D data mapped into 3D with axes X = x1², Y = x2², Z = √2 x1 x2]

The data is linearly separable in 3D

This means that the problem can still be solved by a linear classifier
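A minimal sketch (not from the lecture) of this feature map, which can be checked numerically against the polynomial kernel introduced in the next section:

```python
import numpy as np

def phi(x):
    """Map a 2D point (x1, x2) to (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

# For any x, z the inner product in the lifted space equals the squared dot product:
x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z), (x @ z) ** 2)   # both give 1.0
```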
Example

Kernels
Generalize further to
g(x) = Σ_i αi K(xi, x)

where K(x, z) is a (non-linear) kernel function. For example

K(x, z) = exp{ −‖x − z‖² / (2σ²) }

is a radial basis function kernel, and

K(x, z) = (x·z)^n

is a polynomial kernel.
Exercise
If n = 2 show that

K(x, z) = (x·z)² = φ(x)ᵀ φ(z)

Você também pode gostar