
Lectures 5 & 6: Classifiers

Hilary Term 2007 A. Zisserman

Bayesian Decision Theory


Bayes decision rule
Loss functions
Likelihood ratio test

Classifiers and Decision Surfaces


Discriminant function
Normal distributions

Linear Classifiers
The Perceptron
Logistic Regression

Decision Theory
Suppose we wish to make measurements on a medical image and
classify it as showing evidence of cancer or not

[Figure: image → image processing → measurement x → decision rule → C1 (cancer) or C2 (no cancer)]

and we want to base this decision on the learnt joint distribution

p(x, Ci ) = p(x|Ci)p(Ci )
How do we make the best decision?
Classification
Assign input vector x to one of two or more classes Ck

Any decision rule divides input space into decision regions separated
by decision boundaries

Example: two class decision depending on a 2D vector measurement

Also, would like a confidence measure (how sure are we that the
input belongs to the chosen category?)
Decision Boundary for average error
Consider a two class decision depending on a scalar variable x, with the decision boundary at x̂0.

[Figure: the joint densities p(x, C1) and p(x, C2) plotted against x; the boundary x̂0 divides the axis into regions R1 (decide C1) and R2 (decide C2)]

p(error) = ∫ p(error, x) dx
         = ∫_R1 p(x, C2) dx + ∫_R2 p(x, C1) dx
The number of misclassifications is minimized if the decision boundary is placed at x̂0, the point where the two joint densities are equal.

Bayes Decision rule


Assign x to the class Ci for which p(x, Ci) is largest

since p(x, Ci) = p(Ci|x) p(x) this is equivalent to


Assign x to the class Ci for which p( Ci | x ) is largest

Bayes error
A classifier is a mapping from a vector x
to class labels {C1, C2}
[Figure: the joint densities p(x, C1) and p(x, C2) against x, with the decision boundary x̂0 and regions R1, R2]

p(error) = ∫ p(error, x) dx
         = ∫_R1 p(x, C2) dx + ∫_R2 p(x, C1) dx
         = ∫_R1 p(C2|x) p(x) dx + ∫_R2 p(C1|x) p(x) dx

The Bayes error is the probability of misclassification
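As an illustration (not from the lecture), the Bayes error can be estimated numerically for two 1D Gaussian class-conditionals with equal priors; the parameters below are arbitrary choices for the sketch.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1D example (parameters chosen arbitrarily):
# p(x|C1) = N(0, 1), p(x|C2) = N(2, 1), equal priors P(C1) = P(C2) = 0.5
x = np.linspace(-6.0, 8.0, 10001)
dx = x[1] - x[0]
joint1 = 0.5 * norm.pdf(x, loc=0.0, scale=1.0)   # p(x, C1) = p(x|C1) P(C1)
joint2 = 0.5 * norm.pdf(x, loc=2.0, scale=1.0)   # p(x, C2) = p(x|C2) P(C2)

# The Bayes rule assigns x to the class with the larger joint density, so the
# Bayes error integrates the smaller of the two joint densities over all x.
bayes_error = np.sum(np.minimum(joint1, joint2)) * dx
print(f"Bayes error ~ {bayes_error:.4f}")   # about 0.159 for these parameters
```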


Example: Iris recognition

How Iris Recognition Works, John Daugman, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 1, January 2004
Posteriors
Assign x to the class Ci for which p( Ci | x ) is largest

[Figure: left, the class densities p(x|C1) and p(x|C2); right, the posterior probabilities p(C1|x) and p(C2|x), which sum to 1]

p(C1|x) + p(C2|x) = 1, so p(C2|x) = 1 − p(C1|x),
i.e. choose class i if p(Ci|x) > 0.5

Reject option
avoid making decisions if unsure

[Figure: the posterior probabilities p(C1|x) and p(C2|x) against x, with a reject region where neither posterior exceeds the threshold]

reject if the posterior probability p(Ci|x) < θ for every class, where θ is a chosen threshold
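A minimal sketch (not from the lecture) of the reject rule; the threshold value is an arbitrary assumption.

```python
def classify_with_reject(posteriors, theta=0.9):
    """posteriors: dict mapping class label -> p(C|x). Returns a label or 'reject'.
    theta is an assumed confidence threshold, not a value from the lecture."""
    label, p = max(posteriors.items(), key=lambda kv: kv[1])
    return label if p >= theta else "reject"

print(classify_with_reject({"C1": 0.95, "C2": 0.05}))  # -> C1
print(classify_with_reject({"C1": 0.55, "C2": 0.45}))  # -> reject
```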


Example: skin detection in video

Objective: label skin pixels (as a means to detect humans)


Two stages:
1. Training: learn likelihood for pixel colour, given skin and non-skin
pixels
2. Testing: classify a new image into skin regions

[Figure: a training image, the training skin pixel mask, and the resulting masked skin pixels]

Choice of colour space

- chromaticity color space: r=R/(R+G+B), g=G/(R+G+B)


- invariant to scaling of R, G, B, and only 2D, which is convenient for visualisation (see the conversion sketch below)
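A minimal sketch (the function name is assumed, not from the lecture) of converting RGB pixels to the chromaticity coordinates defined above:

```python
import numpy as np

def rgb_to_chromaticity(image):
    """image: H x W x 3 array of R, G, B values.
    Returns an H x W x 2 array of (r, g) chromaticity coordinates."""
    image = image.astype(np.float64)
    total = image.sum(axis=2, keepdims=True)
    total[total == 0] = 1.0          # avoid division by zero for black pixels
    r = image[..., 0:1] / total      # r = R / (R + G + B)
    g = image[..., 1:2] / total      # g = G / (R + G + B)
    return np.concatenate([r, g], axis=2)
```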

[Figure: the chromaticity colour space, with r = R/(R+G+B) on the horizontal axis and g = G/(R+G+B) on the vertical axis]
[Figure: the skin pixels plotted in chromaticity space (r, g)]

Represent likelihood as Normal Distribution



N(x | μ, Σ) = 1 / ( (2π)^{n/2} |Σ|^{1/2} )  exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
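A hedged sketch (function and variable names assumed) of fitting such a Gaussian to the chromaticity vectors of one class and evaluating the likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(samples):
    """samples: N x d array (e.g. d = 2 chromaticity vectors for one class).
    Returns the maximum-likelihood mean and covariance."""
    mu = samples.mean(axis=0)
    sigma = np.cov(samples, rowvar=False)
    return mu, sigma

# Usage sketch: skin_pixels is an assumed N x 2 array of (r, g) values.
# mu_s, sigma_s = fit_gaussian(skin_pixels)
# likelihood = multivariate_normal(mu_s, sigma_s).pdf(x)   # p(x | skin)
```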

[Figure: the Gaussian fitted to background pixels, p(x|background), and the Gaussian fitted to skin pixels, p(x|skin), each shown as a density over chromaticity space (r, g)]


[Figure: contours of the two Gaussians p(x|skin) and p(x|background) in chromaticity space, and a 3D view of the two Gaussians in which the vertical axis is the likelihood]

Posterior probability of skin given pixel colour


Posterior probability of skin is defined by Bayes' rule:

P(skin | x) = p(x | skin) P(skin) / p(x)

where

p(x) = p(x | skin) P(skin) + p(x | background) P(background)

i.e. the marginal pdf of x

Assume equal prior probabilities, i.e. the probability of a pixel being skin is 0.5.

NB: the posterior depends on both the foreground and background likelihoods, i.e. it involves both distributions.
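A minimal sketch (variable names assumed) of this posterior computation under equal priors, reusing the two fitted Gaussians:

```python
from scipy.stats import multivariate_normal

def skin_posterior(x, mu_s, sigma_s, mu_b, sigma_b, prior_skin=0.5):
    """x: (..., 2) array of chromaticity vectors. Returns P(skin|x) via Bayes' rule."""
    lik_skin = multivariate_normal(mu_s, sigma_s).pdf(x)          # p(x|skin)
    lik_bg = multivariate_normal(mu_b, sigma_b).pdf(x)            # p(x|background)
    evidence = lik_skin * prior_skin + lik_bg * (1 - prior_skin)  # p(x)
    return lik_skin * prior_skin / evidence

# Classify a pixel as skin if skin_posterior(...) > 0.5
```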
Assess performance on training image
[Figure: the input training image, the likelihood maps P(x|background) and P(x|skin), and the resulting posterior map P(skin|x)]

posterior depends on likelihoods (Gaussians) of both classes


[Figure: the posterior map P(skin|x) and the binary classification P(skin|x) > 0.5]
Test data

[Figure: a test frame with the corresponding p(x|background), p(x|skin), p(skin|x), and p(skin|x) > 0.5 maps]
Test performance on other frames

Receiver Operator Characteristic (ROC) Curve


In many algorithms there is a threshold that affects performance
[Figure: ROC curve, with false positives (0 to 1) on the horizontal axis and true positives (0 to 1) on the vertical axis; the curve is traced out as the threshold decreases, with an arrow indicating the direction of worse performance]

e.g.


true positive: skin pixel classified as skin
false positive: background pixel classified as skin
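A hedged sketch (not from the lecture) of tracing an ROC curve by sweeping the threshold on the posterior over ground-truth labels:

```python
import numpy as np

def roc_curve(posteriors, labels, n_thresholds=100):
    """posteriors: array of P(skin|x) per pixel; labels: 1 for skin, 0 for background.
    Returns arrays of (false positive rate, true positive rate), one per threshold."""
    thresholds = np.linspace(1.0, 0.0, n_thresholds)   # decreasing threshold
    fpr, tpr = [], []
    for t in thresholds:
        predicted = posteriors >= t
        tp = np.sum(predicted & (labels == 1))          # skin classified as skin
        fp = np.sum(predicted & (labels == 0))          # background classified as skin
        tpr.append(tp / max(np.sum(labels == 1), 1))
        fpr.append(fp / max(np.sum(labels == 0), 1))
    return np.array(fpr), np.array(tpr)
```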
Loss function revisited
Consider again the cancer diagnosis example. The consequences for an
incorrect classification vary for the following cases:
False positive: does not have cancer, but is classified as having it
→ distress, plus unnecessary further investigation

False negative: does have cancer, but is classified as not having it
→ no treatment, premature death

The two other cases are true positive and true negative.

Because the consequences of a false negative far outweigh the others, rather than simply minimizing the number of mistakes, a loss function is minimized instead.

Risk:   R(Ci | x) = Σ_j Lij p(Cj | x)
Loss matrix Lij (rows: classification, columns: truth):

                        truth
                   cancer    normal
  classify cancer     0         1
  classify normal   1000        0

  (top row: true +ve, false +ve; bottom row: false −ve, true −ve)

Bayes Risk
The class conditional risk of an action is

R(ai | x) = Σ_j L(ai | Cj) p(Cj | x)

where x is the measurement, ai the action taken, and L(ai | Cj) the loss incurred if action i is taken and the true state is j

Bayes decision rule: select the action for which R(ai | x) is minimum

Minimize the Bayes risk:   a* = arg min_{ai} R(ai | x)

This decision minimizes the expected loss
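A minimal sketch (loss values taken from the cancer example above, everything else assumed) of selecting the minimum-risk action from the posteriors:

```python
import numpy as np

# Loss matrix L[i, j]: loss for taking action i when the true class is j
# (rows/columns ordered as [cancer, normal], as in the example above).
L = np.array([[0.0, 1.0],
              [1000.0, 0.0]])

def min_risk_action(posteriors):
    """posteriors: array [p(cancer|x), p(normal|x)]. Returns the index of the best action."""
    risks = L @ posteriors          # R(a_i|x) = sum_j L[i, j] * p(C_j|x)
    return int(np.argmin(risks))

# Even a small posterior probability of cancer triggers the 'cancer' action:
print(min_risk_action(np.array([0.01, 0.99])))   # -> 0 (classify as cancer)
```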

Likelihood ratio
Two category classification with loss function

Conditional risk

R(a1|x) = L11p(C1|x) + L12p(C2|x)


R(a2|x) = L21p(C1|x) + L22p(C2|x)

Thus for minimum risk, decide C1 if

L11 p(C1|x) + L12 p(C2|x) < L21 p(C1|x) + L22 p(C2|x)

p(C2|x)(L12 − L22) < p(C1|x)(L21 − L11)

p(x|C2) p(C2)(L12 − L22) < p(x|C1) p(C1)(L21 − L11)     (using Bayes' rule)

Assuming L21 − L11 > 0, then decide C1 if

p(x|C1) / p(x|C2)  >  [ p(C2)(L12 − L22) ] / [ p(C1)(L21 − L11) ]
i.e. likelihood ratio exceeds a threshold that is independent of x
Discriminant functions
A two category classifier can often be written in the
form
g(x) > 0 : assign x to C1
g(x) < 0 : assign x to C2

where g(x) is a discriminant function, and g(x) = 0 is a discriminant surface.

[Figure: regions C1 and C2 in the plane separated by the curve g(x) = 0]

In 2D, g(x) = 0 is a set of curves.

Example
In the minimum average error classifier, the assignment rule is: decide C1 if the posterior p(C1|x) > p(C2|x).

The equivalent discriminant function is

g(x) = p(C1|x) − p(C2|x)


or

g(x) = ln [ p(C1|x) / p(C2|x) ]
Note, these two functions are not equal, but the decision boundaries are
the same.

Developing this further


g(x) = ln [ p(C1|x) / p(C2|x) ]
     = ln [ p(x|C1) / p(x|C2) ] + ln [ p(C1) / p(C2) ]

Decision surfaces for Normal distributions


Suppose that the likelihoods are Normal:

p(x|C1) ~ N(μ1, Σ1),   p(x|C2) ~ N(μ2, Σ2)

Then

g(x) = ln [ p(x|C1) / p(x|C2) ] + ln [ p(C1) / p(C2) ]
     = ln p(x|C1) − ln p(x|C2) + ln [ p(C1) / p(C2) ]
     = −½ (x − μ1)ᵀ Σ1⁻¹ (x − μ1) + ½ (x − μ2)ᵀ Σ2⁻¹ (x − μ2) + c0

where c0 = ln [ p(C1) / p(C2) ] − ½ ln|Σ1| + ½ ln|Σ2|.
Case 1: Σi = σ²I

g(x) ∝ −(x − μ1)ᵀ(x − μ1) + (x − μ2)ᵀ(x − μ2) + c,   where c = 2σ²c0

(rescaling g by 2σ² does not change the decision surface g(x) = 0)

Example in 2D

μ1 = (0, 0)ᵀ,   μ2 = (1, 0)ᵀ,   Σi = I

g(x) = −(x² + y²) + (x − 1)² + y² + c
     = −2x + c + 1

The boundary g(x) = 0 is the line x = (c + 1)/2; if the priors are equal then c = 0.

[Figure: the two means at (0, 0) and (1, 0) with the vertical decision line g(x) = 0 between them]

In nD the discriminant surface is a hyperplane:

(μ2 − μ1)ᵀ x = c′
Case 2: Σi = Σ (covariance matrices are equal)

The discriminant surface

g(x) = −½ (x − μ1)ᵀ Σ⁻¹ (x − μ1) + ½ (x − μ2)ᵀ Σ⁻¹ (x − μ2) + c0

is also a hyperplane. Why?

Case 3: Σi arbitrary

The discriminant surface

g(x) = −½ (x − μ1)ᵀ Σ1⁻¹ (x − μ1) + ½ (x − μ2)ᵀ Σ2⁻¹ (x − μ2) + c0

is a conic (2D) or quadric (nD).

e.g. in 3D the surface can be a hyperboloid, i.e. it need not be closed
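A hedged sketch (not from the lecture) of evaluating this Gaussian discriminant directly from fitted means, covariances, and priors, following the expressions above:

```python
import numpy as np

def gaussian_discriminant(x, mu1, sigma1, mu2, sigma2, prior1=0.5, prior2=0.5):
    """Returns g(x); g(x) > 0 assigns x to C1, g(x) < 0 to C2."""
    d1, d2 = x - mu1, x - mu2
    q1 = d1 @ np.linalg.solve(sigma1, d1)        # (x - mu1)^T Sigma1^{-1} (x - mu1)
    q2 = d2 @ np.linalg.solve(sigma2, d2)
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    c0 = np.log(prior1 / prior2) - 0.5 * logdet1 + 0.5 * logdet2
    return -0.5 * q1 + 0.5 * q2 + c0
```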


Discriminative Methods
So far, we have carried out the following steps in order to
compute a discriminant surface:
1. Measure feature vectors (e.g. in 2D for skin colour) for each class
from training data
2. Learn likelihood pdfs for each class (and priors)
3. Represent likelihoods by fitting Gaussians
4. Compute the posteriors p(Ci | x )
5. Compute the discriminant surface (from the likelihood Gaussians)
6. In 2D the curve is a conic
[Figure: 2D feature space (x1, x2) with two classes of points and a conic discriminant curve]

Why not fit the discriminant curve to the data directly?

Linear classifiers

A linear discriminant has the form


g(x) = wᵀx + w0

[Figure: the line g(x) = 0 separating two classes in the (x1, x2) plane]

in 2D a linear discriminant is a line, in nD it is a hyperplane

w is the normal to the plane, and w0 the bias; w is known as the weight vector
Linear separability

[Figure: example point sets that are linearly separable and not linearly separable]

Learning separating hyperplanes

Given linearly separable data xi labelled into two categories yi = {0,1} ,


find a weight vector w such that the discriminant function

g(xi) = wᵀxi + w0

separates the categories for i = 1, …, n

how can we find this separating hyperplane ?

The Perceptron Algorithm


Initialize w = 0
Cycle through the data points { xi, yi }
  if xi is misclassified then w ← w − sign(g(xi)) xi
Until all the data is correctly classified
For example in 2D

[Figure: the weight vector w before and after an update caused by a misclassified point xi]

NB: after convergence, w = Σ_i αi xi
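A minimal sketch (not from the lecture) of the perceptron loop, with labels yi ∈ {0, 1} as above and the bias w0 absorbed by augmenting each point with a 1:

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """X: n x d array of points, y: labels in {0, 1}.
    Returns w of length d+1 (last entry is the bias w0), assuming separable data."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])   # augment with 1 for the bias
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xa, y):
            predicted = 1 if w @ xi > 0 else 0       # C1 if g(x) > 0
            if predicted != yi:
                w = w + xi if yi == 1 else w - xi    # move g(xi) toward the correct sign
                mistakes += 1
        if mistakes == 0:                            # converged: all points classified
            break
    return w
```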

Perceptron example

[Figure: 2D training points from two classes and the separating line found by the perceptron]

if the data is linearly separable, then the algorithm will converge


convergence can be slow
separating line close to training data
we would prefer a larger margin for generalization
wider margin classifier

[Figure: the same data separated by a wider margin classifier, whose line lies further from the training points of both classes]

how to achieve a wider margin automatically in high dimensions ?

Logistic Regression
ideally we would like to fit a discriminant function using regression
methods similar to those developed for ML and MAP parameter estimation
but there is not the equivalent of model + noise here, since we wish to map all the spread-out features in the same class to one label

the solution is to transform the output range so that (−∞, +∞) → (0, 1)
The logistic function or sigmoid function
σ(z) = 1 / (1 + e^{−z})

[Figure: the sigmoid σ(z) for z from −20 to 20, rising from 0 to 1 and passing through 0.5 at z = 0]

g(x) = wᵀx + w0

Notation: write the equation more compactly as g(x) = wᵀx by appending a 1 to x (so the bias w0 becomes the last component of w), e.g. in 2D

g(x) = ( w1  w2  w0 ) ( x1, x2, 1 )ᵀ

In logistic regression, fit a sigmoid function

σ(wᵀx) = 1 / (1 + e^{−wᵀx})

to the data { xi, yi } by minimizing the classification errors yi − σ(wᵀxi)

[Figure: the fitted sigmoid plotted over the 1D data, with the two classes at y = 0 and y = 1]
Maximum Likelihood Estimation
Assume

p(y = 1 | x; w) = σ(wᵀx)

p(y = 0 | x; w) = 1 − σ(wᵀx)

write this more compactly as

p(y | x; w) = σ(wᵀx)^y ( 1 − σ(wᵀx) )^{1−y}

Then the likelihood (assuming independence) is


p(y | x; w) = ∏_{i=1}^{n} σ(wᵀxi)^{yi} ( 1 − σ(wᵀxi) )^{1−yi}

and the negative log likelihood is

L(w) = − Σ_{i=1}^{n} [ yi log σ(wᵀxi) + (1 − yi) log( 1 − σ(wᵀxi) ) ]

Minimize L(w) using gradient descent [exercise]

∂L(w)/∂wj = Σ_i ( σ(wᵀxi) − yi ) (xi)j
which gives the update rule

w ← w − ( σ(wᵀxi) − yi ) xi

Note
this is similar, but not identical, to the perceptron update rule.
there is a unique solution for w
in n-dimensions it is only necessary to learn n+1 parameters. Compare
this with learning normal distributions where learning involves 2n
parameters for the means and n(n+1)/2 for a common covariance matrix
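A hedged sketch (not from the lecture) of batch gradient descent on the negative log likelihood above; the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """X: n x d array (append a column of ones beforehand to include the bias w0).
    y: labels in {0, 1}. Returns the weight vector w."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)   # dL/dw = sum_i (sigma(w.x_i) - y_i) x_i
        w -= lr * grad
    return w

# p(y = 1 | x; w) for a new (augmented) point x:
# p = sigmoid(w @ x)
```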
Application: hand written digit recognition

Feature vectors: each image is 28 x 28 pixels; rearrange it as a 784-vector

Training: learn a set of two-class linear classifiers using logistic regression, e.g. 1 against the rest, or (0-4) vs (5-9) etc (a one-vs-rest sketch follows below)

An alternative is to learn a multi-class classifier, e.g. using k-nearest neighbours
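A minimal one-vs-rest sketch, assuming the `fit_logistic` and `sigmoid` helpers from the previous sketch and leaving data loading aside:

```python
import numpy as np

def train_one_vs_rest(X, labels, n_classes=10):
    """X: n x 785 array of flattened 28x28 images (augmented with a 1).
    labels: digit labels 0..9. Returns one weight vector per digit."""
    return [fit_logistic(X, (labels == k).astype(float)) for k in range(n_classes)]

def classify_digit(x, weights):
    """Pick the digit whose classifier gives the largest posterior sigma(w.x)."""
    scores = [sigmoid(w @ x) for w in weights]
    return int(np.argmax(scores))
```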

Example

[Figure: hand-drawn digits 1 2 3 4 5 6 7 8 9 0 and the classification assigned to each by the learnt classifiers]
Comparison of discriminant and generative approaches

Discriminant
+ don't have to learn parameters which aren't used (e.g. covariance)
+ easy to learn
- no confidence measure
- have to retrain if dimension of feature vectors changed
Generative
+ have confidence measure
+ can use reject option
+ easy to add independent measurements
p(Ck | xA, xB) ∝ p(xA, xB | Ck) p(Ck)
             ∝ p(xA | Ck) p(xB | Ck) p(Ck)
             ∝ p(Ck | xA) p(Ck | xB) / p(Ck)
- expensive to train (because many parameters)

Perceptrons (1969)
Recent progress in Machine Learning
Perceptron (non-examinable)
g(x) = wᵀx where w = Σ_{i=1}^{n} αi xi, so

g(x) = Σ_i αi xiᵀ x

Generalize to

g(x) = Σ_i αi φ(xi)ᵀ φ(x)

where φ(x) is a map from x to a higher dimension.

For example, for x = (x1, x2)ᵀ

φ(x) = ( x1², x2², √2 x1 x2 )ᵀ

Example

φ(x1, x2) = ( x1², x2², √2 x1 x2 )ᵀ

[Figure: the 2D data mapped into 3D with axes X = x1², Y = x2², Z = √2 x1 x2]

The data is linearly separable in 3D

This means that the problem can still be solved by a linear classifier
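A minimal sketch (not from the lecture) of this feature map, which can be checked numerically against the polynomial kernel introduced in the next section:

```python
import numpy as np

def phi(x):
    """Map a 2D point (x1, x2) to (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

# For any x, z the inner product in the lifted space equals the squared dot product:
x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z), (x @ z) ** 2)   # both give 1.0
```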
Example

Kernels
Generalize further to
g(x) = Σ_i αi K(xi, x)

where K(x, z) is a (non-linear) kernel function. For example

K(x, z) = exp{ −‖x − z‖² / (2σ²) }

is a radial basis function kernel, and

K(x, z) = (x·z)^n

is a polynomial kernel.
Exercise
If n = 2 show that

K(x, z) = (x·z)² = φ(x)ᵀ φ(z)

Você também pode gostar