
Faculty of Science
Linear Classification
Statistical Methods for Machine Learning
Christian Igel
Department of Computer Science
Recall I: Gaussian Distribution
Gaussian distribution of a single real-valued variable $x$ with mean $\mu \in \mathbb{R}$ and variance $\sigma^2$:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)$$

Multivariate Gaussian distribution of a $d$-dimensional real-valued random vector $x$ with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
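A minimal numerical sketch of the two densities above (not part of the original slides; it assumes NumPy and SciPy are available): the formulas are evaluated directly and compared against scipy.stats.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate density N(x | mu, sigma^2), evaluated directly from the formula.
mu, sigma2, x = 1.0, 4.0, 2.5
manual = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
assert np.isclose(manual, norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))

# Multivariate density N(x | mu, Sigma) for d = 2.
mu_d = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
xd = np.array([0.5, 0.5])
diff = xd - mu_d
manual_d = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) \
    / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(Sigma))
assert np.isclose(manual_d, multivariate_normal.pdf(xd, mean=mu_d, cov=Sigma))
```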
Some reasons for Gaussians
Normal distributions play an important role in modeling for
several reasons:
Gaussians arise in the central limit theorem, which states
that the probability distribution of a sum of n i.i.d. random
variables with finite mean and variance approaches a
Gaussian distribution with increasing n.
Thus, if some outcome depends on several sources of
randomness (and we assume that these sources add up) it
may be well described by a Gaussian.
Among all distributions having some given mean and
variance, the Gaussian distribution has the highest entropy.
This means that if we can or want to fix mean and variance and
want to express maximum uncertainty about the outcomes,
then we arrive at the Gaussian distribution.
Recall II: Linear functions
$$f(x) = \langle x, w \rangle + w_0$$

[Figure: geometry of a linear function in the plane. The hyperplane $\{x \mid \langle x, w \rangle + w_0 = 0\}$ separates the region where $\langle x, w \rangle + w_0 > 0$ from the region where $\langle x, w \rangle + w_0 < 0$; the vector $w$ is normal to the hyperplane, $w_0/\lVert w \rVert$ relates to its distance from the origin, and $(\langle x, w \rangle + w_0)/\lVert w \rVert$ is the signed distance of a point $x$ from the hyperplane.]
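A small sketch of this geometry (not from the slides; the vectors w, w0 and x are made-up examples): the sign of f(x) tells on which side of the hyperplane x lies, and f(x)/||w|| is its signed distance to the hyperplane.

```python
import numpy as np

w = np.array([2.0, -1.0])   # normal vector of the hyperplane
w0 = 0.5                    # offset

def f(x):
    return np.dot(w, x) + w0

x = np.array([1.0, 3.0])
side = np.sign(f(x))                        # +1, -1, or 0 (on the hyperplane)
signed_distance = f(x) / np.linalg.norm(w)  # geometric distance with sign
```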
Outline
1 Linear Discriminant Analysis
2 Linear Classification and Margins
3 Perceptron Learning
4 Convergence of Perceptron Learning
5 Summary
Decision functions
Classification means assigning an input $x \in X$ to one class of a finite set of classes $Y = \{C_1, \ldots, C_m\}$, $2 \le m < \infty$.

One approach is to learn appropriate discrimination functions $\delta_k : X \to \mathbb{R}$, $1 \le k \le m$, and assign a pattern $x$ to class $y$ using:

$$y = h(x) = \operatorname*{argmax}_k \delta_k(x)$$
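An illustrative sketch of this decision rule (the function and variable names are mine, not from the slides): classification with discrimination functions reduces to an argmax over k.

```python
import numpy as np

def classify(x, deltas, classes):
    """deltas: list of discrimination functions delta_k; classes: class labels."""
    scores = [delta_k(x) for delta_k in deltas]
    return classes[int(np.argmax(scores))]

# Toy usage with two linear discrimination functions on R^2.
deltas = [lambda x: x[0] + x[1], lambda x: -x[0] + 2 * x[1]]
print(classify(np.array([1.0, 0.5]), deltas, ["C1", "C2"]))
```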
Binary decision functions
If we have only two classes, we can consider a single function

$$\delta(x) = \delta_1(x) - \delta_2(x)$$

and the hypothesis

$$h(x) = \begin{cases} C_1 & \text{if } \delta(x) > 0 \\ C_2 & \text{otherwise.} \end{cases}$$

For $Y = \{-1, 1\}$ this is equal to

$$h(x) = \operatorname{sgn}(\delta(x)) = \begin{cases} 1 & \text{if } \delta(x) > 0 \\ -1 & \text{otherwise.} \end{cases}$$
Decision functions and class posteriors
If we know the class posteriors $\Pr(Y \mid X)$ we can perform optimal classification: a pattern $x$ is assigned to the class $C_k$ with maximum $\Pr(Y = C_k \mid X = x)$, i.e.,

$$y = h(x) = \operatorname*{argmax}_k \Pr(Y = C_k \mid X = x)$$

or, in the binary case with $Y = \{-1, 1\}$,

$$\delta(x) = \Pr(Y = C_1 \mid X = x) - \Pr(Y = C_2 \mid X = x)$$

and $y = h(x) = \operatorname{sgn}(\delta(x))$.

$\Pr(Y = C_k \mid X = x)$ is proportional to the class-conditional density $p(X = x \mid Y = C_k)$ times the class prior $\Pr(Y = C_k)$:

$$\Pr(Y = C_k \mid X = x) = \frac{p(X = x \mid Y = C_k) \, \Pr(Y = C_k)}{p(X = x)}$$
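A toy numerical illustration of this Bayes rule (the numbers are made up, not from the slides): the posterior is the class-conditional density times the prior, normalised over all classes.

```python
import numpy as np

cond = np.array([0.05, 0.20])   # p(X = x | Y = C_k) for k = 1, 2
prior = np.array([0.7, 0.3])    # Pr(Y = C_k)
posterior = cond * prior / np.sum(cond * prior)  # Pr(Y = C_k | X = x)
y_hat = int(np.argmax(posterior))                # index of the predicted class
```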
Gaussian class-conditionals
Let's consider

$$\ln \Pr(Y = C_k \mid X = x) = \ln p(X = x \mid Y = C_k) + \ln \Pr(Y = C_k) + \text{const}$$

For $X = \mathbb{R}^d$ and Gaussian class-conditionals

$$p(X = x \mid Y = C_k) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

we have:

$$\ln \mathcal{N}(x \mid \mu_k, \Sigma_k) = \ln\left( \frac{1}{\sqrt{(2\pi)^d \det \Sigma_k}} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right) \right)$$
$$= -\frac{d}{2} \ln 2\pi - \frac{1}{2} \ln \det \Sigma_k - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)$$
$$= -\frac{d}{2} \ln 2\pi - \frac{1}{2} \ln \det \Sigma_k - \frac{1}{2} x^T \Sigma_k^{-1} x - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k$$
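A numerical sanity check of the expansion above (not part of the slides; it assumes SciPy's multivariate_normal): the expanded expression agrees with the log-density for a randomly drawn x, mu and covariance.

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 3
rng = np.random.default_rng(0)
x = rng.normal(size=d)
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)      # a valid (positive definite) covariance
P = np.linalg.inv(Sigma)         # precision matrix Sigma^{-1}

expanded = (-d / 2 * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * x @ P @ x
            - 0.5 * mu @ P @ mu
            + x @ P @ mu)
assert np.isclose(expanded, multivariate_normal.logpdf(x, mean=mu, cov=Sigma))
```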
Linear Discriminant Analysis (LDA)
Assume an identical covariance matrix $\Sigma$ for all class-conditionals:

$$\ln \Pr(Y = C_k \mid X = x) - \text{const} = \ln \Pr(Y = C_k) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln \det \Sigma - \frac{1}{2} x^T \Sigma^{-1} x - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k$$

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln \Pr(Y = C_k)$$

With $S_k = \{(x, y) \in S \mid y = C_k\}$ and $\ell_k = |S_k|$ we estimate:

$$\widehat{\Pr}(Y = C_k) = \ell_k / \ell$$

$$\hat{\mu}_k = \frac{1}{\ell_k} \sum_{(x, y) \in S_k} x$$

$$\hat{\Sigma} = \frac{1}{\ell} \sum_{k=1}^{m} \sum_{(x, y) \in S_k} (x - \hat{\mu}_k)(x - \hat{\mu}_k)^T$$
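A compact sketch of LDA based on these estimates (the interface and names are mine, not the slides'; classes are assumed to be encoded as 0, ..., m-1, and the pooled covariance is normalised by the total number of examples):

```python
import numpy as np

def lda_fit(X, y, m):
    """X: (l, d) array of patterns, y: length-l array of labels in {0, ..., m-1}."""
    l, d = X.shape
    priors = np.array([np.mean(y == k) for k in range(m)])        # l_k / l
    means = np.array([X[y == k].mean(axis=0) for k in range(m)])  # mu_k estimates
    Sigma = np.zeros((d, d))
    for k in range(m):
        diff = X[y == k] - means[k]
        Sigma += diff.T @ diff
    return priors, means, Sigma / l          # shared (pooled) covariance

def lda_predict(x, priors, means, Sigma):
    P = np.linalg.inv(Sigma)
    scores = [x @ P @ mu - 0.5 * mu @ P @ mu + np.log(p)
              for mu, p in zip(means, priors)]
    return int(np.argmax(scores))
```

The decision rule only needs the scores $\delta_k(x)$, which are linear in $x$, so the resulting classifier is linear.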
Effect of learning the covariance
[Figure omitted: two-panel illustration of the effect of learning the covariance matrix on the decision boundary.]
C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006
Linear and Quadratic Discriminant Analysis
In LDA the decision boundaries $\{x \mid \delta_i(x) = \delta_j(x)\}$ between two classes $i$ and $j$ are linear, and the hypotheses are linear functions $\mathbb{R}^d \to \mathbb{R}$.

Modeling independent covariance matrices for the class-conditionals leads to quadratic discriminant analysis (QDA) with quadratic decision functions:

$$\delta_k(x) = -\frac{1}{2} \ln \det \Sigma_k - \frac{1}{2} x^T \Sigma_k^{-1} x - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k + \ln \Pr(Y = C_k)$$
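A sketch of this QDA decision function (not from the slides; per-class means mus[k], covariances Sigmas[k] and priors priors[k] are assumed to have been estimated already):

```python
import numpy as np

def qda_delta(x, k, mus, Sigmas, priors):
    """Quadratic decision function delta_k(x) for class k."""
    P = np.linalg.inv(Sigmas[k])
    return (-0.5 * np.log(np.linalg.det(Sigmas[k]))
            - 0.5 * x @ P @ x
            - 0.5 * mus[k] @ P @ mus[k]
            + x @ P @ mus[k]
            + np.log(priors[k]))
```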
General linear binary classification
LDA is a linear classification method.

Given training examples

$$S = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\} \in \left( \mathbb{R}^n \times \{-1, 1\} \right)^\ell$$

a binary linear classifier assigns $x \in \mathbb{R}^n$ to one of two classes $\{-1, 1\}$ by an affine linear decision function identified by $(w, w_0)$:

$$\delta(x) = \langle w, x \rangle + w_0 = w^T x + w_0 = \sum_{i=1}^{n} w_i x_i + w_0$$

$x$ belongs to the first class if $\delta(x) \ge 0$, otherwise to the second, i.e., the resulting hypothesis is:

$$h(x) = \operatorname{sgn}(\delta(x))$$
Margins I
The functional margin of an example $(x_i, y_i)$ with respect to a hyperplane $(w, w_0)$ is

$$\hat{\gamma}_i := y_i (\langle w, x_i \rangle + w_0) \, .$$

The geometric margin of an example $(x_i, y_i)$ with respect to a hyperplane $(w, w_0)$ is

$$\gamma_i := y_i (\langle w, x_i \rangle + w_0) / \lVert w \rVert = \hat{\gamma}_i / \lVert w \rVert \, .$$

A positive margin implies correct classification.

The margin of a hyperplane $(w, w_0)$ with respect to a training set $S$ is $\min_i \gamma_i$. The margin of a training set $S$ is the maximum geometric margin over all hyperplanes. A hyperplane realizing this margin is called a maximum margin hyperplane.
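A direct sketch of these definitions (function names are mine): functional margin, geometric margin, and the margin of a hyperplane (w, w0) with respect to a training set.

```python
import numpy as np

def functional_margin(w, w0, x, y):
    return y * (np.dot(w, x) + w0)

def geometric_margin(w, w0, x, y):
    return functional_margin(w, w0, x, y) / np.linalg.norm(w)

def margin_of_hyperplane(w, w0, X, Y):
    """Smallest geometric margin over all training examples (X, Y)."""
    return min(geometric_margin(w, w0, x, y) for x, y in zip(X, Y))
```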
Margins II
[Figure omitted: illustration of the geometric margin $\gamma$ and the radius $R$ of a ball containing the data.]
Analyzing the Perceptron
Why should we look at the Perceptron?
Linear classifiers such as perceptrons are the basis of
technical neurocomputing
Support Vector Machines are basically linear classifiers
Basic concepts of learning theory can be explained easily:
Margins
Dual representation
Bounds involving margins and the radius of the ball
containing the data
Perceptron learning algorithm (primal form)
For simplicity, consider hyperplanes with no bias ($w_0 = 0$), i.e., $H = \{h(x) = \operatorname{sgn}(\langle w, x \rangle) \mid w \in \mathbb{R}^n\}$.

Algorithm 1: Perceptron
Input: separable data $\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\} \in (\mathbb{R}^n \times \{-1, 1\})^\ell$
Output: hypothesis $h(x) = \operatorname{sgn}(\langle w_k, x \rangle)$
1  $w_0 \leftarrow \mathbf{0}$; $k \leftarrow 0$
2  repeat
3      for $i = 1, \ldots, \ell$ do
4          if $y_i \langle w_k, x_i \rangle \le 0$ then
5              $w_{k+1} \leftarrow w_k + y_i x_i$
6              $k \leftarrow k + 1$
7  until no mistake made within for loop
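A direct Python rendering of Algorithm 1 (my sketch, not the original code; X is assumed to be an l x n array of linearly separable patterns and y a vector of labels in {-1, +1}, otherwise the loop does not terminate):

```python
import numpy as np

def perceptron(X, y):
    """Online perceptron without bias; returns the weight vector and update count."""
    w = np.zeros(X.shape[1])
    k = 0
    while True:
        mistake = False
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified (or on the hyperplane)
                w = w + yi * xi           # update step of Algorithm 1
                k += 1
                mistake = True
        if not mistake:                   # no mistake made within the for loop
            return w, k
```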
Perceptron learning in pictures
[Figure omitted: four snapshots of perceptron learning on a two-dimensional data set.]
C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006
Dual representation
The weight vector of the hyperplane computed by the online perceptron algorithm can be written as

$$w = \sum_{i=1}^{\ell} \alpha_i y_i x_i$$

The function $h(x) = \operatorname{sgn}(\delta(x))$ can be written in dual coordinates:

$$\delta(x) = \langle w, x \rangle = \left\langle \sum_{i=1}^{\ell} \alpha_i y_i x_i, \, x \right\rangle = \sum_{i=1}^{\ell} \alpha_i y_i \langle x_i, x \rangle$$
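A sketch of the dual decision function (names are mine): the weight vector never has to be formed explicitly, only inner products with the training patterns are needed.

```python
import numpy as np

def delta_dual(x, alpha, X, y):
    """Dual form of delta(x) = sum_i alpha_i * y_i * <x_i, x>."""
    return sum(a * yi * np.dot(xi, x) for a, yi, xi in zip(alpha, y, X))
```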
Perceptron learning algorithm (dual form)
Algorithm 2: Perceptron (dual form)
Input: separable data $\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\} \in (\mathbb{R}^n \times \{-1, 1\})^\ell$
Output: hypothesis $h(x) = \operatorname{sgn}\left( \sum_{i=1}^{\ell} \alpha_i y_i \langle x_i, x \rangle \right)$
1  $\alpha \leftarrow \mathbf{0}$
2  repeat
3      for $i = 1, \ldots, \ell$ do
4          if $y_i \sum_{j=1}^{\ell} \alpha_j y_j \langle x_j, x_i \rangle \le 0$ then
5              $\alpha_i \leftarrow \alpha_i + 1$
6  until no mistake made within for loop
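A Python sketch of Algorithm 2 under the same assumptions as for the primal form (my rendering, not the original code); alpha[i] simply counts how often example i triggered an update.

```python
import numpy as np

def perceptron_dual(X, y):
    """Dual-form online perceptron on separable data; returns the alpha vector."""
    l = len(X)
    alpha = np.zeros(l)
    while True:
        mistake = False
        for i in range(l):
            s = sum(alpha[j] * y[j] * np.dot(X[j], X[i]) for j in range(l))
            if y[i] * s <= 0:
                alpha[i] += 1
                mistake = True
        if not mistake:
            return alpha
```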
Novikoff

Theorem (Novikoff)
Let $S = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$ be a non-trivial training set (i.e., containing patterns of both classes), let $w_0 = \mathbf{0} = \sum_{i=1}^{\ell} 0 \cdot x_i$, and let

$$R \ge \max_{1 \le i \le \ell} \lVert x_i \rVert \, .$$

Suppose that there exist $w_{\mathrm{opt}}$ and $\gamma > 0$ such that $\lVert w_{\mathrm{opt}} \rVert = 1$ and

$$y_i \langle w_{\mathrm{opt}}, x_i \rangle \ge \gamma > 0$$

for $1 \le i \le \ell$. Then the number of updates $k$ made by the online perceptron algorithm on $S$ is at most

$$\left( \frac{R}{\gamma} \right)^2 .$$
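A toy empirical check of the bound (the data set is made up; the primal perceptron sketch from the earlier slide is repeated so the snippet is self-contained): for any feasible w_opt with unit norm and margin gamma > 0, the observed number of updates should not exceed (R / gamma)^2.

```python
import numpy as np

def perceptron(X, y):                     # primal sketch of Algorithm 1
    w, k = np.zeros(X.shape[1]), 0
    while True:
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:
                w, k, mistakes = w + yi * xi, k + 1, mistakes + 1
        if mistakes == 0:
            return w, k

X = np.array([[1.0, 2.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, k = perceptron(X, y)
R = max(np.linalg.norm(x) for x in X)
w_opt = w / np.linalg.norm(w)             # one feasible unit-norm separator
gamma = min(yi * np.dot(w_opt, xi) for xi, yi in zip(X, y))
assert gamma > 0 and k <= (R / gamma) ** 2
```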
Novikoff, sketch of proof I

Let $i$ be the index of the example in update $k$:

$$\lVert w_{k+1} \rVert^2 = \langle w_k + y_i x_i, \, w_k + y_i x_i \rangle = \lVert w_k \rVert^2 + 2 y_i \langle w_k, x_i \rangle + \lVert x_i \rVert^2 \le \lVert w_k \rVert^2 + R^2 \le (k + 1) R^2$$

(the cross term is non-positive because the example was misclassified, and $\lVert x_i \rVert \le R$).
Novikoff, sketch of proof II

$$\langle w_{\mathrm{opt}}, w_{k+1} \rangle = \langle w_{\mathrm{opt}}, w_k \rangle + y_i \langle w_{\mathrm{opt}}, x_i \rangle \ge \langle w_{\mathrm{opt}}, w_k \rangle + \gamma \ge (k + 1) \gamma$$

Combining both bounds (using the Cauchy-Schwarz inequality and $\lVert w_{\mathrm{opt}} \rVert = 1$):

$$k^2 \gamma^2 \le \langle w_{\mathrm{opt}}, w_k \rangle^2 \le \lVert w_{\mathrm{opt}} \rVert^2 \lVert w_k \rVert^2 \le k R^2$$

$$\Rightarrow \quad k \le \frac{R^2}{\gamma^2}$$
Summary
Linear discriminant analysis (LDA)
gives good results in practice,
is easy to use, because it has no hyperparameters,
usually does not overfit, but may not capture essential
non-linearities,
is highly recommended as a baseline method.
Hey, we also now know about
perceptron learning,
margins,
dual representation,
bounds involving margins and the radius of the ball
containing the data.
References:
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004
C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006