Linear Classification

uni vers i ty of copenhagen department of computer s ci ence
Faculty of Science
Linear Classication
Statistical Methods for Machine Learning
Christian Igel
Department of Computer Science
Slide 1/29
Recall I: Gaussian Distribution
Gaussian distribution of a single real-valued variable
with mean R and variance
2
:
N(x| ,
2
) =
1
2
exp
_
1
2
2
(x )
2
_
Multivariate Gaussian distribution of a d-dimensional
real-valued random vector with mean R
d
and covariance
matrix R
dd
:
N(x| , ) =
1
_
(2)
d
det
exp
_
1
2
(x )
T
1
(x )
_
Slide 2/29 Christian Igel Linear Classication
Some reasons for Gaussians
Normal distributions play an important role in modeling for
several reasons:
Gaussians arise in the central limit theorem, which states
that the probability distribution of a sum of n i.i.d. random
variables with nite mean and variance approaches a
Gaussian distribution with increasing n.
Thus, if some outcome depends on several sources of
randomness (and we assume that these sources add up) it
may be well described by a Gaussian.
Among all distributions having some given mean and
variance, the Gaussian distribution has the highest entropy.
This means, if we can or want to x mean and variance and
want to express maximum uncertainty about the outcomes,
then we arrive at the Gaussian distribution.
Recall II: Linear functions
f(x) = x, w + w
0
w
w
0
/w
x, w + w
0
> 0
x
, w + w
0
= 0
x
, w + w
0
< 0
x
x
x,w+w
0
w
Outline
1 Linear Discriminant Analysis
2 Linear Classication and Margins
3 Perceptron Learning
4 Convergence of Perceptron Learning
5 Summary
Outline
5 Summary
Decision functions
Classication means assigning an input x X to one class
of a nite set of classes Y = {C
1
, . . . , C
m
}, 2 m < .
One approach is to learn appropriate discrimination
functions
k
: X R, 1 k m, and assign a pattern x
to class y using:
y = h(x) = argmax
k

k
(x)
Binary decision functions
If we have only two classes, we can consider a single function
(x) =
1
(x)
2
(x)
and the hypothesis
h(x) =
_
C
1
if (x) > 0
C
2
otherwise
.
For Y = {1, 1} this is equal to
h(x) = sgn((x)) =
_
1 if (x) > 0
1 otherwise
.
Decision functions and class posteriors
If we know the class posteriors Pr(Y | X) we can perform
optimal classication: a pattern x is assigned to class C
k
with maximum Pr(Y = C
k
| X = x), i.e.,
y = h(x) = argmax
k
Pr(Y = C
k
| X = x)
or in the binary case with Y = {1, 1}
= Pr(Y = C
1
| X = x) Pr(Y = C
2
| X = x)
and y = h(x) = sgn((x)).
Pr(Y = C
k
| X = x) is proportional to the class-conditional
density p(X = x| Y = C
k
) times the class prior Pr(Y = C
k
):
Pr(Y = C
k
| X = x) =
p(X = x| Y = C
k
)Pr(Y = C
k
)
p(X = x)
Gaussian class-conditionals
Lets consider
ln Pr(Y = C
k
| X = x) = ln p(X = x| Y = C
k
)+ln Pr(Y = C
k
)+const
For X = R
d
and Gaussian class-conditionals
p(X = x| Y = C
k
) = N(x|
k
,
k
, Y = C
k
)
we have:
ln N(x|
k
,
k
, Y = C
k
) =
ln
_
1
_
(2)
d
det
k
exp
_
1
2
(x
k
)
T
1
k
(x
k
)
_
_
=
d
2
ln 2
1
2
ln det
k
1
2
(x
k
)
T
1
k
(x
k
) =
d
2
ln 2
1
2
ln det
k
1
2
x
T
1
k
x
1
2
T
k
1
k

k
+x
T
1
k

k
Linear Discriminant Analysis (LDA)
Assume identical covariance matrix for all class-conditionals
ln Pr(Y = C
k
| X = x) const = ln Pr(Y = C
k
)
d
2
ln 2
1
2
ln det
1
2
x
T
1
x
1
2
T
k
k
+x
T
k
(x) = x
T
1
2
T
k
k
+ ln Pr(Y = C
k
)
With S
k
= {(x, y) S | y = C
k
} and
k
= |S
k
| we estimate:
Pr(Y = C
k
) =
k
/

k
=
1
(x,y)S
k
x
=
1
m
m
k=1
(x,y)S
k
(x
k
)(x
k
)
T
Eect of learning the covariance
2 2 6
2
0
2
4
2 2 6
2
0
2
4
C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006
Linear and Quadratic Discriminant Analysis
In LDA the decision boundaries {x|
i
(x) =
j
(x)} between
two classes i and j are linear, and the hypotheses are linear
functions R
d
R.
Modeling independent covariance matrices for the
class-conditionals leads to quadratic discriminant analysis
(QDA) with quadratic decision functions:
k
(x) =
1
2
ln det
k
1
2
x
T
1
k
x
1
2
T
k
1
k

k
+x
T
1
k

k
+ ln Pr(Y = C
k
)
Outline
5 Summary
General linear binary classication
LDA is a linear classication method
Given training examples
S = {(x
1
, y
1
), . . . , (x
, y
)} (R
n
{1, 1})
a binary linear classier assigns x R

n
to one of two
classes {1, 1} by an ane linear decision function
identied by (w, w
0
):
(x) = w, x + w
0
= w
T
x + w
0
=
n
i=1
w
i
x
i
+ w
0
x belongs to the rst class if (x) 0, otherwise to the
second, i.e., the resulting hypothesis is:
h(x) = sgn((x))
Margins I
The functional margin of an example (x
i
, y
i
) with respect to a
hyperplane (w, w
0
) is
i
:= y
i
(w, x
i
+ w
0
) .
The geometric margin of an example (x
i
, y
i
) with respect to a
hyperplane (w, w
0
) is
i
:= y
i
(w, x
i
+ w
0
)/w =
i
/w .
A positive margin implies correct classication.
The margin of a hyperplane (w, w
0
) with respect to a training
set S is min
i
i
. The margin of a training set S is the
maximum geometric margin over all hyperplanes. A hyperplane
realizing this margin is called maximum margin hyperplane.
Margins II
R

R

Outline
5 Summary
Analyzing the Perceptron
Why should we look at the Perceptron?
Linear classiers such as perceptrons are the basis of
technical neurocomputing
Support Vector Machines are basically linear classiers
Basic concepts of learning theory can be explained easily:
Margins
Dual representation
Bounds involving margins and the radius of the ball
containing the data
Perceptron learning algorithm (primal form)
For simplicity, consider hyperplanes with no bias (w
0
= 0),
i.e., H = {h(x) = sgn(w, x) | w R
n
}.
Algorithm 1: Perceptron
Input: separable data {(x
1
, y
1
), . . . } (R
n
{1, 1})
Output: hypothesis h(x) = sgn(w

k
, x)
1 w
0
0; k 0
2 repeat
3 for i = 1, . . . , do
4 if y
i
w
k
, x
i
0 then
5 w
k+1
w
k
+ y
i
x
i
6 k k + 1
7 until no mistake made within for loop
Perceptron learning in pictures
1 0.5 0 0.5 1
1
0.5
0
0.5
1
1 0.5 0 0.5 1
1
0.5
0
0.5
1
1 0.5 0 0.5 1
1
0.5
0
0.5
1
1 0.5 0 0.5 1
1
0.5
0
0.5
1
Dual representation
Weight vector of hyperplane computed by online perceptron
algorithm can be written as
w =
i=1
i
y
i
x
i
Function h(x) = sgn((x)) can be written in dual
coordinates
(x) = w, x
=
_

i=1
i
y
i
x
i
, x
_
=
i=1
i
y
i
x
i
, x
Perceptron learning algorithm (dual form)
Algorithm 2: Perceptron (dual form)
Input: separable data {(x
1
, y
1
), . . . } (R
n
{1, 1})
Output: hypothesis h(x) = sgn

_
i=1
i
y
i
x
i
, x
_
1 0
2 repeat
3 for i = 1, . . . , do
4 if y
i
j=1
j
y
j
x
j
, x
i
0 then
5
i

i
+ 1
6 until no mistake made within for loop
Outline
5 Summary
Noviko
Theorem (Noviko)
Let S = {(x
1
, y
1
), . . . , (x
, y
)} be a non-trivial training set

(i.e., containing patterns of both classes), w
0
= 0 =
m
i=1
0x
i
and let
R max
1i
x
i
.
Suppose that there exists w
opt
and > 0 such that
w
opt
= 1 and
y
i
w
opt
, x
i
> 0
for 1 i . Then the number of updates k made by the
online perceptron algorithm on S is at most
_
R
_
2
.
Noviko, sketch of proof I
Let i be the index of the example in update k
w
k+1
2
= w
k
+ y
i
x
i
, w
k
+ y
i
x
i
= w
k
2
+ 2y
i
w
k
, x
i
+x
i
2
w
k
2
+ R
2
(k + 1)R
2
Noviko, sketch of proof II
w
opt
, w
k+1
= w
opt
, w
k
+ y
i
w
opt
, x
i
w
opt
, w
k
+
(k + 1)
k
2
2
w
opt
, w
k
2
w
opt
2
w
k
2
kR
2
k
R
2
2
Outline
5 Summary
Summary
Linear discriminant analysis (LDA)
gives good results in practice,
is easy to use, because it has no hyperparameters,
usually does not overt, but may not capture essential
non-linearities,
it is highly recommended as a baseline method.
Hey, we also now know about
perceptron learning,
margins,
dual representation,
bounds involving margins and the radius of the ball
containing the data.
References:
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004

Linear Classification

Enviado por

Dados do documento

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

Linear Classification

Enviado por

Direitos autorais:

Formatos disponíveis

uni vers i ty of copenhagen department of computer s ci ence

a binary linear classier assigns x R

Slide 17/29 Christian Igel Linear Classication

Output: hypothesis h(x) = sgn(w

Output: hypothesis h(x) = sgn

)} be a non-trivial training set

Você também pode gostar