Faculty of Science

Linear Classification
Statistical Methods for Machine Learning

Christian Igel
Department of Computer Science
Recall I: Gaussian Distribution

Gaussian distribution of a single real-valued variable $x$ with mean $\mu \in \mathbb{R}$ and variance $\sigma^2$:

$$\mathcal{N}(x \,|\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)$$

Multivariate Gaussian distribution of a $d$-dimensional real-valued random vector $x$ with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$:

$$\mathcal{N}(x \,|\, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
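Both densities are easy to evaluate numerically. As a quick sanity check (my addition, not part of the slides; it assumes NumPy and SciPy), direct evaluation of the multivariate formula agrees with SciPy's built-in density:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Univariate Gaussian density with illustrative mean 0 and variance 4
p_uni = norm.pdf(1.0, loc=0.0, scale=np.sqrt(4.0))

# Multivariate Gaussian density in d = 2 dimensions (illustrative parameters)
mean = np.array([0.0, 1.0])
cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])
x = np.array([0.5, 0.5])
p_multi = multivariate_normal.pdf(x, mean=mean, cov=cov)

# Direct evaluation of the formula from the slide, for comparison
d = mean.shape[0]
diff = x - mean
p_manual = (np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)
            / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov)))
assert np.isclose(p_multi, p_manual)
```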
Some reasons for Gaussians

Normal distributions play an important role in modeling for several reasons:

- Gaussians arise in the central limit theorem, which states that the probability distribution of a sum of $n$ i.i.d. random variables with finite mean and variance approaches a Gaussian distribution with increasing $n$. Thus, if some outcome depends on several sources of randomness (and we assume that these sources add up), it may be well described by a Gaussian; the sketch after this list illustrates the effect numerically.
- Among all distributions having some given mean and variance, the Gaussian distribution has the highest entropy. This means that if we can or want to fix mean and variance and want to express maximum uncertainty about the outcomes, then we arrive at the Gaussian distribution.
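A minimal numerical illustration of the first point (my addition, assuming NumPy): summing i.i.d. uniform variables and standardizing gives samples whose statistics approach those of a standard Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum n i.i.d. uniform variables (mean 0.5, variance 1/12) and standardize;
# by the central limit theorem the result looks increasingly Gaussian.
n, samples = 30, 100_000
sums = rng.uniform(0.0, 1.0, size=(samples, n)).sum(axis=1)
standardized = (sums - n * 0.5) / np.sqrt(n / 12)

print(standardized.mean())                  # close to 0
print(standardized.std())                   # close to 1
print(np.mean(np.abs(standardized) < 1.0))  # close to 0.683 for a Gaussian
```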
Recall II: Linear functions

$$f(x) = \langle x, w \rangle + w_0$$

[Figure: the hyperplane $\{x \,|\, \langle x, w \rangle + w_0 = 0\}$ with normal vector $w$ and offset $w_0 / \|w\|$ from the origin; points with $\langle x, w \rangle + w_0 > 0$ lie on the side $w$ points to, points with $\langle x, w \rangle + w_0 < 0$ on the other side, and $(\langle x, w \rangle + w_0) / \|w\|$ is the signed distance of a point $x$ from the hyperplane.]
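To make the geometry concrete, here is a small sketch (my addition, assuming NumPy) that evaluates the linear function and the signed distance it induces:

```python
import numpy as np

def linear_fn(x, w, w0):
    """Linear function f(x) = <x, w> + w0."""
    return x @ w + w0

def signed_distance(x, w, w0):
    """Signed distance of x from the hyperplane {x | <x, w> + w0 = 0}."""
    return linear_fn(x, w, w0) / np.linalg.norm(w)

w, w0 = np.array([1.0, -2.0]), 0.5
x = np.array([3.0, 1.0])
print(linear_fn(x, w, w0))        # 1.5 > 0: x lies on the positive side
print(signed_distance(x, w, w0))  # the same quantity scaled by 1 / ||w||
```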
Outline

1. Linear Discriminant Analysis
2. Linear Classification and Margins
3. Perceptron Learning
4. Convergence of Perceptron Learning
5. Summary
Decision functions

Classification means assigning an input $x \in X$ to one class of a finite set of classes $Y = \{C_1, \ldots, C_m\}$, $2 \le m < \infty$.

One approach is to learn appropriate discriminant functions $\delta_k : X \to \mathbb{R}$, $1 \le k \le m$, and assign a pattern $x$ to class $y$ using:

$$y = h(x) = \operatorname*{argmax}_{k} \delta_k(x)$$
Binary decision functions

If we have only two classes, we can consider a single function

$$\delta(x) = \delta_1(x) - \delta_2(x)$$

and the hypothesis

$$h(x) = \begin{cases} C_1 & \text{if } \delta(x) > 0 \\ C_2 & \text{otherwise} \end{cases}$$

For $Y = \{-1, 1\}$ this is equal to

$$h(x) = \operatorname{sgn}(\delta(x)) = \begin{cases} 1 & \text{if } \delta(x) > 0 \\ -1 & \text{otherwise} \end{cases}$$
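Both decision rules are one-liners in code. A small sketch (my addition, with made-up discriminant values) that mirrors the definitions above:

```python
import numpy as np

# Hypothetical discriminant values delta_1(x), delta_2(x), delta_3(x) at one input
deltas = np.array([0.2, 1.7, -0.5])
y_multi = np.argmax(deltas)        # h(x) = argmax_k delta_k(x)

# Binary case via a single function delta(x) = delta_1(x) - delta_2(x)
delta = deltas[0] - deltas[1]
y_binary = 1 if delta > 0 else -1  # h(x) = sgn(delta(x)) with Y = {-1, 1}
```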
Decision functions and class posteriors

If we know the class posteriors $\Pr(Y \,|\, X)$ we can perform optimal classification: a pattern $x$ is assigned to the class $C_k$ with maximum $\Pr(Y = C_k \,|\, X = x)$, i.e.,

$$y = h(x) = \operatorname*{argmax}_{k} \Pr(Y = C_k \,|\, X = x)$$

or, in the binary case with $Y = \{-1, 1\}$,

$$\delta(x) = \Pr(Y = C_1 \,|\, X = x) - \Pr(Y = C_2 \,|\, X = x)$$

and $y = h(x) = \operatorname{sgn}(\delta(x))$.

$\Pr(Y = C_k \,|\, X = x)$ is proportional to the class-conditional density $p(X = x \,|\, Y = C_k)$ times the class prior $\Pr(Y = C_k)$:

$$\Pr(Y = C_k \,|\, X = x) = \frac{p(X = x \,|\, Y = C_k) \, \Pr(Y = C_k)}{p(X = x)}$$
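To illustrate the optimal decision rule, here is a sketch (my addition, assuming SciPy, with one-dimensional inputs and made-up parameters) that turns class-conditional densities and priors into posteriors:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.3, 0.7])    # Pr(Y = C_1), Pr(Y = C_2), made up
x = 0.8
likelihoods = np.array([norm.pdf(x, loc=0.0, scale=1.0),    # p(x | C_1)
                        norm.pdf(x, loc=2.0, scale=1.0)])   # p(x | C_2)

# Posterior is proportional to likelihood times prior;
# the evidence p(x) is just the normalizing constant.
joint = likelihoods * priors
posteriors = joint / joint.sum()

delta = posteriors[0] - posteriors[1]
y = 1 if delta > 0 else -1       # assign to the class with maximum posterior
```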
Gaussian class-conditionals

Let's consider

$$\ln \Pr(Y = C_k \,|\, X = x) = \ln p(X = x \,|\, Y = C_k) + \ln \Pr(Y = C_k) + \text{const}$$

For $X = \mathbb{R}^d$ and Gaussian class-conditionals

$$p(X = x \,|\, Y = C_k) = \mathcal{N}(x \,|\, \mu_k, \Sigma_k)$$

we have:

$$\begin{aligned}
\ln \mathcal{N}(x \,|\, \mu_k, \Sigma_k)
&= \ln\left( \frac{1}{\sqrt{(2\pi)^d \det \Sigma_k}} \exp\left( -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right) \right) \\
&= -\frac{d}{2} \ln 2\pi - \frac{1}{2} \ln \det \Sigma_k - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \\
&= -\frac{d}{2} \ln 2\pi - \frac{1}{2} \ln \det \Sigma_k - \frac{1}{2} x^T \Sigma_k^{-1} x - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k
\end{aligned}$$
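A quick numerical check of this expansion (my addition, assuming NumPy and SciPy): the expanded form in the last line must agree with the log-density computed directly.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
d = 3
x, mu = rng.standard_normal(d), rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)    # random symmetric positive definite covariance
Sigma_inv = np.linalg.inv(Sigma)

# Expanded quadratic form from the last line of the slide
expanded = (-d / 2 * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * x @ Sigma_inv @ x
            - 0.5 * mu @ Sigma_inv @ mu
            + x @ Sigma_inv @ mu)

assert np.isclose(expanded, multivariate_normal.logpdf(x, mean=mu, cov=Sigma))
```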
Linear Discriminant Analysis (LDA)

Assume an identical covariance matrix $\Sigma$ for all class-conditionals:

$$\ln \Pr(Y = C_k \,|\, X = x) - \text{const} = \ln \Pr(Y = C_k) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln \det \Sigma - \frac{1}{2} x^T \Sigma^{-1} x - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k$$

Dropping all terms that do not depend on $k$ yields the linear discriminant functions

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln \Pr(Y = C_k)$$

With $S_k = \{(x, y) \in S \,|\, y = C_k\}$ and $\ell_k = |S_k|$ we estimate:

$$\widehat{\Pr}(Y = C_k) = \ell_k / \ell$$
$$\hat{\mu}_k = \frac{1}{\ell_k} \sum_{(x, y) \in S_k} x$$
$$\hat{\Sigma} = \frac{1}{\ell} \sum_{k=1}^{m} \sum_{(x, y) \in S_k} (x - \hat{\mu}_k)(x - \hat{\mu}_k)^T$$
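Putting the estimators and the discriminant together, here is a compact NumPy sketch of LDA (my own illustration; the function names and toy data are made up, and no numerical safeguards are included):

```python
import numpy as np

def fit_lda(X, y, classes):
    """Estimate LDA parameters: priors, class means, pooled covariance."""
    l, d = X.shape
    priors, means = [], []
    Sigma = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        priors.append(len(Xc) / l)
        means.append(mu)
        Sigma += (Xc - mu).T @ (Xc - mu)   # scatter of class c around its mean
    return np.array(priors), np.array(means), Sigma / l

def lda_predict(X, priors, means, Sigma):
    """Assign each row of X to argmax_k delta_k(x)."""
    Sigma_inv = np.linalg.inv(Sigma)
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + ln Pr(Y = C_k)
    lin = X @ Sigma_inv @ means.T
    const = -0.5 * np.einsum('kd,de,ke->k', means, Sigma_inv, means) + np.log(priors)
    return np.argmax(lin + const, axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([3.0, 2.0], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
priors, means, Sigma = fit_lda(X, y, classes=[0, 1])
print((lda_predict(X, priors, means, Sigma) == y).mean())  # training accuracy
```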
Effect of learning the covariance

[Figure: two two-dimensional panels illustrating the effect of learning the covariance on the decision boundary. From C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.]
Linear and Quadratic Discriminant Analysis

In LDA the decision boundaries $\{x \,|\, \delta_i(x) = \delta_j(x)\}$ between two classes $i$ and $j$ are linear, and the hypotheses are linear functions $\mathbb{R}^d \to \mathbb{R}$.

Modeling independent covariance matrices for the class-conditionals leads to quadratic discriminant analysis (QDA) with quadratic decision functions:

$$\delta_k(x) = -\frac{1}{2} \ln \det \Sigma_k - \frac{1}{2} x^T \Sigma_k^{-1} x - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k + \ln \Pr(Y = C_k)$$
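Relative to the LDA sketch above, the only change is a per-class covariance matrix. A minimal version of the QDA discriminant for a single class (my addition, assuming NumPy):

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, prior_k):
    """delta_k(x) with class-specific covariance Sigma_k (illustrative sketch)."""
    Sigma_inv = np.linalg.inv(Sigma_k)
    return (-0.5 * np.log(np.linalg.det(Sigma_k))
            - 0.5 * x @ Sigma_inv @ x
            - 0.5 * mu_k @ Sigma_inv @ mu_k
            + x @ Sigma_inv @ mu_k
            + np.log(prior_k))
```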
General linear binary classification

LDA is a linear classification method.

Given training examples

$$S = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\} \subset (\mathbb{R}^n \times \{-1, 1\})^\ell ,$$

we look for a linear decision function

$$\delta(x) = \sum_{i=1}^{n} w_i x_i + w_0 .$$

A pattern $x$ belongs to the first class if $\delta(x) \ge 0$, otherwise to the second, i.e., the resulting hypothesis is:

$$h(x) = \operatorname{sgn}(\delta(x))$$
Margins I

The functional margin of an example $(x_i, y_i)$ with respect to a hyperplane $(w, w_0)$ is

$$\gamma_i := y_i (\langle w, x_i \rangle + w_0) .$$

The geometric margin of an example $(x_i, y_i)$ with respect to a hyperplane $(w, w_0)$ is

$$\tilde{\gamma}_i := y_i (\langle w, x_i \rangle + w_0) / \|w\| = \gamma_i / \|w\| .$$

A positive margin implies correct classification.

The margin of a hyperplane $(w, w_0)$ with respect to a training set $S$ is $\min_i \tilde{\gamma}_i$. The margin of a training set $S$ is the maximum geometric margin over all hyperplanes. A hyperplane realizing this margin is called a maximum margin hyperplane.
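The definitions translate directly into code. A small sketch (my addition, with a made-up hyperplane and data) computing functional and geometric margins:

```python
import numpy as np

def margins(X, y, w, w0):
    """Functional and geometric margins of examples (X, y) w.r.t. (w, w0)."""
    functional = y * (X @ w + w0)
    geometric = functional / np.linalg.norm(w)
    return functional, geometric

X = np.array([[2.0, 1.0], [-1.0, -0.5], [0.5, -2.0]])
y = np.array([1, -1, -1])
w, w0 = np.array([1.0, 1.0]), -0.5
functional, geometric = margins(X, y, w, w0)
print(functional)        # positive entries are correctly classified
print(geometric.min())   # the margin of this hyperplane on this training set
```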
Margins II

[Figure: the training data lie within a ball of radius $R$.]

Suppose the weight vector is a linear combination of the training examples,

$$w = \sum_{i=1}^{\ell} \alpha_i y_i x_i .$$

Then the function $h(x) = \operatorname{sgn}(\delta(x))$ can be written in dual coordinates:

$$\delta(x) = \langle w, x \rangle = \left\langle \sum_{i=1}^{\ell} \alpha_i y_i x_i \,,\, x \right\rangle = \sum_{i=1}^{\ell} \alpha_i y_i \langle x_i, x \rangle$$
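The primal and dual forms of $\delta$ agree by construction; a quick numerical check (my addition, with made-up coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))              # 5 training points in R^3
y = np.array([1, -1, 1, 1, -1])
alpha = np.array([2.0, 0.0, 1.0, 3.0, 1.0])  # made-up dual coefficients

w = (alpha * y) @ X                          # primal: w = sum_i alpha_i y_i x_i
x_new = rng.standard_normal(3)
primal = w @ x_new
dual = np.sum(alpha * y * (X @ x_new))       # dual: sum_i alpha_i y_i <x_i, x_new>
assert np.isclose(primal, dual)
```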
Perceptron learning algorithm (dual form)

Algorithm 2: Perceptron (dual form)

Input: separable data $\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\} \subset (\mathbb{R}^n \times \{-1, 1\})^\ell$

1: $\alpha \leftarrow 0$
2: repeat
3:   for $i = 1, \ldots, \ell$ do
4:     if $y_i \sum_{j=1}^{\ell} \alpha_j y_j \langle x_j, x_i \rangle \le 0$ then
5:       $\alpha_i \leftarrow \alpha_i + 1$
6: until no mistake made within for loop
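A runnable NumPy sketch of Algorithm 2 (my addition; the stopping cap `max_epochs` and the toy data are made up, and there is no bias term, matching the algorithm above):

```python
import numpy as np

def dual_perceptron(X, y, max_epochs=1000):
    """Dual-form perceptron; alpha[i] counts the updates made on example i."""
    l = len(y)
    G = X @ X.T                        # Gram matrix of inner products <x_j, x_i>
    alpha = np.zeros(l)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(l):
            if y[i] * np.sum(alpha * y * G[:, i]) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:              # no mistake made within the for loop
            return alpha
    raise RuntimeError("no convergence; is the data linearly separable?")

# Toy data, linearly separable through the origin
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
alpha = dual_perceptron(X, y)
w = (alpha * y) @ X                    # recover the primal weight vector
print(alpha, w)
```

Because the algorithm touches the inputs only through inner products, the Gram matrix can be replaced by any kernel matrix, which is the main appeal of the dual form (see Shawe-Taylor and Cristianini, 2004).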
Novikoff

Theorem (Novikoff)
Let $S = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$ with $x_i \neq 0$ for all $i$, and let

$$R = \max_{1 \le i \le \ell} \|x_i\| .$$

Suppose that there exist $w_{\mathrm{opt}}$ and $\gamma > 0$ such that $\|w_{\mathrm{opt}}\| = 1$ and

$$y_i \langle w_{\mathrm{opt}}, x_i \rangle \ge \gamma > 0$$

for $1 \le i \le \ell$. Then the number of updates $k$ made by the online perceptron algorithm on $S$ is at most

$$\left( \frac{R}{\gamma} \right)^2 .$$
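An empirical check of the bound on the toy data from the previous sketch (my addition): run the primal online perceptron, count updates, and compare with $(R/\gamma)^2$ computed from some separating unit vector. Any feasible $(w_{\mathrm{opt}}, \gamma)$ gives a valid upper bound.

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w_opt = np.array([1.0, 1.0]) / np.sqrt(2.0)   # a separating unit vector
gamma = np.min(y * (X @ w_opt))               # margin achieved by w_opt
R = np.max(np.linalg.norm(X, axis=1))

# Primal online perceptron (no bias term), counting updates
w, updates = np.zeros(2), 0
while True:
    mistakes = 0
    for i in range(len(y)):
        if y[i] * (w @ X[i]) <= 0:
            w += y[i] * X[i]
            updates += 1
            mistakes += 1
    if mistakes == 0:
        break

print(updates, (R / gamma) ** 2)   # the update count never exceeds the bound
```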
Novikoff, sketch of proof I

Let $i$ be the index of the example used in update $k$, so that $w_{k+1} = w_k + y_i x_i$. Since the example was misclassified, $y_i \langle w_k, x_i \rangle \le 0$, and $\|x_i\| \le R$ by the definition of $R$:

$$\begin{aligned}
\|w_{k+1}\|^2 &= \langle w_k + y_i x_i \,,\, w_k + y_i x_i \rangle \\
&= \|w_k\|^2 + 2 y_i \langle w_k, x_i \rangle + \|x_i\|^2 \\
&\le \|w_k\|^2 + R^2 \\
&\le (k + 1) R^2
\end{aligned}$$

The last step follows by induction, starting from $w_0 = 0$.
Novikoff, sketch of proof II

$$\begin{aligned}
\langle w_{\mathrm{opt}}, w_{k+1} \rangle &= \langle w_{\mathrm{opt}}, w_k \rangle + y_i \langle w_{\mathrm{opt}}, x_i \rangle \\
&\ge \langle w_{\mathrm{opt}}, w_k \rangle + \gamma \\
&\ge (k + 1) \gamma
\end{aligned}$$

Combining this with part I via the Cauchy–Schwarz inequality,

$$k^2 \gamma^2 \le \langle w_{\mathrm{opt}}, w_k \rangle^2 \le \|w_{\mathrm{opt}}\|^2 \, \|w_k\|^2 \le k R^2 ,$$

and since $\|w_{\mathrm{opt}}\| = 1$, the number of updates satisfies

$$k \le \frac{R^2}{\gamma^2} .$$
Summary

Linear discriminant analysis (LDA)
- gives good results in practice,
- is easy to use, because it has no hyperparameters,
- usually does not overfit, but may not capture essential non-linearities,
- is highly recommended as a baseline method.

We also now know about
- perceptron learning,
- margins,
- the dual representation,
- bounds involving margins and the radius of the ball containing the data.

References:
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.