
Support Vector Machines

One of the most useful tools in machine learning


Optimal margin classifier
Kernels
Allow for the use of high dimensional feature vectors

Three blanket assumptions to start:


Data is separable
The discriminant function (prediction function) is linear
Binary classification with labels y_i ∈ {−1, +1}

The discriminant function


Linear discriminant function: f(x) = w^T x + b
w is called the weight vector, b is called the bias
f(x) ≥ 0  ⟹  x is in class 1
f(x) < 0  ⟹  x is in class 2

Observation 1: w is orthogonal to the decision boundary


Proof: consider two points x_1 and x_2 on the decision boundary:
f(x_1) = w^T x_1 + b = 0
f(x_2) = w^T x_2 + b = 0
Subtracting gives w^T (x_1 − x_2) = 0, so w is orthogonal to any vector lying in the
decision boundary; it points in the direction normal to the boundary
(or in the opposite direction)


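A minimal numerical sketch of this slide; the weight vector w, bias b and points below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Linear discriminant: f(x) = w^T x + b (hypothetical w and b for illustration).
w = np.array([2.0, -1.0])   # weight vector (normal to the decision boundary)
b = -0.5                    # bias

def f(x):
    """Linear discriminant function."""
    return w @ x + b

x = np.array([1.0, 0.5])
print("f(x) =", f(x))                    # signed value
print("class:", 1 if f(x) >= 0 else -1)  # f(x) >= 0 -> class 1, f(x) < 0 -> class 2

# Orthogonality check: for two points x1, x2 on the boundary (f = 0),
# w^T (x1 - x2) = 0, so w is orthogonal to the decision boundary.
x1 = np.array([0.0, -0.5])   # f(x1) = 0
x2 = np.array([1.0,  1.5])   # f(x2) = 0
print("w . (x1 - x2) =", w @ (x1 - x2))  # ~0
```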

Distance from a point to the decision boundary


Given a point x, its distance to the line (hyperplane) defined by the
discriminant function is given by
|f(x)| / ‖w‖ = |w^T x + b| / ‖w‖

We consider the minimum distance over all training points:
min_{i=1,...,m} |f(x_i)| / ‖w‖

And we want to find the classifier that maximizes this quantity: the margin (gap).
We impose the constraint that all training points, with labels y_i ∈ {−1, +1},
are correctly classified.
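A small sketch of the distance computation, again with hypothetical w, b and points:

```python
import numpy as np

# Distance from a point x to the hyperplane w^T x + b = 0: |f(x)| / ||w||.
w = np.array([2.0, -1.0])
b = -0.5

def distance(x):
    return abs(w @ x + b) / np.linalg.norm(w)

X = np.array([[1.0, 0.5], [3.0, 1.0], [-1.0, 2.0]])  # toy training points
dists = np.array([distance(x) for x in X])
print("distances:", dists)
print("margin of this classifier (minimum distance):", dists.min())
```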

SVM formulation
Therefore we require y_i f(x_i) > 0 for all i.
The distance from any point x_i to the decision boundary is therefore
y_i f(x_i) / ‖w‖ = y_i (w^T x_i + b) / ‖w‖

The problem of finding the maximum margin classifier is
arg max_{w,b} { min_{i=1,...,m} y_i (w^T x_i + b) / ‖w‖ }

This min-max problem is difficult to solve directly, but it can be recast as a
quadratic optimization problem, as follows.

SVM formulation
Let x_j be a closest point to the decision boundary; then we can assume (!) that
y_j (w^T x_j + b) = 1
(rescaling w and b by a common factor changes neither the decision boundary nor
the classifier, so we are free to fix this scale).

Then for all data points
y_i (w^T x_i + b) ≥ 1

The optimization problem
arg max_{w,b} { min_{i=1,...,m} y_i (w^T x_i + b) / ‖w‖ }
can be rewritten as
arg max_{w,b} 1/‖w‖ subject to y_i (w^T x_i + b) − 1 ≥ 0, i = 1,...,m

Quadratic optimization formulation


The final step is to observe the equivalence
arg max_{w,b} 1/‖w‖ subject to y_i (w^T x_i + b) − 1 ≥ 0
⟺ arg min_{w,b} ½‖w‖² subject to y_i (w^T x_i + b) − 1 ≥ 0

This is a standard problem that can be solved with a quadratic programming solver
(interior point or active set). But if we stop here we will fail to gain one of the
most important insights in machine learning, and we will not have a powerful approach.
The key is to look at the dual problem.
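A sketch of solving this quadratic program directly, assuming the cvxpy package and a small synthetic separable data set:

```python
import numpy as np
import cvxpy as cp

# Hard-margin primal on toy separable data (two well-separated Gaussian blobs).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()
# minimize (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin = 1/||w|| =", 1.0 / np.linalg.norm(w.value))
```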

Optimality conditions
Consider the optimization problem
min_x f(x) subject to c_i(x) ≥ 0, i = 1,...,m

Define the Lagrangian:
L(x, λ) = f(x) − Σ_{i=1}^m λ_i c_i(x)

Any solution of the nonlinear optimization problem satisfies

∇_x L(x*, λ) = ∇f(x*) − Σ_{i=1}^m λ_i ∇c_i(x*) = 0   (stationarity)
λ_i c_i(x*) = 0                                      (complementarity)
λ_i ≥ 0, c_i(x*) ≥ 0                                 (feasibility)

By complementarity, all multipliers λ_i corresponding to inactive constraints
(i.e. c_i(x*) > 0) are zero.
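A tiny worked example of these conditions; the problem below is hypothetical and chosen only to illustrate stationarity, complementarity and feasibility:

```python
# Worked KKT check for:  minimize f(x) = (x - 2)^2
#                        subject to c1(x) = x >= 0,  c2(x) = 1 - x >= 0.
# The solution is x* = 1, where c2 is active and c1 is inactive.
x_star = 1.0
c1, c2 = x_star, 1.0 - x_star          # c1(x*) = 1 > 0 (inactive), c2(x*) = 0 (active)
grad_f = 2.0 * (x_star - 2.0)          # f'(x*) = -2
grad_c1, grad_c2 = 1.0, -1.0

lam1 = 0.0                             # complementarity: c1 inactive => multiplier is 0
lam2 = 2.0                             # from stationarity: -2 - lam1*1 - lam2*(-1) = 0

print("stationarity:", grad_f - lam1 * grad_c1 - lam2 * grad_c2)  # 0.0
print("complementarity:", lam1 * c1, lam2 * c2)                   # 0.0 0.0
print("feasibility:", c1 >= 0 and c2 >= 0 and lam1 >= 0 and lam2 >= 0)
```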

For our problem

arg min_{w,b} ½‖w‖² subject to y_i (w^T x_i + b) − 1 ≥ 0, i = 1,...,m

The Lagrangian is
L(w, b, α) = ½‖w‖² − Σ_{i=1}^m α_i [ y_i (w^T x_i + b) − 1 ]

Stationarity of the Lagrangian:
∇_w L(w, b, α) = w − Σ_{i=1}^m α_i y_i x_i = 0  ⟹  w = Σ_{i=1}^m α_i y_i x_i
∂L/∂b (w, b, α) = −Σ_{i=1}^m α_i y_i = 0

Once the multipliers are computed we obtain w

Recall:
L(w, b, α) = ½‖w‖² − Σ_{i=1}^m α_i [ y_i (w^T x_i + b) − 1 ]
w = Σ_{i=1}^m α_i y_i x_i,   Σ_{i=1}^m α_i y_i = 0

Substituting:
L(w, b, α) = ½‖w‖² − Σ_{i=1}^m α_i y_i (w^T x_i) − b Σ_{i=1}^m α_i y_i + Σ_{i=1}^m α_i
           = ½‖w‖² − Σ_{i=1}^m α_i y_i (w^T x_i) − 0 + Σ_{i=1}^m α_i
           = ½‖w‖² − Σ_{i=1}^m α_i y_i (Σ_{j=1}^m α_j y_j x_j)^T x_i + Σ_{i=1}^m α_i
           = Σ_{i=1}^m α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j

Dual Problem
max_α L(α) = Σ_{i=1}^m α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
subject to α_i ≥ 0, i = 1,...,m
           Σ_{i=1}^m α_i y_i = 0

By complementarity,
α_i [ y_i (w^T x_i + b) − 1 ] = 0
so for all training points that are not support vectors, α_i = 0.

After solving the dual, w = Σ_{i∈S} α_i y_i x_i, where S is the set of support vectors.

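A sketch of solving this dual numerically, again assuming cvxpy and a small synthetic separable data set; the support vectors are the points with α_i > 0:

```python
import numpy as np
import cvxpy as cp

# Dual of the hard-margin SVM on toy separable data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
m = X.shape[0]

alpha = cp.Variable(m)
Yx = y[:, None] * X   # rows are y_i x_i, so ||Yx.T @ alpha||^2 = sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Yx.T @ alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
S = np.where(a > 1e-4)[0]            # support vectors: alpha_i > 0 (up to solver tolerance)
w = (a[S] * y[S]) @ X[S]             # w = sum_{i in S} alpha_i y_i x_i
print("support vector indices:", S)
print("w =", w)
```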

Bias
For any support vector, y_s (w^T x_s + b) = 1, i.e.
w^T x_s + b = y_s
since y_s = 1 or −1 and (y_s)² = 1.

We can solve for b.

For a more stable computation, we use all support vectors and take the average:
b = (1/|S|) Σ_{j∈S} [ y_j − w^T x_j ]
  = (1/|S|) Σ_{j∈S} [ y_j − Σ_{i∈S} α_i y_i (x_i)^T x_j ]
  = (1/|S|) Σ_{j∈S} [ y_j − Σ_{i∈S} α_i y_i K(x_i, x_j) ]
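A sketch of this averaged bias computation; the function names, and the assumption that the multipliers α, the support set S and a kernel are already available, are for illustration only:

```python
import numpy as np

def linear_kernel(u, v):
    return u @ v                          # K(x, y) = x^T y

def bias(X, y, alpha, S, kernel=linear_kernel):
    """b = (1/|S|) sum_{j in S} [ y_j - sum_{i in S} alpha_i y_i K(x_i, x_j) ]"""
    total = 0.0
    for j in S:
        total += y[j] - sum(alpha[i] * y[i] * kernel(X[i], X[j]) for i in S)
    return total / len(S)
```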

Making predictions
Suppose that we have computed our prediction function.
We are now given an (unseen) point x and compute
w^T x + b = [ Σ_{i=1}^m α_i y_i x_i ]^T x + b,   since w = Σ_{i=1}^m α_i y_i x_i
          = Σ_{i=1}^m α_i y_i K(x_i, x) + b
where we have defined K(x, y) = x^T y.

In summary, after computing the multipliers, most of which are zero, to make a
prediction we need to evaluate the kernel function for every support vector,
and this is the dominant cost.

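A sketch of the resulting prediction rule; the helper names below are hypothetical:

```python
import numpy as np

# Only the support vectors contribute, so the dominant cost is one kernel
# evaluation per support vector.
def predict(x, X_sv, y_sv, alpha_sv, b, kernel=lambda u, v: u @ v):
    """Return sign( sum_{i in S} alpha_i y_i K(x_i, x) + b )."""
    score = sum(a * t * kernel(xi, x) for a, t, xi in zip(alpha_sv, y_sv, X_sv)) + b
    return 1 if score >= 0 else -1
```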

Kernels
So far we have assumed that the feature vector contains simple measurements.
But we may have greater mathematical flexibility if we also consider (e.g.) products:
x = (x_1, x_2, x_3)^T,   φ(x) = (x_1, x_2, x_3, x_1 x_2, x_2 x_3, x_1 x_3, ...)^T
(for example)

Go back and, in the previous derivation, replace x by φ(x).

The only change is:
K(x, y) = φ(x)^T φ(y)

Feature mapping φ: maps attributes (original inputs) to features.


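A small sketch of the idea, with a hypothetical feature map φ built from the attributes and their pairwise products:

```python
import numpy as np

# One example of such a feature map (chosen only for illustration).
def phi(x):
    x1, x2, x3 = x
    return np.array([x1, x2, x3, x1 * x2, x2 * x3, x1 * x3])

# Replacing x by phi(x) in the derivation only changes the kernel:
def K(x, y):
    return phi(x) @ phi(y)      # K(x, y) = phi(x)^T phi(y)

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print("K(x, z) =", K(x, z))
```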

Economies
Given φ, we could compute φ(x) and φ(y)
and compute K(x, y) by taking their inner product.
Important observation: often K(x, z) is much cheaper
to compute than φ(x) itself!
Or, more precisely, we start with definitions of kernels
that are inexpensive to compute and are mathematically
rich, and from there deduce what the feature vector is.

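An illustration of this economy, using the quadratic kernel K(x, z) = (x^T z)² as an example (chosen only for illustration): the kernel is an O(d) dot product, while the corresponding feature map φ(x) = vec(x xᵀ) has d² entries.

```python
import numpy as np

d = 1000
rng = np.random.default_rng(1)
x, z = rng.normal(size=d), rng.normal(size=d)

K_direct = (x @ z) ** 2                    # cheap: one length-d dot product
phi = lambda v: np.outer(v, v).ravel()     # expensive: d^2 = 1,000,000 features
K_via_phi = phi(x) @ phi(z)

print(np.isclose(K_direct, K_via_phi))     # True: same kernel value either way
```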

Soft Margin Classifier


So far we have assumed that the data are linearly separable in the
feature space φ(x).
The SVM algorithm will then give exact separation in the
original input space x, even though the decision boundary
may be nonlinear.
In practice, class-conditional distributions may overlap;
exact separation can lead to poor generalization.
Example:


Bishop, Figure 7.2

Binary classification in two dimensions; results from an SVM using a Gaussian kernel.
The lines are contours of the discriminant f(x): the decision boundary and the
margin boundaries.


Slacks
Allow the SVM to misclassify some points.
For each training point (constraint) define a slack ξ_i ≥ 0
such that ξ_i = 0 if the point is correctly classified,
and ξ_i = | y_i − (w^T x_i + b) | > 0 for misclassified points.
Recall that the primal problem is
arg min_{w,b} ½‖w‖² subject to y_i (w^T x_i + b) − 1 ≥ 0

Relax the constraints to
y_i (w^T x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0
but we want the ξ_i to be as small as possible. Change the objective to:
arg min_{w,b} ½‖w‖² + C Σ_{i=1}^m ξ_i


New optimization problem


arg min_{w,b,ξ} ½‖w‖² + C Σ_{i=1}^m ξ_i
subject to y_i (w^T x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0

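A sketch of this soft-margin primal, again assuming cvxpy and a small synthetic (overlapping) data set:

```python
import numpy as np
import cvxpy as cp

# Soft-margin primal on toy data with class overlap.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1.0, (30, 2)), rng.normal(-1, 1.0, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])
m, C = X.shape[0], 1.0               # C trades margin width against total slack

w, b, xi = cp.Variable(2), cp.Variable(), cp.Variable(m)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("total slack =", xi.value.sum())
```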

Lagrangian
arg min_{w,b,ξ} ½‖w‖² + C Σ_{i=1}^m ξ_i
subject to y_i (w^T x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0

The Lagrangian is now
L(w, b, ξ, α, μ) = ½‖w‖² + C Σ_{i=1}^m ξ_i − Σ_{i=1}^m α_i [ y_i (w^T x_i + b) − 1 + ξ_i ] − Σ_{i=1}^m μ_i ξ_i

where α_i, μ_i are the multipliers.


The stationarity conditions give

∇_w L(w, b, ξ, α, μ) = 0  ⟹  w = Σ_{i=1}^m α_i y_i x_i
∂L/∂b (w, b, ξ, α, μ) = Σ_{i=1}^m α_i y_i = 0   (as before)
∂L/∂ξ_i (w, b, ξ, α, μ) = 0  ⟹  α_i = C − μ_i   (*)

and the multipliers must satisfy α_i, μ_i ≥ 0.

We can use these equations to eliminate w, b, ξ.
Note that the condition μ_i ≥ 0 together with (*) implies
α_i ≤ C

Dual Problem
max_α L(α) = Σ_{i=1}^m α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
subject to 0 ≤ α_i ≤ C, i = 1,...,m
           Σ_{i=1}^m α_i y_i = 0

This is identical to the separable case except for the upper bound on the α_i.

After solving the dual, w = Σ_{i∈S} α_i y_i x_i, where S is the set of support vectors.

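In practice this box-constrained dual is what standard library solvers handle; a sketch using scikit-learn's SVC (assumed to be installed), whose fitted attributes expose the support set, the products α_i y_i and the bias:

```python
import numpy as np
from sklearn.svm import SVC

# Soft-margin SVM with a linear kernel and C = 1.0 on toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1.0, (30, 2)), rng.normal(-1, 1.0, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vector indices (the set S):", clf.support_)
print("alpha_i * y_i (bounded by C in absolute value):", clf.dual_coef_)
print("bias b:", clf.intercept_)
```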

Bias: just as before


For any support vector that lies exactly on the margin (in the soft-margin case,
one with 0 < α_s < C), y_s (w^T x_s + b) = 1, i.e.
w^T x_s + b = y_s
since y_s = 1 or −1 and (y_s)² = 1.

We can solve for b.

For a more stable computation, we use all such support vectors and take the average:
b = (1/|S|) Σ_{j∈S} [ y_j − w^T x_j ]
  = (1/|S|) Σ_{j∈S} [ y_j − Σ_{i∈S} α_i y_i (x_i)^T x_j ]
  = (1/|S|) Σ_{j∈S} [ y_j − Σ_{i∈S} α_i y_i K(x_i, x_j) ]

Making predictions: same formulae as before


We are now given an (unseen) point x and evaluate the sign of
w^T x + b = [ Σ_{i=1}^m α_i y_i x_i ]^T x + b,   since w = Σ_{i=1}^m α_i y_i x_i
or, more generally,
w^T x + b = Σ_{i=1}^m α_i y_i K(x_i, x) + b
where we have defined K(x, y) = x^T y.

Crucial question: how to choose C?
No simple answer.
Often considered part of model selection.
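One common way to treat C as part of model selection is cross-validated grid search; a sketch assuming scikit-learn and a toy data set:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1.0, (50, 2)), rng.normal(-1, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"])
print("cross-validated accuracy:", search.best_score_)
```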

Supporting figures

