
Support Vector Machines

One of the most useful tools in machine learning


Optimal margin classifier
Kernels
Allow for the use of high dimensional feature vectors

Three blanket assumptions to start:


Data is separable
The discriminant function (prediction function) is linear
Binary classification with labels y_i ∈ {−1, +1}

The discriminant function


Linear discriminant function: f(x) = w^T x + b
w is called the weight vector, b is called the bias
f(x) ≥ 0  ⟹  x is in class 1
f(x) < 0  ⟹  x is in class 2

Observation 1: w is orthogonal to the decision boundary


Proof: consider two points x_1 and x_2 on the decision boundary:
f(x_1) = w^T x_1 + b = 0
f(x_2) = w^T x_2 + b = 0
Subtracting gives w^T (x_1 − x_2) = 0, so w is orthogonal to any vector lying in the
decision boundary; it points in the direction normal to the boundary
(or in the opposite direction)


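A minimal numerical sketch of this slide; the weight vector w, bias b and points below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Linear discriminant: f(x) = w^T x + b (hypothetical w and b for illustration).
w = np.array([2.0, -1.0])   # weight vector (normal to the decision boundary)
b = -0.5                    # bias

def f(x):
    """Linear discriminant function."""
    return w @ x + b

x = np.array([1.0, 0.5])
print("f(x) =", f(x))                    # signed value
print("class:", 1 if f(x) >= 0 else -1)  # f(x) >= 0 -> class 1, f(x) < 0 -> class 2

# Orthogonality check: for two points x1, x2 on the boundary (f = 0),
# w^T (x1 - x2) = 0, so w is orthogonal to the decision boundary.
x1 = np.array([0.0, -0.5])   # f(x1) = 0
x2 = np.array([1.0,  1.5])   # f(x2) = 0
print("w . (x1 - x2) =", w @ (x1 - x2))  # ~0
```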

Distance from a point to the decision boundary


Given a point x, its distance to the line (hyperplane) defined by the
discriminant function is given by
|f(x)| / ‖w‖ = |w^T x + b| / ‖w‖

We consider the minimum distance over all training points:
min_{i=1,...,m} |f(x_i)| / ‖w‖

And we want to find the classifier that maximizes this quantity: the margin (gap).
We impose the constraint that all training points, with labels y_i ∈ {−1, +1},
are correctly classified.
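A small sketch of the distance computation, again with hypothetical w, b and points:

```python
import numpy as np

# Distance from a point x to the hyperplane w^T x + b = 0: |f(x)| / ||w||.
w = np.array([2.0, -1.0])
b = -0.5

def distance(x):
    return abs(w @ x + b) / np.linalg.norm(w)

X = np.array([[1.0, 0.5], [3.0, 1.0], [-1.0, 2.0]])  # toy training points
dists = np.array([distance(x) for x in X])
print("distances:", dists)
print("margin of this classifier (minimum distance):", dists.min())
```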

SVM formulation
Therefore we require y_i f(x_i) > 0 for all i.
The distance from any point x_i to the decision boundary is therefore
y_i f(x_i) / ‖w‖ = y_i (w^T x_i + b) / ‖w‖

The problem of finding the maximum margin classifier is
arg max_{w,b} { min_{i=1,...,m} y_i (w^T x_i + b) / ‖w‖ }

This min-max problem is difficult to solve directly, but it can be recast as a
quadratic optimization problem, as follows.

SVM formulation
Let x_j be a closest point to the decision boundary; then we can assume (!) that
y_j (w^T x_j + b) = 1
(rescaling w and b by a common factor changes neither the decision boundary nor
the classifier, so we are free to fix this scale).

Then for all data points
y_i (w^T x_i + b) ≥ 1

The optimization problem
arg max_{w,b} { min_{i=1,...,m} y_i (w^T x_i + b) / ‖w‖ }
can be rewritten as
arg max_{w,b} 1/‖w‖ subject to y_i (w^T x_i + b) − 1 ≥ 0, i = 1,...,m

Quadratic optimization formulation


The final step is to observe the equivalence
arg max_{w,b} 1/‖w‖ subject to y_i (w^T x_i + b) − 1 ≥ 0
⟺ arg min_{w,b} ½‖w‖² subject to y_i (w^T x_i + b) − 1 ≥ 0

This is a standard problem that can be solved with a quadratic programming solver
(interior point or active set). But if we stop here we will fail to gain one of the
most important insights in machine learning, and we will not have a powerful approach.
The key is to look at the dual problem.
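A sketch of solving this quadratic program directly, assuming the cvxpy package and a small synthetic separable data set:

```python
import numpy as np
import cvxpy as cp

# Hard-margin primal on toy separable data (two well-separated Gaussian blobs).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()
# minimize (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin = 1/||w|| =", 1.0 / np.linalg.norm(w.value))
```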

Optimality conditions
Consider the optimization problem
min_x f(x) subject to c_i(x) ≥ 0, i = 1,...,m

Define the Lagrangian:
L(x, λ) = f(x) − Σ_{i=1}^m λ_i c_i(x)

Any solution of the nonlinear optimization problem satisfies

∇_x L(x*, λ) = ∇f(x*) − Σ_{i=1}^m λ_i ∇c_i(x*) = 0   (stationarity)
λ_i c_i(x*) = 0                                      (complementarity)
λ_i ≥ 0, c_i(x*) ≥ 0                                 (feasibility)

By complementarity, all multipliers λ_i corresponding to inactive constraints
(i.e. c_i(x*) > 0) are zero.
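A tiny worked example of these conditions; the problem below is hypothetical and chosen only to illustrate stationarity, complementarity and feasibility:

```python
# Worked KKT check for:  minimize f(x) = (x - 2)^2
#                        subject to c1(x) = x >= 0,  c2(x) = 1 - x >= 0.
# The solution is x* = 1, where c2 is active and c1 is inactive.
x_star = 1.0
c1, c2 = x_star, 1.0 - x_star          # c1(x*) = 1 > 0 (inactive), c2(x*) = 0 (active)
grad_f = 2.0 * (x_star - 2.0)          # f'(x*) = -2
grad_c1, grad_c2 = 1.0, -1.0

lam1 = 0.0                             # complementarity: c1 inactive => multiplier is 0
lam2 = 2.0                             # from stationarity: -2 - lam1*1 - lam2*(-1) = 0

print("stationarity:", grad_f - lam1 * grad_c1 - lam2 * grad_c2)  # 0.0
print("complementarity:", lam1 * c1, lam2 * c2)                   # 0.0 0.0
print("feasibility:", c1 >= 0 and c2 >= 0 and lam1 >= 0 and lam2 >= 0)
```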

For our problem

arg min_{w,b} ½‖w‖² subject to y_i (w^T x_i + b) − 1 ≥ 0, i = 1,...,m

The Lagrangian is
L(w, b, α) = ½‖w‖² − Σ_{i=1}^m α_i [ y_i (w^T x_i + b) − 1 ]

Stationarity of the Lagrangian:
∇_w L(w, b, α) = w − Σ_{i=1}^m α_i y_i x_i = 0  ⟹  w = Σ_{i=1}^m α_i y_i x_i
∂L/∂b (w, b, α) = −Σ_{i=1}^m α_i y_i = 0

Once the multipliers are computed we obtain w

Recall:
L(w, b, α) = ½‖w‖² − Σ_{i=1}^m α_i [ y_i (w^T x_i + b) − 1 ]
w = Σ_{i=1}^m α_i y_i x_i,   Σ_{i=1}^m α_i y_i = 0

Substituting:
L(w, b, α) = ½‖w‖² − Σ_{i=1}^m α_i y_i (w^T x_i) − b Σ_{i=1}^m α_i y_i + Σ_{i=1}^m α_i
           = ½‖w‖² − Σ_{i=1}^m α_i y_i (w^T x_i) − 0 + Σ_{i=1}^m α_i
           = ½‖w‖² − Σ_{i=1}^m α_i y_i (Σ_{j=1}^m α_j y_j x_j)^T x_i + Σ_{i=1}^m α_i
           = Σ_{i=1}^m α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j

Dual Problem
max_α L(α) = Σ_{i=1}^m α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
subject to α_i ≥ 0, i = 1,...,m
           Σ_{i=1}^m α_i y_i = 0

By complementarity,
α_i [ y_i (w^T x_i + b) − 1 ] = 0
so for all training points that are not support vectors, α_i = 0.

After solving the dual, w = Σ_{i∈S} α_i y_i x_i, where S is the set of support vectors.

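A sketch of solving this dual numerically, again assuming cvxpy and a small synthetic separable data set; the support vectors are the points with α_i > 0:

```python
import numpy as np
import cvxpy as cp

# Dual of the hard-margin SVM on toy separable data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
m = X.shape[0]

alpha = cp.Variable(m)
Yx = y[:, None] * X   # rows are y_i x_i, so ||Yx.T @ alpha||^2 = sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Yx.T @ alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
S = np.where(a > 1e-4)[0]            # support vectors: alpha_i > 0 (up to solver tolerance)
w = (a[S] * y[S]) @ X[S]             # w = sum_{i in S} alpha_i y_i x_i
print("support vector indices:", S)
print("w =", w)
```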

Bias
For any support vector, y_s (w^T x_s + b) = 1, i.e.
w^T x_s + b = y_s
since y_s = 1 or −1 and (y_s)² = 1.

We can solve for b.

For a more stable computation, we use all support vectors and take the average:
b = (1/|S|) Σ_{j∈S} [ y_j − w^T x_j ]
  = (1/|S|) Σ_{j∈S} [ y_j − Σ_{i∈S} α_i y_i (x_i)^T x_j ]
  = (1/|S|) Σ_{j∈S} [ y_j − Σ_{i∈S} α_i y_i K(x_i, x_j) ]
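A sketch of this averaged bias computation; the function names, and the assumption that the multipliers α, the support set S and a kernel are already available, are for illustration only:

```python
import numpy as np

def linear_kernel(u, v):
    return u @ v                          # K(x, y) = x^T y

def bias(X, y, alpha, S, kernel=linear_kernel):
    """b = (1/|S|) sum_{j in S} [ y_j - sum_{i in S} alpha_i y_i K(x_i, x_j) ]"""
    total = 0.0
    for j in S:
        total += y[j] - sum(alpha[i] * y[i] * kernel(X[i], X[j]) for i in S)
    return total / len(S)
```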

Making predictions
Suppose that we have computed our prediction function.
We are now given an (unseen) point x and compute
w^T x + b = [ Σ_{i=1}^m α_i y_i x_i ]^T x + b,   since w = Σ_{i=1}^m α_i y_i x_i
          = Σ_{i=1}^m α_i y_i K(x_i, x) + b
where we have defined K(x, y) = x^T y.

In summary, after computing the multipliers, most of which are zero, to make a
prediction we need to evaluate the kernel function for every support vector,
and this is the dominant cost.

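A sketch of the resulting prediction rule; the helper names below are hypothetical:

```python
import numpy as np

# Only the support vectors contribute, so the dominant cost is one kernel
# evaluation per support vector.
def predict(x, X_sv, y_sv, alpha_sv, b, kernel=lambda u, v: u @ v):
    """Return sign( sum_{i in S} alpha_i y_i K(x_i, x) + b )."""
    score = sum(a * t * kernel(xi, x) for a, t, xi in zip(alpha_sv, y_sv, X_sv)) + b
    return 1 if score >= 0 else -1
```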

Kernels
So far we have assumed that the feature vector contains simple measurements.
But we may have greater mathematical flexibility if we also consider (e.g.) products:
x = (x_1, x_2, x_3)^T,   φ(x) = (x_1, x_2, x_3, x_1 x_2, x_2 x_3, x_1 x_3, ...)^T
(for example)

Go back and, in the previous derivation, replace x by φ(x).

The only change is:
K(x, y) = φ(x)^T φ(y)

Feature mapping φ: maps attributes (original inputs) to features.


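A small sketch of the idea, with a hypothetical feature map φ built from the attributes and their pairwise products:

```python
import numpy as np

# One example of such a feature map (chosen only for illustration).
def phi(x):
    x1, x2, x3 = x
    return np.array([x1, x2, x3, x1 * x2, x2 * x3, x1 * x3])

# Replacing x by phi(x) in the derivation only changes the kernel:
def K(x, y):
    return phi(x) @ phi(y)      # K(x, y) = phi(x)^T phi(y)

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print("K(x, z) =", K(x, z))
```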

Economies
Given φ, we could compute φ(x) and φ(y)
and compute K(x, y) by taking their inner product.
Important observation: often K(x, z) is much cheaper
to compute than φ(x) itself!
Or, more precisely, we start with definitions of kernels
that are inexpensive to compute and are mathematically
rich, and from there deduce what the feature vector is.

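An illustration of this economy, using the quadratic kernel K(x, z) = (x^T z)² as an example (chosen only for illustration): the kernel is an O(d) dot product, while the corresponding feature map φ(x) = vec(x xᵀ) has d² entries.

```python
import numpy as np

d = 1000
rng = np.random.default_rng(1)
x, z = rng.normal(size=d), rng.normal(size=d)

K_direct = (x @ z) ** 2                    # cheap: one length-d dot product
phi = lambda v: np.outer(v, v).ravel()     # expensive: d^2 = 1,000,000 features
K_via_phi = phi(x) @ phi(z)

print(np.isclose(K_direct, K_via_phi))     # True: same kernel value either way
```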

Soft Margin Classifier


So far we have assumed that the data are linearly separable in the
feature space φ(x).
The SVM algorithm will then give exact separation in the
original input space x, even though the decision boundary
may be nonlinear.
In practice, class-conditional distributions may overlap;
exact separation can lead to poor generalization.
Example:


Bishop, Figure 7.2

Binary classification in two dimensions; results from an SVM using a Gaussian kernel.
The lines are contours of the discriminant f(x): the decision boundary and the
margin boundaries.


Slacks
Allow the SVM to misclassify some points.
For each training point (constraint) define a slack ξ_i ≥ 0
such that ξ_i = 0 if the point is correctly classified,
and ξ_i = | y_i − (w^T x_i + b) | > 0 for misclassified points.
Recall that the primal problem is
arg min_{w,b} ½‖w‖² subject to y_i (w^T x_i + b) − 1 ≥ 0

Relax the constraints to
y_i (w^T x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0
but we want the ξ_i to be as small as possible. Change the objective to:
arg min_{w,b} ½‖w‖² + C Σ_{i=1}^m ξ_i


New optimization problem


arg min_{w,b,ξ} ½‖w‖² + C Σ_{i=1}^m ξ_i
subject to y_i (w^T x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0

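A sketch of this soft-margin primal, again assuming cvxpy and a small synthetic (overlapping) data set:

```python
import numpy as np
import cvxpy as cp

# Soft-margin primal on toy data with class overlap.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1.0, (30, 2)), rng.normal(-1, 1.0, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])
m, C = X.shape[0], 1.0               # C trades margin width against total slack

w, b, xi = cp.Variable(2), cp.Variable(), cp.Variable(m)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("total slack =", xi.value.sum())
```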

Lagrangian
arg min_{w,b,ξ} ½‖w‖² + C Σ_{i=1}^m ξ_i
subject to y_i (w^T x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0

The Lagrangian is now
L(w, b, ξ, α, μ) = ½‖w‖² + C Σ_{i=1}^m ξ_i − Σ_{i=1}^m α_i [ y_i (w^T x_i + b) − 1 + ξ_i ] − Σ_{i=1}^m μ_i ξ_i

where α_i, μ_i are the multipliers.


The stationarity conditions give

∇_w L(w, b, ξ, α, μ) = 0  ⟹  w = Σ_{i=1}^m α_i y_i x_i
∂L/∂b (w, b, ξ, α, μ) = Σ_{i=1}^m α_i y_i = 0   (as before)
∂L/∂ξ_i (w, b, ξ, α, μ) = 0  ⟹  α_i = C − μ_i   (*)

and the multipliers must satisfy α_i, μ_i ≥ 0.

We can use these equations to eliminate w, b, ξ.
Note that the condition μ_i ≥ 0 together with (*) implies
α_i ≤ C

Dual Problem
max_α L(α) = Σ_{i=1}^m α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
subject to 0 ≤ α_i ≤ C, i = 1,...,m
           Σ_{i=1}^m α_i y_i = 0

This is identical to the separable case except for the upper bound on the α_i.

After solving the dual, w = Σ_{i∈S} α_i y_i x_i, where S is the set of support vectors.

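In practice this box-constrained dual is what standard library solvers handle; a sketch using scikit-learn's SVC (assumed to be installed), whose fitted attributes expose the support set, the products α_i y_i and the bias:

```python
import numpy as np
from sklearn.svm import SVC

# Soft-margin SVM with a linear kernel and C = 1.0 on toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1.0, (30, 2)), rng.normal(-1, 1.0, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vector indices (the set S):", clf.support_)
print("alpha_i * y_i (bounded by C in absolute value):", clf.dual_coef_)
print("bias b:", clf.intercept_)
```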

Bias: just as before


For any support vector that lies exactly on the margin (in the soft-margin case,
one with 0 < α_s < C), y_s (w^T x_s + b) = 1, i.e.
w^T x_s + b = y_s
since y_s = 1 or −1 and (y_s)² = 1.

We can solve for b.

For a more stable computation, we use all such support vectors and take the average:
b = (1/|S|) Σ_{j∈S} [ y_j − w^T x_j ]
  = (1/|S|) Σ_{j∈S} [ y_j − Σ_{i∈S} α_i y_i (x_i)^T x_j ]
  = (1/|S|) Σ_{j∈S} [ y_j − Σ_{i∈S} α_i y_i K(x_i, x_j) ]

Making predictions: same formulae as before


We are now given an (unseen) point x and evaluate the sign of
w^T x + b = [ Σ_{i=1}^m α_i y_i x_i ]^T x + b,   since w = Σ_{i=1}^m α_i y_i x_i
or, more generally,
w^T x + b = Σ_{i=1}^m α_i y_i K(x_i, x) + b
where we have defined K(x, y) = x^T y.

Crucial question: how to choose C?
No simple answer.
Often considered part of model selection.
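One common way to treat C as part of model selection is cross-validated grid search; a sketch assuming scikit-learn and a toy data set:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1.0, (50, 2)), rng.normal(-1, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"])
print("cross-validated accuracy:", search.best_score_)
```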

Supporting figures

