The decision boundary is the set of points where
$$f(x) = w^T x + b = 0.$$
For any two points $x_1, x_2$ on the boundary, $w^T(x_1 - x_2) = 0$, so $w$ is orthogonal to the decision boundary.
SVM formulation
Therefore we require $y_i f(x_i) > 0$ for all $i$.
The distance from any point $x_i$ to the decision boundary is therefore
$$\frac{y_i f(x_i)}{\|w\|} = \frac{y_i (w^T x_i + b)}{\|w\|}.$$
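A minimal numeric sketch of this distance formula; all values (`w`, `b`, `X`, `y`) are made-up assumptions for illustration:

```python
# Signed distance y_i (w^T x_i + b) / ||w|| of each point to the boundary.
import numpy as np

w = np.array([2.0, -1.0])                 # hypothetical weight vector
b = 0.5                                   # hypothetical bias
X = np.array([[1.0, 1.0], [-1.0, 2.0]])   # hypothetical training points
y = np.array([1.0, -1.0])                 # labels in {-1, +1}

dist = y * (X @ w + b) / np.linalg.norm(w)
print(dist)   # positive entries correspond to correctly classified points
```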
SVM formulation
Let $x_j$ be a closest point to the decision boundary. Since rescaling $(w, b)$ leaves the boundary unchanged, we can assume (!)
$$y_j (w^T x_j + b) = 1,$$
and hence
$$y_i (w^T x_i + b) \ge 1 \quad \text{for all } i.$$
Can be rewritten as
$$\arg\max_{w,b} \frac{1}{\|w\|} \quad \text{subject to } y_i (w^T x_i + b) - 1 \ge 0,\; i = 1, \dots, m,$$
which is equivalent to minimizing $\frac{1}{2}\|w\|^2$ under the same constraints.
Optimality conditions
Consider the optimization problem
$$\min_x f(x) \quad \text{subject to } c_i(x) \ge 0,\; i = 1, \dots, m.$$
At an optimum $x^*$ with multipliers $\lambda_i$:
$$\nabla f(x^*) - \sum_{i=1}^m \lambda_i \nabla c_i(x^*) = 0 \quad \text{(stationarity)}$$
$$\lambda_i\, c_i(x^*) = 0 \quad \text{(complementarity)}$$
$$\lambda_i \ge 0,\quad c_i(x^*) \ge 0 \quad \text{(feasibility)}$$
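As a tiny worked example (not from the slides): minimize $f(x) = x^2$ subject to $c(x) = x - 1 \ge 0$. Stationarity gives $2x^* - \lambda = 0$ and complementarity gives $\lambda (x^* - 1) = 0$. Taking $\lambda = 0$ would force $x^* = 0$, which is infeasible, so the constraint is active: $x^* = 1$ and $\lambda = 2 \ge 0$.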
The Lagrangian is
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left[ y_i (w^T x_i + b) - 1 \right].$$
Stationarity of the Lagrangian
$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \quad\Longrightarrow\quad w = \sum_{i=1}^m \alpha_i y_i x_i$$
$$\frac{\partial L}{\partial b}(w, b, \alpha) = -\sum_{i=1}^m \alpha_i y_i = 0$$
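As a quick numerical sanity check (a sketch with arbitrary toy data; `alpha`, `X`, `y` are assumptions, not quantities from the slides), the gradient of the Lagrangian with respect to $w$ vanishes at $w = \sum_i \alpha_i y_i x_i$:

```python
# Finite-difference check that grad_w L vanishes at w = sum_i alpha_i y_i x_i.
import numpy as np

X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.0, 0.5], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.3, 0.2, 0.4, 0.1])      # chosen so that alpha @ y == 0

def L(w, b=0.0):
    # L(w, b, alpha) = 1/2 ||w||^2 - sum_i alpha_i [y_i (w^T x_i + b) - 1]
    return 0.5 * w @ w - np.sum(alpha * (y * (X @ w + b) - 1.0))

w_star = (alpha * y) @ X                    # the stationarity condition
eps = 1e-6
grad = np.array([(L(w_star + eps * e) - L(w_star - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
print(grad)                                 # approximately [0, 0]
```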
Recall
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$$
with
$$w = \sum_{i=1}^m \alpha_i y_i x_i, \qquad \sum_{i=1}^m \alpha_i y_i = 0.$$
Substituting,
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^m \alpha_i - b \sum_{i=1}^m \alpha_i y_i - \sum_{i=1}^m \alpha_i y_i (w^T x_i)$$
$$= \frac{1}{2}\|w\|^2 + \sum_{i=1}^m \alpha_i - 0 - \sum_{i=1}^m \alpha_i y_i (w^T x_i)$$
$$= \frac{1}{2}\|w\|^2 + \sum_{i=1}^m \alpha_i - \sum_{i=1}^m \alpha_i y_i \Big( \sum_{j=1}^m \alpha_j y_j x_j \Big)^T x_i$$
$$= \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j (x_i)^T x_j,$$
since $\|w\|^2 = \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i)^T x_j$.
Dual Problem
$$\max_\alpha\; L(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j (x_i)^T x_j$$
subject to
$$\alpha_i \ge 0,\; i = 1, \dots, m, \qquad \sum_{i=1}^m \alpha_i y_i = 0.$$
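A minimal sketch of solving this dual on toy data with a general-purpose constrained optimizer (scipy's SLSQP); in practice dedicated QP or SMO solvers are used, and the data and names here are assumptions:

```python
# Solve the hard-margin SVM dual by minimizing its negation under the
# constraints alpha_i >= 0 and sum_i alpha_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=[(0.0, None)] * m,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
print(alpha, w)
```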
By complementarity,
$$\alpha_i \left[ y_i (w^T x_i + b) - 1 \right] = 0,$$
so either $\alpha_i = 0$ or $y_i (w^T x_i + b) = 1$: the points with $\alpha_i > 0$ lie exactly on the margin and are called support vectors.
Bias
For any support vector $s$, $y_s (w^T x_s + b) = 1$, so
$$w^T x_s + b = y_s,$$
since $y_s = 1$ or $-1$ and hence $(y_s)^2 = 1$. Averaging over the set $S$ of support vectors:
$$b = \frac{1}{|S|} \sum_{j \in S} \left[ y_j - w^T x_j \right]
= \frac{1}{|S|} \sum_{j \in S} \Big[ y_j - \sum_{i \in S} \alpha_i y_i (x_i)^T x_j \Big]
= \frac{1}{|S|} \sum_{j \in S} \Big[ y_j - \sum_{i \in S} \alpha_i y_i K(x_i, x_j) \Big].$$
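Continuing the dual-solver sketch above, the averaged bias (the `1e-6` support-vector threshold is an assumption):

```python
# Average b over the support vectors S: b = (1/|S|) sum_{j in S} (y_j - w^T x_j).
S = alpha > 1e-6                    # support vectors found by the solver above
b = float(np.mean(y[S] - X[S] @ w))
print(b)
```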
Making predictions
Suppose that we have computed our prediction function.
We are now given an (unseen) point x and compute
$$w^T x + b = \Big[ \sum_{i=1}^m \alpha_i y_i x_i \Big]^T x + b, \qquad \text{since } w = \sum_{i=1}^m \alpha_i y_i x_i,$$
or, using a kernel,
$$w^T x + b = \sum_{i=1}^m \alpha_i y_i K(x_i, x) + b.$$
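A minimal sketch of this prediction rule, reusing `alpha`, `b`, `X`, `y` from the sketches above; `kernel` is a hypothetical helper, here just the linear kernel:

```python
def kernel(u, v):
    return u @ v                        # linear kernel: K(x_i, x) = x_i^T x

def predict(x_new):
    # Sign of sum_i alpha_i y_i K(x_i, x_new) + b.
    score = sum(a * t * kernel(xi, x_new)
                for a, t, xi in zip(alpha, y, X)) + b
    return 1 if score > 0 else -1

print(predict(np.array([1.5, 2.0])))    # a point on the positive side
```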
Kernels
So far we have assumed that the feature vector contains simple measurements. But we may have greater mathematical flexibility if we also consider (e.g.) products:
$$x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
\quad\longrightarrow\quad
\phi(x) = \begin{pmatrix} x_1^2 \\ x_1 x_2 \\ x_3^2 \\ \vdots \end{pmatrix}$$
Economies
Given $\phi$, we could compute $\phi(x)$ and $\phi(y)$ and obtain $K(x, y) = \phi(x)^T \phi(y)$ by taking their inner product.
Important observation: often $K(x, z)$ is much cheaper to compute than $\phi(x)$ itself!
Or, more precisely, we start with definitions of kernels that are inexpensive to compute and are mathematically rich, and from there deduce what the feature vector is.
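A minimal sketch of this economy for the quadratic kernel $K(x, z) = (x^T z)^2$, whose feature map lists all pairwise products $x_i x_j$ ($n^2$ features) yet never needs to be materialized:

```python
# Compare the explicit feature map against the cheap kernel evaluation.
import numpy as np

def phi(x):
    return np.outer(x, x).ravel()     # all products x_i x_j, length n^2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

k_explicit = phi(x) @ phi(z)          # O(n^2) work and memory
k_cheap = (x @ z) ** 2                # O(n) work
print(k_explicit, k_cheap)            # both 20.25 here
```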
Slacks
Allow the SVM to misclassify some points. For each training point (constraint) define a slack $\xi_i \ge 0$ such that $\xi_i = 0$ if the point is correctly classified, and $\xi_i = |y_i - (w^T x_i + b)| > 0$ for misclassified points.
Recall that the (hard-margin) primal problem is
$$\arg\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i (w^T x_i + b) - 1 \ge 0.$$
With slacks the constraints are relaxed to
$$y_i (w^T x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0.$$
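At the optimum these constraints make $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$, which is cheap to compute; a one-line sketch reusing the toy `w`, `b`, `X`, `y` from the earlier examples:

```python
# Slack of each training point under the relaxed constraints.
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)        # zero for points at or beyond the margin
```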
Lagrangian
$$\arg\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i$$
subject to $y_i (w^T x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0$. The Lagrangian is
$$L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i \left[ y_i (w^T x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^m \mu_i \xi_i.$$
As before,
$$\nabla_w L(w, b, \xi, \alpha, \mu) = 0 \;\Longrightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^m \alpha_i y_i = 0.$$
In addition,
$$\frac{\partial L}{\partial \xi_i} = 0 \;\Longrightarrow\; \alpha_i = C - \mu_i \quad (*)$$
and since $\mu_i \ge 0$, $(*)$ gives $\alpha_i \le C$.
Dual Problem
$$\max_\alpha\; L(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j (x_i)^T x_j$$
subject to
$$0 \le \alpha_i \le C,\; i = 1, \dots, m, \qquad \sum_{i=1}^m \alpha_i y_i = 0.$$
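Compared with the hard-margin dual-solver sketch earlier, only the bounds on $\alpha$ change; a sketch reusing `neg_dual`, `m`, and `y` from that example, with `C = 1.0` as an assumed value:

```python
# Soft-margin dual: identical objective, but 0 <= alpha_i <= C.
C = 1.0
res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=[(0.0, C)] * m,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
```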
As before, the bias is obtained by averaging over the support vectors:
$$b = \frac{1}{|S|} \sum_{j \in S} \Big[ y_j - \sum_{i \in S} \alpha_i y_i (x_i)^T x_j \Big]
= \frac{1}{|S|} \sum_{j \in S} \Big[ y_j - \sum_{i \in S} \alpha_i y_i K(x_i, x_j) \Big].$$
$$w^T x + b = \Big[ \sum_{i=1}^m \alpha_i y_i x_i \Big]^T x + b, \qquad \text{since } w = \sum_{i=1}^m \alpha_i y_i x_i,$$
or more generally
$$w^T x + b = \sum_{i=1}^m \alpha_i y_i K(x_i, x) + b.$$
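In practice an off-the-shelf implementation such as scikit-learn's `SVC` solves exactly this soft-margin kernel formulation; a minimal usage sketch (data names reused from the toy examples above):

```python
from sklearn.svm import SVC

clf = SVC(C=1.0, kernel="rbf")     # C is the slack penalty above
clf.fit(X, y)
print(clf.predict([[1.5, 2.0]]))
```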
Supporting figures