Kernel Methods
Sargur Srihari
Machine Learning
Memory-Based Methods
Training data points are used in the prediction phase
Examples of such methods:
Parzen probability density model: a linear combination of kernel functions centered on each training data point
Kernel Functions
Many linear parametric models can be re-cast into equivalent dual representations in which predictions are based on a kernel function evaluated at the training points
The kernel function is given by

k(x, x') = φ(x)^T φ(x')

where φ(x) is a fixed nonlinear feature-space mapping (basis function)
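As a quick numerical sketch (using a hypothetical polynomial feature map φ_i(x) = x^i, not one from the slides), the kernel can be evaluated directly from the basis functions:

```python
import numpy as np

# Hypothetical feature map: polynomial basis functions phi_i(x) = x**i, i = 0..M-1
def phi(x, M=3):
    return np.array([x**i for i in range(M)], dtype=float)

def k(x, xp, M=3):
    # k(x, x') = phi(x)^T phi(x')
    return float(phi(x, M) @ phi(xp, M))

# phi(2) = [1, 2, 4], phi(3) = [1, 3, 9], so k(2, 3) = 1 + 6 + 36 = 43
print(k(2.0, 3.0))
```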
Kernels are used widely:
in support vector machines
in developing the non-linear variant of PCA
in the kernel Fisher discriminant
Dual Representation
Linear models for regression and classification can be reformulated in terms of a dual representation, in which the kernel function arises naturally
Consider the regularized sum-of-squares error

J(w) = (1/2) Σ_{n=1}^N {w^T φ(x_n) − t_n}^2 + (λ/2) w^T w

where φ(x) is the set of M basis functions, or feature vector
Setting the gradient of J(w) with respect to w to zero gives

w = −(1/λ) Σ_{n=1}^N {w^T φ(x_n) − t_n} φ(x_n) = Σ_{n=1}^N a_n φ(x_n) = Φ^T a

The solution for w is a linear combination of the vectors φ(x_n), whose coefficients are functions of w, where

a_n = −(1/λ) {w^T φ(x_n) − t_n}

and Φ is the design matrix whose nth row is given by φ(x_n)^T:

Φ = [ φ_0(x_1)  …  φ_{M−1}(x_1)
         …      …       …
      φ_0(x_n)  …  φ_{M−1}(x_n)
         …      …       …
      φ_0(x_N)  …  φ_{M−1}(x_N) ]   is an N × M matrix
Transformation from w to a
Thus we have w = Φ^T a
Instead of working with the parameter vector w, we can reformulate the least squares algorithm in terms of the parameter vector a, giving rise to a dual representation, with

a_n = −(1/λ) {w^T φ(x_n) − t_n}
Define the Gram matrix K = ΦΦ^T, with elements K_nm = φ(x_n)^T φ(x_m) = k(x_n, x_m):

K = [ k(x_1, x_1)  …  k(x_1, x_N)
           …       …       …
      k(x_N, x_1)  …  k(x_N, x_N) ]

Notes:
Φ is N × M and K is N × N
K is a matrix of similarities of pairs of samples (thus it is symmetric)
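These shapes and the symmetry can be checked with a small sketch (the feature map below is an arbitrary illustration, not one from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))           # N = 5 samples, 2-D inputs

def phi(x):
    # illustrative feature map with M = 4 components: [1, x1, x2, x1*x2]
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

Phi = np.array([phi(x) for x in X])   # design matrix, N x M
K = Phi @ Phi.T                       # Gram matrix, N x N

print(Phi.shape, K.shape)             # (5, 4) (5, 5)
print(np.allclose(K, K.T))            # True: K_nm = k(x_n, x_m) is symmetric
```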
Substituting w = Φ^T a into the error function

J(w) = (1/2) Σ_{n=1}^N {w^T φ(x_n) − t_n}^2 + (λ/2) w^T w

gives

J(a) = (1/2) a^T Φ Φ^T Φ Φ^T a − a^T Φ Φ^T t + (1/2) t^T t + (λ/2) a^T Φ Φ^T a

where t = (t_1, .., t_N)^T

The sum-of-squares error function is written in terms of the Gram matrix as

J(a) = (1/2) a^T K K a − a^T K t + (1/2) t^T t + (λ/2) a^T K a

Solving for a by combining w = Φ^T a and a_n = −(1/λ) {w^T φ(x_n) − t_n}:

a = (K + λ I_N)^{−1} t
Prediction Function
Substituting w = Φ^T a back into the linear regression model, the prediction for a new input x is

y(x) = w^T φ(x) = a^T Φ φ(x) = k(x)^T (K + λ I_N)^{−1} t

where k(x) has elements k_n(x) = k(x_n, x)
Constructing Kernels
To exploit kernel substitution we need valid kernel functions
First method: choose a feature space mapping φ(x) and use it to find the corresponding kernel
For a one-dimensional input space:

k(x, x') = φ(x)^T φ(x') = Σ_{i=1}^M φ_i(x) φ_i(x')
[Figure: basis functions φ(x) (Gaussian and logistic sigmoid) and the corresponding kernel functions k(x, x') = φ(x)^T φ(x'); the red cross marks x']
Consider the kernel k(x, z) = (x^T z)^2 in a two-dimensional input space. Expanding the square,

k(x, z) = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = φ(x)^T φ(z)

so the feature mapping takes the form φ(x) = (x_1^2, √2 x_1 x_2, x_2^2)^T
It comprises all second-order terms, with a specific weighting
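A quick check of this identity at arbitrary points:

```python
import numpy as np

def phi(x):
    # feature map for the quadratic kernel (x^T z)^2 in 2-D
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
# (x^T z)^2 = 11^2 = 121 matches phi(x)^T phi(z)
print(np.isclose((x @ z)**2, phi(x) @ phi(z)))  # True
```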
Mercer's theorem: any continuous, symmetric, positive semidefinite kernel function k(x, y) can be expressed as a dot product in a high-dimensional space
Gaussian Kernel
A commonly used kernel is

k(x, x') = exp(−||x − x'||^2 / 2σ^2)

It is seen to be a valid kernel by expanding the square

||x − x'||^2 = x^T x + (x')^T x' − 2 x^T x'

to give

k(x, x') = exp(−x^T x / 2σ^2) exp(x^T x' / σ^2) exp(−(x')^T x' / 2σ^2)

This follows from kernel construction rules (2) and (4), together with the validity of the linear kernel k(x, x') = x^T x'
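The factorisation can be confirmed numerically (σ and the two points below are arbitrary):

```python
import numpy as np

def gauss_direct(x, xp, sigma=1.0):
    d = x - xp
    return np.exp(-(d @ d) / (2 * sigma**2))

def gauss_factored(x, xp, sigma=1.0):
    # uses ||x - x'||^2 = x^T x + x'^T x' - 2 x^T x'
    s2 = sigma**2
    return (np.exp(-(x @ x) / (2 * s2))
            * np.exp((x @ xp) / s2)
            * np.exp(-(xp @ xp) / (2 * s2)))

x = np.array([0.5, -1.0])
xp = np.array([1.5, 0.25])
print(np.isclose(gauss_direct(x, xp), gauss_factored(x, xp)))  # True
```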
Consider the kernel over sets

k(A1, A2) = 2^{|A1 ∩ A2|}

Example: A = {1, 2, 3, 4, 5}
A1 = {2, 3, 4, 5}, A2 = {1, 2, 4, 5}
A1 ∩ A2 = {2, 4, 5}
Hence k(A1, A2) = 2^3 = 8
What are feature vectors φ(A1) and φ(A2) such that φ(A1)^T φ(A2) = 8?
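One standard answer, sketched here: give φ(A_i) one binary component per subset U of A, equal to 1 when U ⊆ A_i. The inner product then counts the subsets contained in both sets, i.e. the subsets of A1 ∩ A2, of which there are 2^{|A1 ∩ A2|}:

```python
import numpy as np
from itertools import combinations

A = [1, 2, 3, 4, 5]
# one feature per subset U of A: phi_U(Ai) = 1 if U is a subset of Ai
subsets = [frozenset(c) for r in range(len(A) + 1)
           for c in combinations(A, r)]

def phi(Ai):
    s = set(Ai)
    return np.array([1.0 if U <= s else 0.0 for U in subsets])

A1, A2 = {2, 3, 4, 5}, {1, 2, 4, 5}
print(2 ** len(A1 & A2))   # k(A1, A2) = 2^3 = 8
print(phi(A1) @ phi(A2))   # 8.0
```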
Fisher Kernel
An alternative technique for using generative models
Used in document retrieval, protein sequences, document recognition
The Fisher score is the gradient of the log-likelihood with respect to the parameters θ of a generative model: g(θ, x) = ∇_θ ln p(x|θ)
The Fisher information matrix F = E_x[ g(θ, x) g(θ, x)^T ] is the covariance matrix of the Fisher scores
So the Fisher kernel is

k(x, x') = g(θ, x)^T F^{−1} g(θ, x')
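As a minimal sketch, take a hypothetical one-dimensional Gaussian p(x|μ) with known σ, where both the score and the Fisher information have closed forms:

```python
# Fisher kernel sketch for p(x|mu) = N(x | mu, sigma^2) with known sigma:
# score  g(mu, x) = d/dmu ln p(x|mu) = (x - mu) / sigma^2
# Fisher information F = E[g^2] = 1 / sigma^2
mu, sigma = 0.0, 2.0

def score(x):
    return (x - mu) / sigma**2

F = 1.0 / sigma**2

def fisher_kernel(x, xp):
    # k(x, x') = g(mu, x) F^{-1} g(mu, x')
    return score(x) * (1.0 / F) * score(xp)

# reduces to (x - mu)(x' - mu) / sigma^2
print(fisher_kernel(1.0, 3.0))  # 0.75
```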
Sigmoidal Kernel
Provides a link between SVMs and neural networks

k(x, x') = tanh(a x^T x' + b)

Its Gram matrix is not positive semidefinite
But it is used in practice because it gives SVMs a superficial resemblance to neural networks
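The failure of positive semidefiniteness is easy to exhibit; in this sketch (arbitrary choice a = 1, b = −1 and two 1-D points) the 2 × 2 Gram matrix has a negative eigenvalue:

```python
import numpy as np

def sigmoid_kernel(x, xp, a=1.0, b=-1.0):
    return np.tanh(a * x * xp + b)

X = [0.0, 2.0]
K = np.array([[sigmoid_kernel(xi, xj) for xj in X] for xi in X])
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() < 0)  # True: not positive semidefinite
```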