Dissertation
By
Di You, M.S.
2011
Dissertation Committee:
Aleix M. Martinez, Adviser
Yuan F. Zheng
Yoonkyung Lee
© Copyright by
Di You
2011
ABSTRACT
Kernel methods have been extensively studied in pattern recognition and machine
learning over the last decade, and they have been successfully used in a variety of
applications. By mapping the data into a kernel space, nonlinear problems in
classification and regression can be efficiently solved using classical linear approaches.
The performance of kernel methods greatly depends on the selected kernel model. The
model is defined by the kernel mapping and its parameters. Different models result in
very different performances, and selecting a good model for the data at hand remains
an open problem. In this dissertation, we propose several approaches to address this
problem. Our approaches can determine good learning models by optimizing both the
kernels and all other parameters in the kernel-based algorithms.
In classification, we first derive a criterion to select a kernel under which the class
distributions become linearly separable in the kernel space. The idea is to enforce the
homoscedasticity and separability of the pairwise class distributions simultaneously in
the kernel space. A second criterion selects the kernel parameters by directly
minimizing the Bayes classification error over different kernel mappings.
In regression, we define the model fit and the model complexity as two objective
functions and develop an algorithm to obtain the Pareto-optimal solutions which balance
the trade-off between the model fit and model complexity. We show how the proposed
method applies to several kernel-based regression algorithms.
In our final algorithm, the kernel matrix is recursively learned with genetic algorithms,
without the need to pre-specify a kernel function. We also derive a family of adaptive
kernels to better fit the data with various densities and show their advantages.
Extensive experimental results demonstrate that the proposed approaches are superior
to the state of the art.
To my parents and my wife
ACKNOWLEDGMENTS
First of all, I greatly thank my advisor, Dr. Aleix M. Martinez, for his guidance,
support, and patience throughout my PhD work. I have learned a lot from him, including
a rigorous scientific attitude, methods for doing good research, and the spirit of a researcher.
I also would like to thank all my friends and my labmates: Onur Hamsici, Hongjun
Jia, Liya Ding, Paulo Gotardo, Samuel Riveras, Fabian Benitez-Quiroz, Shichuan Du,
Yong Tao, and Felipe Giraldo. I benefited a lot from the many discussions with them.
I am deeply grateful to my parents, who have given me endless love, care, and support
so that I could finish this long and difficult process. I am also grateful to my wife for
her love and encouragement.
VITA
PUBLICATIONS
Research Publications
D. You and A. M. Martinez. Kernel Matrix Learning with Genetic Algorithm.
Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence.
FIELDS OF STUDY
TABLE OF CONTENTS
Page
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapters:
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 The metrics of discriminant analysis . . . . . . . . . . . . . . . . . 24
2.3 Homoscedastic criterion . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.1 Maximizing homoscedasticity . . . . . . . . . . . . . . . . . 30
2.3.2 Derivation of the Gradient . . . . . . . . . . . . . . . . . . . 38
2.3.3 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4 Kernel Bayes accuracy criterion . . . . . . . . . . . . . . . . . . . . 42
2.4.1 Bayes accuracy in the kernel space . . . . . . . . . . . . . . 43
2.4.2 Kernel parameters with gradient ascent . . . . . . . . . . . 45
2.4.3 Subclass extension . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.4 Optimal subclass discovery . . . . . . . . . . . . . . . . . . 47
2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.5.1 Homoscedastic criterion . . . . . . . . . . . . . . . . . . . . 50
2.5.2 KBA criterion . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2.1 Generalization error . . . . . . . . . . . . . . . . . . . . . . 66
3.2.2 Model fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2.3 Roughness penalty in RBF . . . . . . . . . . . . . . . . . . 70
3.2.4 Polynomial kernel . . . . . . . . . . . . . . . . . . . . . . . 72
3.2.5 Comparison with other complexity measures . . . . . . . . . 73
3.3 Multiobjective Optimization . . . . . . . . . . . . . . . . . . . . . . 76
3.3.1 Pareto-Optimality . . . . . . . . . . . . . . . . . . . . . . . 76
3.3.2 The ε-constraint approach . . . . . . . . . . . . . . . . . . . 77
3.3.3 The modified ε-constraint . . . . . . . . . . . . . . . . . . . 79
3.3.4 Alternative Optimization Approaches . . . . . . . . . . . . . 83
3.4 Applications to Regression . . . . . . . . . . . . . . . . . . . . . . . 84
3.4.1 Kernel Ridge Regression . . . . . . . . . . . . . . . . . . . . 84
3.4.2 Kernel Principal Component Regression . . . . . . . . . . . 86
3.5 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.5.1 Standard data-sets . . . . . . . . . . . . . . . . . . . . . . . 90
3.5.2 Comparison with the state of the art . . . . . . . . . . . . . 92
3.5.3 Alternative Optimizations . . . . . . . . . . . . . . . . . . . 95
3.5.4 Comparison with the L2 norm . . . . . . . . . . . . . . . . . 95
3.5.5 Age estimation . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.5.6 Weather prediction . . . . . . . . . . . . . . . . . . . . . . . 97
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.2.3 Window size . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.2.4 A case study . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.3 Kernel Parameter Selection . . . . . . . . . . . . . . . . . . . . . . 112
4.3.1 k-fold cross-validation . . . . . . . . . . . . . . . . . . . . . 112
4.3.2 Kernel Bayes accuracy criterion . . . . . . . . . . . . . . . . 112
4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.4.1 UCI benchmark data-sets . . . . . . . . . . . . . . . . . . . 113
4.4.2 Image databases . . . . . . . . . . . . . . . . . . . . . . . . 114
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
LIST OF TABLES
Table Page
2.6 Recognition rates (%) with nearest neighbor. Bold numbers specify the
top recognition obtained with the three criteria in KSDA and KDA.
An asterisk specifies a statistical significance on the highest recognition
rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.1 Results for KRR. Mean RMSE and standard deviation (in parentheses). . 89
3.2 Results for KPCR. Mean RMSE and standard deviation (in parentheses). . 89
3.7 MAE of the proposed approach and the state of the art in age estimation. 97
4.2 Recognition rates (%) with KBA criterion in UCI data-sets. . . . . . 115
4.6 Recognition rates (%) with KBA criterion in PIE database. . . . . . . 118
5.2 KDA Recognition rates (in percentages) in the UCI data-sets. . . . . 142
5.4 Average training time (in seconds) of each algorithm in the UCI data-sets. 144
5.7 Average training time (in seconds) of each algorithm in large data-sets. 147
5.10 MAE of the proposed approach and the state-of-the-art in age estimation. 154
LIST OF FIGURES
Figure Page
1.1 This figure illustrates the idea of kernel methods. The data in the original
space is nonlinearly separable. Using a mapping function φ(·), the data can
be mapped to a higher dimensional space where the data becomes linearly
separable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Here we show an example of two kernel mappings. (a) The Gaussian RBF
kernel. σ is the kernel parameter. The kernel value measuring the sample
similarity on x is determined by the nearby samples of x. (b) The polynomial
kernel. d is the degree of the kernel. The kernel value measuring the sample
similarity on x is determined by all the samples. . . . . . . . . . . . . . 9
2.1 Three examples of the use of the homoscedastic criterion, Q1 . The examples
are for two Normal distributions with equal covariance matrix up to scale
and rotation. (a) The value of Q1 decreases as the angle θ increases. The
2D rotation between the two distributions is on the x axis. The value of Q1
is on the y axis. (b) When θ = 0°, the two distributions are homoscedastic,
and Q1 takes its maximum value of .5. Note how for distributions that are
close to homoscedastic (i.e., θ ≈ 0°), the value of the criterion remains high.
(c) When θ = 45°, the value has decreased to about .4. (d) By θ = 90°, Q1 ≈ .3. 33
2.2 Here we show a two class classification problem with multi-modal class dis-
tributions. When σ = 1, both KDA (a) and KSDA (b) generate solutions
that have small training error. (c) However, when the model complexity is
small, σ = 3, KDA fails. (d) KSDA's solution resolves this problem with
piecewise smooth, nonlinear classifiers. . . . . . . . . . . . . . . . . . . . 41
2.3 The original data distributions are mapped to different kernel spaces via
different mapping functions φ(·). φ₂(·) is better than φ₁(·) in terms of the
Bayes error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4 Comparative results between the (a-d) KBA and (e-h) Fisher criteria. The
true underlying number of subclasses per class are (a,e) 2, (b,f) 3, (c,g) 4,
and (d,h) 5. The x-axis specifies the number of subclasses Hi . The y-axis
shows the value of the criterion given in (2.12) in (a-d) and of the Fisher
criterion in (e-h). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5 (a) The classical XOR classification problem. (b) Plot of the KBA criterion
versus Hi . (c) Plot of the Fisher criterion. . . . . . . . . . . . . . . . . . 49
2.6 Shown here are (a) 8 categories in ETH-80 database and (b) 10 different
objects for the cow category. . . . . . . . . . . . . . . . . . . . . . . . 51
2.7 Plots of the value of the derived criterion as a function of the kernel param-
eter and the number of subclasses. From left to right and top to bottom:
AR, ETH-80, Monk 1, and Ionosphere databases. . . . . . . . . . . . . . 60
3.1 The two plots in this figure show the contradiction between the RSS and
the curvature measure with respect to: (a) the kernel parameter σ, and (b)
the regularization parameter λ in Kernel Ridge Regression. The Boston
Housing data-set [7] is used in this example. Note that in both cases, while
one criterion increases, the other decreases. Thus, a compromise between
the two criteria ought to be determined. . . . . . . . . . . . . . . . . . . 72
3.2 Here we show a case of two objective functions. u(S) represents the set
of all the objective vectors with the Pareto frontier colored in red. The
Pareto-optimal solution can be determined by minimizing u1 given that
u2 is upper-bounded by ε. . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.3 Comparison between the proposed modified and the original ε-constraint
methods. We have used * to indicate the objective vector and o to spec-
ify the solution vector. Solutions given by (a) the ε-constraint method
and (b) the proposed modified ε-constraint approach on the first exam-
ple, and (c) the ε-constraint method and (d) the modified ε-constraint ap-
proach on the second example. Note that the proposed approach identifies
the Pareto-frontier, while the original algorithm identifies weakly Pareto-
solutions, since the solution vectors go beyond the Pareto-frontier. . . . . 82
3.4 Sample images showing the same person at different ages. . . . . . . . . . 97
3.5 This figure plots the estimated (lighter dashed curve) and actual (darker
dashed curve) maximum daily temperature for a period of more than 200
days. The estimated results are given by the algorithm proposed in this
chapter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.1 A two class example. Each class is represented by a mixture of two Gaussians
with different covariance matrices. The RBF and the proposed Local-density
Adaptive (LA) kernels are evaluated on the four marked points. (a)
Density estimation in the RBF kernel uses a fixed window, illustrated by
black circles. Note that this fixed window cannot capture different local
densities. (b) Density estimation with the proposed LA kernel. . . . . . . 102
4.2 This figure illustrates how the local variance measurement given by (4.7)
is used. The axis represents the magnitude of the variance around each
sample. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.3 (a) A case study with synthetic data simulating the classical XOR problem.
(b) classification accuracies of the proposed LA and RBF kernels under dif-
ferent covariance factors c. The proposed kernel obtains higher classification
accuracies than the RBF as c increases. . . . . . . . . . . . . . . . . . . 111
4.4 Shown here are sample images from PIE data-set. . . . . . . . . . . . . 116
5.1 (a) The classical feature representation. Each entry in the feature vector
codes for a relevant variable in the optimization problem. (b) The proposed
feature representation. Each individual in the population is represented as a
feature vector with coding and non-coding segments. The lower case letters
represent the coding (or gene) sequence used for the calculation of the fitness
function. Consecutive N labels indicate non-coding DNA. . . . . . . . . 124
5.4 This figure illustrates the gene deletion operation for two cases. (a) Only a non-
coding sequence is deleted. (b) A part of a gene is deleted and a new gene is
formed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.5 (a) A XOR data classification problem. Samples drawn as red triangles form one
class and samples drawn as blue circles form the other class. (b) This plot shows the
classification accuracy over the number of generations. . . . . . . . . . . 136
5.6 In this figure we show how the kernel matrix evolves. (a)-(f) illustrate the
kernel matrix in different generations. . . . . . . . . . . . . . . . . . . . 137
5.7 This plot shows the kernel alignment between the learned kernel matrix and
the ideal one over the generations. . . . . . . . . . . . . . . . . . . . . 139
5.8 Plots of the classification accuracy (y-axis) versus number of generations (x-
axis). The plots from (a) to (e) were obtained with different optimization
approaches applied to KDA using monk1 database, and the plots from (f)
to (j) were obtained with different optimization approaches applied to SVM
using breast cancer database. (a) and (f) show the proposed genetic-based
optimization approach. (b) and (g) show the traditional GA algorithm with
crossover and mutation only. (c) and (h) show GA algorithm with transition
operator only. (d) and (i) show GA algorithm with deletion operator only.
(e) and (j) show GA algorithm with insertion operator only. . . . . . . . 145
CHAPTER 1
INTRODUCTION
The goal of pattern recognition is to describe, recognize, classify, and group pat-
terns of interest. While this seems an easy task for humans, such as identifying a
familiar face, it is generally a difficult task for machines to recognize patterns.
Over the decades, extensive research has been conducted in this field and a number
of approaches for pattern recognition have been developed. These pattern recogni-
tion techniques have been widely used in a variety of fields such as computer vision.
Among them, statistical pattern recognition approaches have been the most intensively
studied and employed. In these approaches, statistical methods are used to build the
decision boundary or model the distribution of the
data. Depending on whether the training samples are labeled or unlabeled, statistical
pattern recognition can be divided into two categories: supervised and unsupervised
methods.
Supervised learning builds a model from the objects and their associated labels. If the
labels are discrete, the corresponding problem is called classification; popular classification
approaches include Linear Discriminant Analysis (LDA) [32] and Support Vector Machines (SVMs) [93].
If the labels are continuous, we talk about regression. The least-squares solutions
and their variants [42] (e.g. ridge regression) are popular approaches for regression.
Unsupervised learning seeks to determine how the data are organized. Data represen-
tation (e.g., principal component analysis) and clustering (e.g., k-means) are typical
examples of unsupervised learning.
Among the many approaches to supervised learning that have been developed thus
far, Discriminant Analysis (DA) is one of the earliest and most used techniques in
pattern recognition. It has been used for feature extraction and classification with
broad applications in, for instance, computer vision [32], gene expression analysis [23]
and paleontology [62]. In his ground-breaking work, Fisher [27, 28] derived a DA
approach for the two Normally distributed class problem, $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$,
under the assumption of equal covariance matrices, $\Sigma_1 = \Sigma_2$. Here, $\mu_i$ and $\Sigma_i$ are the
mean feature vector and the covariance matrix of the $i$th class, and $N(\cdot)$ represents the
Normal distribution. This homoscedastic Gaussian assumption
implies that the Bayes (optimal) classifier is linear, which is the reason why we refer
to this algorithm as Linear Discriminant Analysis (LDA). LDA thus provides the
one-dimensional subspace where the Bayes classification error is the smallest in the
two-class homoscedastic case. LDA can also be formulated within
a least-squares framework [78]. In this solution, LDA employs two symmetric, posi-
tive semi-definite matrices, each defining a metric [63]. One of these metrics should
measure within-class differences and, as such, should be minimized. The other metric
should account for between-class dissimilarity and should thus be maximized. Classi-
cal choices for the first metric are the within-class scatter matrix SW and the sample
covariance matrix $\Sigma_X$, while the second metric is usually given by the between-class
scatter matrix $S_B = \sum_{i=1}^{C} p_i\,(\mu_i - \mu)(\mu_i - \mu)^T$, where $\mu_i$ is the mean of class $i$,
$\mu$ the sample mean, and $p_i = n_i/n$ is the prior of class $i$. LDA's solution is then given by the generalized eigenvalue
decomposition $\Sigma_X^{-1} S_B V = V\Lambda$, where the columns of $V$ are the eigenvectors
and the diagonal entries of $\Lambda$ the corresponding eigenvalues.
The idea of LDA is attractive, because we could obtain a linear classifier with the
smallest classification error (also known as the Bayes error) provided that
the class distributions are single Gaussians and the covariance matrices are identical.
However, in practice, the class distributions can be highly non-Gaussian and distinct
from each other, which makes the assumption of LDA overly restrictive. In other words, if
the real data distributions deviate from this underlying assumption, then LDA will yield
poor classification results.
To relax this assumption, numerous approaches have been proposed in the litera-
ture. Loog and Duin [57] define a within-class similarity metric using the Chernoff
distance, which incorporates the differences of both the means and the covariance matrices,
thus accounting for heteroscedastic class distributions. Another way is to allow each class to be divided into several subclasses
by imposing a mixture of Gaussians for each class distribution. This is the underly-
ing idea of subclass DA (SDA) [116]. Since a mixture of Gaussians is more flexible
to model the underlying class distributions than a single Gaussian, this approach is
shown to perform well for a variety of applications. To loosen the parametric restric-
tion of the above assumption, Fukunaga and Mantock [31] redefine the between-class
scatter matrix in a nonparametric manner, so that the decision boundary is defined
locally. Specifically, a local classifier for each sample is first built based upon the
sample and its local k nearest neighbors, and then the final decision boundary is
obtained by combining these local classifiers.
The classifiers obtained from the above approaches are linear or piecewise linear.
However, such classifiers may not be adequate for a classification problem with a
highly nonlinear decision boundary. This is because the features in such ap-
proaches are extracted from a linear combination of the features in the original space.
In such cases, a nonlinear combination of the original features may
be more appropriate. Recently, kernel methods have been developed to tackle the
nonlinear problem.
Kernel methods have attracted great interest over the past decade and have
shown promise in performing nonlinear feature extraction and classification [84,
93]. The idea is to use a kernel function which maps the original nonlinearly separable
data to a very high or even infinite dimensional space where the data is linearly
separable, see Figure 1.1. Then, any efficient linear classification approach can be
employed in this so-called kernel space. Since the mapping is intrinsic, one does not
need to work with an explicit mapping function. Instead, one can employ the kernel
trick [84], allowing nonlinear formulations to be cast in terms of inner products. This
will result in a space of the same dimensionality as that of the input representation.
Formally, let f(x) denote the
functional relationship between x and y (note that in classification, the class label is
obtained by sgn(f(x)), where sgn(·) is the sign function). f(x) can be modeled as
$$f(x) = \mathbf{w}^T \mathbf{x} + b, \qquad (1.3)$$
where $\mathbf{w}$ is the weight vector and $b$ the bias. However, this linear model
fails to capture the nonlinearity that usually exists in the data. In this case, kernel
methods can help.
Figure 1.1: This figure illustrates the idea of kernel methods. The data in the original
space is nonlinearly separable. Using a mapping function φ(·), the data can be mapped to
a higher dimensional space where the data becomes linearly separable.
Let $\phi(\cdot): \mathbb{R}^p \rightarrow \mathcal{F}$ be a function defining a kernel mapping which maps the data
in the original space to the kernel space defined by $\mathcal{F}$. Then (1.3) can be rewritten
as
$$f(x) = \mathbf{w}^T \phi(x) + b, \qquad (1.4)$$
where w is the weight vector in the kernel space. Unfortunately, the dimensionality
of $\mathcal{F}$ may be too large, which makes it difficult to work with the explicit features
in the kernel space. To bypass this problem, the kernel trick [84] is generally used.
Specifically, from the Representer's Theorem [96], the weight vector $\mathbf{w}$ can be defined
as a linear combination of the samples in the kernel space $\phi(X)$ with the coefficient
vector $\alpha$, i.e.,
$$\mathbf{w} = \phi(X)\alpha, \qquad (1.5)$$
where $\phi(X) = (\phi(x_1), \dots, \phi(x_n))$ and $\alpha \in \mathbb{R}^n$. Substituting (1.5) into (1.4), we get
$$f(x) = \alpha^T \mathbf{k}(x) + b = \sum_{i=1}^{n} \alpha_i \langle x_i, x \rangle + b, \qquad (1.6)$$
where $\langle x_i, x \rangle$ is the inner product of $\phi(x_i)$ and $\phi(x)$, i.e., $\langle x_i, x \rangle = \phi(x_i)^T \phi(x)$,
and $\mathbf{k}(x) = (\langle x_1, x\rangle, \dots, \langle x_n, x\rangle)^T$. We
thus see that the model f(x) just derived is linear in the kernel space but nonlinear
in the original space. In this way, the nonlinearity of the original data is eliminated
and a linear approach can be used in the kernel space.
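To make the kernel trick concrete, here is a minimal sketch that evaluates (1.6) with a Gaussian RBF kernel standing in for the inner product in the kernel space. The data, the coefficients alpha, and the bias b are hypothetical placeholders chosen for illustration, not values prescribed by the text.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / sigma): the inner product <a, b> in the kernel space
    return np.exp(-np.sum((a - b) ** 2) / sigma)

def f(x, X, alpha, b, sigma=1.0):
    # Evaluate f(x) = sum_i alpha_i <x_i, x> + b, as in (1.6), without ever
    # computing the explicit mapping phi(.)
    return sum(a_i * rbf_kernel(x_i, x, sigma) for a_i, x_i in zip(alpha, X)) + b

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])   # hypothetical training samples
alpha = np.array([0.5, -0.3, 0.8])                   # hypothetical dual coefficients
label = np.sign(f(np.array([1.0, 0.5]), X, alpha, b=0.1))  # class label via sgn(f(x))
```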
Kernel Discriminant Analysis (KDA) [5, 67] is one of the most used kernel methods. KDA is a kernel extension
of LDA that maximizes the between-class scatter while minimizing
the within-class scatter of the data in the kernel space. Ideally, if the kernel function
and associated parameters are set appropriately, the class distributions will become
homoscedastic in the kernel space and the smallest classification error (i.e., Bayes
error) can be obtained from the resultant linear Bayes classifier. The performance of
KDA thus depends directly on the selected kernel.
Kernel Support Vector Machine (KSVM) [92] is another kernel approach popularly
used in pattern recognition. Unlike the DA-based approach, KSVM does not make
any assumption on the class distributions. Instead, it is a nonparametric
approach which directly maximizes the margin between the samples defining the two
classes. In general, the larger the margin, the better the generalization performance.
Figure 1.2: Here we show an example of two non-linearly separable class distributions,
each consisting of 3 subclasses. (a) Classification boundary of LDA. (b) SDA's solution. (c)
KDA's solution.
space, thus requiring different learning models. A kernel mapping can be specified
by a kernel function and its parameters, and different kernel functions define different
mappings to the kernel space. For instance, a Gaussian RBF kernel characterizes a
similarity determined by the samples in a local neighborhood, whereas a polynomial
kernel measures similarity using all the samples, see Figure 1.3. An appropriately
selected kernel function may greatly improve the algorithm
performance. However, one usually does not have any prior knowledge of which
kernel function should be used.
Even when a kernel function is determined, the process of selecting the parameters
of the kernel which map the original nonlinear problem to a linear one still remains
a big challenge. Kernel parameters play a significant role in the kernel mapping
process. Each kernel parameter specifies a model for the problem to be solved. Thus,
it is desirable that a model achieve a good bias and variance trade-off, according to
the bias and variance decomposition [42].
Figure 1.3: Here we show an example of two kernel mappings. (a) The Gaussian RBF
kernel. σ is the kernel parameter. The kernel value measuring the sample similarity on x
is determined by the nearby samples of x. (b) The polynomial kernel. d is the degree of
the kernel. The kernel value measuring the sample similarity on x is determined by all the
samples.
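As a small numerical companion to Figure 1.3, the sketch below builds the Gram matrices of the two kernels for a toy data set; sigma and d are assumed parameter values, and the data are random placeholders.

```python
import numpy as np

def rbf_gram(X, sigma):
    # K[i, j] = exp(-||x_i - x_j||^2 / sigma): similarity decays with distance,
    # so the value at x is dominated by nearby samples
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / sigma)

def poly_gram(X, d):
    # K[i, j] = (x_i^T x_j + 1)^d: every sample contributes through the inner product
    return (X @ X.T + 1.0) ** d

X = np.random.randn(5, 3)        # five toy samples in R^3
K_rbf = rbf_gram(X, sigma=2.0)
K_poly = poly_gram(X, d=3)
```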
If the model is made too complex, an
over-fitting to the training data may occur, i.e., low bias and high variance, whereas if
the model is made too simple, it may under-fit the data and will thus not effectively
capture the underlying structure of the data [42], i.e., high bias and low variance.
Unfortunately, without prior knowledge on the data, it is not easy to select good
models. In this dissertation, we study the problem of model selection
in kernel methods and propose several novel approaches to address this problem. We
cast the problem into two typical scenarios: classification and regression. In the
following section, we review the literature on model selection in kernel methods.
1.2 Literature review
Model selection in kernel methods has been a very active and popular research
area. Kernel-based approaches are very powerful due to their high generalization per-
formance and efficiency using the kernel trick. Although promising, a main problem
cannot be circumvented, that is, how to learn a good kernel mapping to adapt to the
data at hand. Different kernel models can result in very different
performances.
In the literature, various approaches for kernel learning have been proposed. Gen-
erally, they can be divided into three classes. The first class of approaches is to learn
the kernel parameters given a parameterized kernel function. In the second class of ap-
proaches, the kernel matrix is learned directly from the data. A representative approach
in this class is multiple kernel learning, where some basis kernels are first built and
then the final kernel is constructed as a linear or nonlinear combination of these basis
kernels. In the third class of approaches, instead of using some traditional kernel
function, some new kernel functions are proposed to specifically tackle the problem
at hand and are expected to perform better. In the following, we give a review of each of these three classes.
One of the most commonly used kernel parameter selection methods is cross-
validation (CV) technique [88, 42]. In this approach, the training data is divided into
k parts: (k − 1) of these are used for training the algorithm with distinct values of the
parameters of the kernel, and the remaining one for validating which of these values
results in higher classification or prediction accuracy. This method has four major
drawbacks. First, the training process needs to be repeated
k times, and the parameter selection is based on an exhaustive search. Second, only
part of the training data is used in each fold. When doing model selection, one wants
to employ the largest possible number of training samples, since this is known to
yield better generalizations [63]. Third, it only selects the parameters from a set of
discrete values and a careful range of the parameters should be pre-specified. Finally,
the selection of k can be an issue, since it affects the trade-off between bias and
variance: if k is small, the model is trained with fewer samples in each fold and
may not capture the underlying structure of the data; if k is large, the model would
have a good chance to overfit the training data and result in a poor generalization
performance.
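For reference, a minimal sketch of this k-fold selection procedure follows. The candidate grid, the value of k, and the callable train_and_score are assumptions made for illustration only.

```python
import numpy as np

def kfold_select(X, y, sigmas, k, train_and_score):
    # Exhaustive grid search over kernel parameters with k-fold cross-validation.
    # train_and_score(X_tr, y_tr, X_va, y_va, sigma) must return a validation accuracy.
    folds = np.array_split(np.random.permutation(len(y)), k)
    best_sigma, best_acc = None, -np.inf
    for sigma in sigmas:
        accs = []
        for i in range(k):                     # the algorithm is retrained k times
            va = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            accs.append(train_and_score(X[tr], y[tr], X[va], y[va], sigma))
        if np.mean(accs) > best_acc:           # keep the value with best average accuracy
            best_sigma, best_acc = sigma, np.mean(accs)
    return best_sigma
```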
Generalized cross-validation (GCV) was originally defined to select the ridge parameter in ridge regression. GCV can be directly ex-
tended to do model selection with kernel approaches, as long as the hat matrix [37],
which projects the original response vector to the estimated one, can be obtained.
However, since GCV behaves similarly to leave-one-out CV (i.e., n-fold
CV, where n is the number of training samples), the estimated result generally has
a large variance, i.e., the learned function is highly variant and dependent on the
training data, since in each fold almost the same data is used to train the model.
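As an illustration of how the hat matrix enters the computation, the sketch below evaluates the classical GCV score for kernel ridge regression, where H(λ) = K(K + λI)⁻¹ maps the responses to the fitted values; the Gram matrix K and the grid of λ values are assumed inputs.

```python
import numpy as np

def gcv_score(K, y, lam):
    # GCV(lam) = n ||(I - H) y||^2 / tr(I - H)^2, with hat matrix H = K (K + lam I)^{-1}
    n = len(y)
    H = K @ np.linalg.inv(K + lam * np.eye(n))
    residual = (np.eye(n) - H) @ y
    return n * (residual @ residual) / np.trace(np.eye(n) - H) ** 2

# The ridge parameter is then chosen by minimizing the score over a grid, e.g.:
# lam_best = min(lams, key=lambda lam: gcv_score(K, y, lam))
```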
Another set of approaches defines criteria based on the idea of the between-within class ratio as Fisher had originally proposed
for LDA [28]. Here, we will refer to this as the Fisher criterion. Wang et al. [98] and
Xiong et al.[108] define such a criterion, which maximizes the between-class scatter
and minimizes the within-class scatter in the kernel space, to optimize the kernel
parameter. This criterion thus maximizes the class separability in the kernel space.
Wang et al. [97] develop another version of the Fisher criterion, defined as the trace
of the ratio between the kernel versions of the between-class scatter matrix and the
within-class scatter matrix (a.k.a. discriminant power). Due to the difficulty of direct
optimization, this criterion is reformulated as a convex optimization problem and then used to find a solution over
a convex set of kernels. Alternatively, Cristianini et al. [17] define the concept of
kernel alignment to capture the agreement between a kernel and the target data. It
is shown how this measure can be used to optimize the kernel. However, Xiong et
al. [108] show that this kernel-target alignment criterion is equivalent to maximizing
the between-class scatter, provided that the kernel matrix has been centralized and
normalized by its Frobenius norm. The major drawback with these criteria is that
they are only based on the measures of class separability. Note that the measure
for class separability is not always related to the classification error. For example,
since the Fisher criterion is based on a least-squares formulation [40], it can easily
over-weight the influence of the classes that are farthest apart [58], i.e., the classifier
will be biased toward separating those classes.
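For completeness, the kernel-target alignment of [17] is simple to compute; the sketch below follows the usual definition of alignment as a normalized Frobenius inner product, assuming binary labels in {-1, +1} for the ideal kernel yyᵀ.

```python
import numpy as np

def alignment(K1, K2):
    # A(K1, K2) = <K1, K2>_F / sqrt(<K1, K1>_F <K2, K2>_F)
    inner = lambda A, B: np.sum(A * B)
    return inner(K1, K2) / np.sqrt(inner(K1, K1) * inner(K2, K2))

def target_alignment(K, y):
    # Alignment between a kernel matrix K and the ideal kernel y y^T
    y = np.asarray(y, dtype=float)
    return alignment(K, np.outer(y, y))
```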
Another group of approaches derives approximations or upper bounds
for the expected generalization error. Optimization schemes are then used to minimize
such approximations to select the kernel parameters. Cristianini et al. [16] optimize
the kernel parameter using an upper bound on the generalization error
provided by the Vapnik-Chervonenkis (VC) theory. This upper bound depends on
the radius of the smallest ball containing the training set in the feature space and
the margin between the two classes. They propose a method to dynamically adjust
the kernel parameter during the SVM learning process to find the optimal kernel
parameter which provides the best possible upper bound on the generalization error.
Chapelle et al. [11] optimize the kernel parameters by minimizing different upper
bounds of the generalization error as well as an estimate of the leave-one-out error, an
almost unbiased estimate of the expected generalization error. The kernel parameters
are optimized by gradient descent methods. However, these approaches have some
limitations. Usually, it is not clear whether these upper bounds are tight enough to
give a good estimate. Moreover, the estimate of the leave-one-out error based on
which the bounds are derived may have high variance [42], which may deteriorate the
estimation. A different group of approaches selects the kernel parameters by maximizing
the marginal data likelihood after reformulating the learning problem as a probabilistic
model. Well-known approaches in this group are the Relevance Vector Machine
(RVM) [91] and the Gaussian processes [79]. RVM uses Bayesian inference to ob-
tain parsimonious solutions for regression and classification. The learning is based
on the Bayesian inference framework. In Gaussian processes, the hyperparameters used in the mean and
covariance functions can be directly estimated by maximizing the marginal data like-
lihood. Gold and Sollich [35] give a probabilistic interpretation of SVM classification,
in which the SVM solution is
then viewed as the maximum a posteriori (MAP) solution of the corresponding prob-
abilistic inference problem. Then, the kernel parameters in SVM are optimized by
maximizing the data likelihood. Glasmachers and Igel [34] propose a likelihood func-
tion of the kernel parameters to robustly estimate the class conditional probabilities
based on logistic regression, and kernel parameters are optimized by the maximiza-
tion of this likelihood function using gradient ascent. A major drawback of these
approaches is that, since the exact posteriors are usually intractable, some approximation
of the posteriors has to be made. This turns out to be difficult and computationally
demanding in practice.
The approaches for kernel parameter learning need to specify a known parame-
terized kernel function. However, given the data at hand, one usually does not have
prior knowledge of which kernel function should be used. Different kernel functions
define different kernel spaces and, hence, yield different performances.
Rather than learning the kernel parameters of a given kernel function, one could try
to directly learn the kernel matrix, which encodes the similarity of all the training
samples.
Liu et al. [54] propose to learn a (so-called) optimal neighborhood kernel matrix by
assuming that the pre-specified kernel matrix generated from the specific application
is a noisy observation of the ideal one. Kernel learning is then based on minimizing
the difference of the pre-specified kernel matrix and the learned one. Yeung et al.
[112] propose a method for learning the kernel matrix based on maximizing a class
separability measure. While a single kernel is known to be not sufficient to describe
the data, multiple kernel learning (MKL) has attracted much attention recently [51].
In MKL, the kernel is constructed as a combination of some
base kernels and the optimal coefficients can be determined by using semidefinite
programming, which optimizes
convex functions over the convex combination of positive semidefinite matrices. Wang
et al. [101] present an alternative approach to MKL. The input data is first mapped
into m different kernel spaces by m different kernel functions and each generated
kernel space is taken as one view of the input space. Then, by using Canonical
Correlation Analysis (CCA), the views are combined to produce the final
outputs. Yet, the selection of the base kernel functions and associated parameters is
still an open issue. MKL has also been extended to Support
Vector Regression (SVR). The coefficients that determine the combination of kernels
are learned using a constrained quadratic programming problem. This method was
shown to improve the regression results. Alternatively, the kernel pa-
rameters are selected by maximizing the marginal data likelihood after reformulating
the regression problem as probabilistic models using Bayesian inference. This ap-
proach has been used to define the well-known Relevance Vector Machine (RVM) [91].
However, the above algorithms for
learning a kernel matrix often scale poorly, with running times that are cubic in the
number of the training samples; thus the application of these algorithms to large-scale
data-sets is limited. Moreover, the multiple kernel learning approach suffers from two
drawbacks. First, no pre-specified combination of kernels is best in all settings; a
combination may perform well in one setting
while others outperform it in different settings. Second, the kernel matrix can only
be searched within the space defined by these pre-specified functions. If the kernels
and their parameters are not appropriately specified, the learned kernel matrix will
not describe the data well.
A main issue of kernel methods is the selection of the kernel functions. Each kernel
function defines a distinct similarity measure, and an
appropriately selected kernel function for a given problem could result in a substantial
performance gain.
Although the popularly used kernels, such as the Gaussian RBF kernel and the
polynomial kernel, have shown successful performance in some applications, they have
some known limitations. For instance, the input sample should be in a vector form.
In some applications, however, the data are given as sequences or sets of vectors,
and each sample could have a different length. A good example for this type of data
is the protein sequence data. Jaakkola et al. [46] propose a Fisher-based kernel to
detect remote protein homologies. A probabilistic model for each protein sequence is
first built, then the Fisher score, which measures the gradient of the log-likelihood of
the model, is used to represent the sequence sample. Then the similarity between the
two sequences is measured by the inner product of the corresponding Fisher scores. A
good feature of the Fisher kernel is that it combines an underlying generative model
and discriminant classifiers (SVM) in the feature space. Similarly, Moreno et al. [69]
propose a kernel in which each sample object is modeled by a probability
distribution, and an intermediate space mapping the object to its probability density
function (pdf) is constructed. The new kernel is evaluated based on the KL divergence
of the two pdfs. Wolf and Shashua [105] derive a more generic kernel for the instances
defined over a space of sets of vectors. Each sample object (a set of vectors) is viewed
as a linear subspace and the kernel is evaluated by measuring the principal angles
between two linear subspaces. This kernel is successfully applied to face recognition
from video.
Some kernels have been developed to be used in some particular applications. For
instance, Odone et al. [72] propose two kernels which are used for images. The images
are first represented as binary strings and then a kernel, as a similarity measure, is
used to operate on them. They further show that the image similarity measures given
by these approaches correspond to valid Mercer
kernels. For text classification, Lodhi et al. [55] propose a string kernel to encode the
similarity between the strings. The kernel is generated by using all the subsequences
of length k. Each subsequence forms a dimension in the feature space and is weighted
by an exponentially decaying factor of its full length in the text, thus emphasizing
contiguous subsequences.
From the literature review of model selection in kernel methods, several important
questions are raised. First, in classification, the original goal of a kernel method is to
find a mapping such that the samples in the kernel space could be linearly classified.
To our surprise, no approach thus far has explicitly solved this problem. In other
words, the classifier in the kernel space is not ensured to be linear. Thus, our goal is
to define a first criterion for kernel optimization such that the linear classifier in the
kernel space corresponds to the Bayes classifier. Second, in regression, model selection
plays a key role in the regression performance. How to achieve a good balance between the
model fit and model complexity remains a big challenge. We propose an approach
for model selection by adopting multiobjective optimization. By doing so, the model
fit is reduced while the model complexity is kept in check. Finally, in the multiple
kernel learning framework, the base kernel functions and their parameters need to be
pre-specified. Is there a way to learn a kernel matrix without specifying an explicit kernel
function? We address this question with an algorithm that learns the kernel matrix
directly from the data.
In summary, this dissertation proposes several novel approaches for model selection
in supervised learning. Our approaches are theoretically justified and have been suc-
cessfully validated experimentally. The main contributions of this dissertation
are as follows:
We develop two criteria to optimize the kernel parameters given a kernel func-
tion based on the idea of Bayes optimality. In the first criterion, kernel pa-
rameters are optimized such that the classification in the kernel space is Bayes
optimal. Thus, this solves the original goal of the kernel mapping: the class
distributions in the kernel space are linearly separable. We achieve this by maximizing
the homoscedasticity and separability of the pairwise class distributions
simultaneously in the kernel space. We further relax the single Gaussian assumption
by allowing each class to be modeled as a mixture of Gaussians. In the second criterion, instead
of searching for a linear classifier, we directly minimize the Bayes error over all
kernel mappings. Specifically, we derive a function to esti-
mate the Bayes accuracy (defined as one minus Bayes error) in the kernel space.
The optimal kernel is then learned by maximizing this Bayes accuracy over all
kernel representations. Both criteria are shown to outperform the state of the art.
We propose a framework for model selection in regression based on multiobjective optimization.
In this framework, model fit and model complexity in the kernel space are first
defined as two objective functions, and an algorithm is developed to obtain the
Pareto-optimal solutions that balance the trade-off between them. We
show that our approach can not only learn the kernel parameters, but those of
the regression algorithm as well. By
doing so, we eliminate the need for defining a unique way of combining different
objectives into a single criterion.
We derive an approach to learn the kernel matrix directly, without pre-specifying
a kernel function. The kernel matrices are iteratively modified with genetic
algorithms until the matrix providing the
smallest classification error is obtained. To map test feature vectors, we define
a mapping consistent with the similarities encoded
by the selected kernel matrix. We provide comparative results against the state
of the art methods including multiple kernel learning and transductive learning.
The results show the superiority of the proposed approach. We further extend
this idea by deriving a family of adaptive kernels that better fit data of varying density. These
kernels measure the sample similarities by taking into account local density
information. Their shape can adaptively vary for different local regions based on a measure of the weighted
local variance, and the shape varies in an implicit way. The proposed
kernels are shown to perform better than the traditional fixed-shape kernels like
the Gaussian RBF.
The rest of this dissertation is organized as follows. The first two criteria are
derived in Chapter 2. Chapter 3 presents the multiobjective optimization framework
for model selection in regression. Chapter 4 introduces the family of density-adaptive
kernels, and Chapter 5 describes the approach for directly learning the kernel
matrix for both classification and regression. Conclusions and future work are given
in Chapter 6.
CHAPTER 2
2.1 Introduction
Discriminant Analysis (DA) is one of the most popular approaches for feature ex-
traction with broad applications in, for example, computer vision and pattern recog-
nition [32], gene expression analysis [63] and paleontology [62]. The problem with
DA algorithms is that each of them makes assumptions on the underlying class dis-
tributions. That is, they assume the class distributions are homoscedastic, $\Sigma_i = \Sigma_j$,
$\forall i, j$. This is rarely the case in practice. To resolve this problem, one can first map
the original data distributions (with unequal covariances) into a space where these
become homoscedastic. This mapping may however result in a space of very large
dimensionality. To prevent this, one usually employs the kernel trick [84, 96]. In the
kernel trick, the mapping is only intrinsic, yielding a space of the same dimensionality
as that of the original representation while still eliminating the nonlinearity of the
data by making the class distributions homoscedastic. This is the underlying idea in
The approach described in the preceding paragraph resolves the problem of nonlin-
early separable Normal distributions, but still assumes each class can be represented
by a single Normal distribution. In theory, this can also be learned by the kernel,
since multimodality includes nonlinearities in the classifier. In practice, however, it
makes the problem of finding the appropriate kernel much more challenging. One way
to add flexibility to the kernel is to allow for each class to be subdivided into several
subclasses. This is the underlying idea behind Subclass DA (SDA) [116]. However,
while SDA resolves the problem of multimodally distributed classes, it assumes that
these subclass divisions are linearly separable. Note that SDA can actually resolve
the nonlinearity of the classes when there exists a subclass division
that results in linearly separable subclasses, yielding a non-linear classifier. The ap-
proach will fail when there is no such division. To resolve this problem, we need to
derive a subclass-based approach that can deal with nonlinearly separable subclasses
[12]. This can be done with the help of a kernel map. In this approach, we need
to find a kernel which maps the subclass division into a linearly separable set. We
refer to this approach as Kernel SDA (KSDA). Note that KSDA has two unknowns:
the number of subclasses and the parameter(s) of the kernel. Hence, finding the
appropriate kernel parameters will generally be easier, a point we will formally show
in this chapter.
The kernel parameters are the ones that allow us to map a nonlinearly separable
problem into a linear one [84]. Surprisingly, to the best of our knowledge, there is
not a single method in kernel DA designed to find the kernel parameters which map
the problem to a space where the class distributions are linearly separable. To date,
the most employed technique is k-fold cross-validation (CV). In CV, one uses a large
percentage of the data to train the kernel algorithm. Then, we use the remaining
(smaller) percentage of the training samples to test how the classification varies when
we use different values in the parameters of the kernel. The parameters yielding the
highest recognition rates are kept. More recently, [98, 49] showed how one can employ
the Fisher criterion (i.e., the maximization of the ratio between the kernel between-
class scatter matrix and the kernel within-class scatter matrix) to select the kernel
parameters that maximize the class separability in the
training set. However, neither of them aims to solve the original goal of the kernel
map to find a space where the class distributions (or the samples of different classes)
can be separated linearly. Moreover, the Fisher criterion is based on the measures
of class separability. Note that the measure for the class separability is not always
related to the classification error.
In this chapter, we propose two approaches to learn the kernel parameters given
a kernel function. In the first approach, we derive a criterion to find the kernel that maps
the original class (or subclass) distributions into a kernel space where these are best
approximated by homoscedastic Gaussians. At the same time, the criterion is designed to
maximize the distance between the distributions of different classes, thus maximizing
generalization. We apply the derived approach to three kernel versions of DA, namely
LDA, Nonparametric DA (NDA) and SDA. We show that the proposed techniques
generally achieve higher classification accuracies than the CV and Fisher criteria
defined in the preceding paragraph. In the second approach, we derive a criterion for
selecting the parameters by minimizing the Bayes classification error. To achieve this,
we define a function measuring the Bayes accuracy (i.e., one minus the Bayes error)
in the kernel space. We then show how this function can be efficiently maximized
using gradient ascent. It should be emphasized that this objective function directly
minimizes the classification error, which makes the proposed criterion very powerful.
We will also illustrate how we can employ the same criterion for the selection of other
parameters in discriminant analysis. In particular, we demonstrate the uses of the
derived criterion in the selection of the kernel parameters and the number of subclasses
in KSDA. Before deriving the criteria, we first review a formulation of DA common
to most variants. We also derive kernel versions for NDA and SDA.
2.2 The metrics of discriminant analysis

DA is a supervised feature extraction technique. Specifi-
cally, its advantage over unsupervised techniques is that it provides the repre-
sentation where the underlying class distributions are best separated. Unfortunately,
due to the number of possible solutions, this goal is not always fulfilled in practice
[63]. With infinite time or computational power, one could always find the optimal
representation by searching over all the possible linear combinations of features, let alone a set of nonlinear com-
binations. This means that one needs to define criteria that can find an appropriate
The least-squares extension of Fisher's criterion [28, 32] is arguably the best
known. In this solution, LDA employs two symmetric, positive semi-definite matri-
ces, each defining a metric [63]. One of these metrics should measure within-class
differences and, as such, should be minimized. The other metric should account for
between-class dissimilarity and should thus be maximized. Classical choices for the
first metric are the within-class scatter matrix SW and the sample covariance matrix
X , while the second metric is usually given by the between-class scatter matrix SB .
The sample covariance matrix is defined as $\Sigma_X = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^T$, where
$X = \{x_1, \dots, x_n\}$ are the $n$ training samples, $x_i \in \mathbb{R}^p$, and $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sam-
ple mean. The between-class scatter matrix is given by $S_B = \sum_{i=1}^{C} p_i\,(\mu_i - \mu)(\mu_i - \mu)^T$,
where $\mu_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$ is the sample mean of class $i$, $x_{ij}$ is the $j$th sample of class $i$,
and $p_i = n_i/n$ is the prior of class $i$. LDA's solution is then given by the generalized eigenvalue de-
composition $\Sigma_X^{-1} S_B V = V\Lambda$, where the columns of $V$ are the eigenvectors
and the diagonal entries of $\Lambda$ the corresponding eigenvalues.
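A minimal numerical sketch of this solution is given below; it builds Σ_X and S_B from labeled data and solves the generalized eigendecomposition. The pseudo-inverse is an implementation convenience for singular Σ_X, not part of the formulation above.

```python
import numpy as np

def lda(X, y):
    # Solve Sigma_X^{-1} S_B V = V Lambda; X is n x p, y holds class labels.
    n, mu = len(X), X.mean(axis=0)
    Sx = (X - mu).T @ (X - mu) / n                       # sample covariance matrix
    Sb = np.zeros_like(Sx)
    for c in np.unique(y):
        Xc = X[y == c]
        d = Xc.mean(axis=0) - mu
        Sb += (len(Xc) / n) * np.outer(d, d)             # between-class scatter
    evals, V = np.linalg.eig(np.linalg.pinv(Sx) @ Sb)
    order = np.argsort(-evals.real)                      # sort by decreasing eigenvalue
    return V.real[:, order], evals.real[order]
```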
To loosen the parametric restriction on the above defined metrics, Fukunaga and
Mantock defined NDA [31], where the between-class scatter matrix is changed to a
non-parametric version, $S_b = \sum_{i=1}^{C}\sum_{j=1, j\neq i}^{C}\sum_{l=1}^{n_i} \alpha_{ijl}\,(x_{il} - \mu_{jil})(x_{il} - \mu_{jil})^T$, where $\mu_{jil}$ is the
sample mean of the $k$-nearest samples to the sample $x_{il}$ that do not belong to class $i$,
and $\alpha_{ijl}$ is a scale factor that deemphasizes large values (i.e., outliers). Alternatively,
a regularizing parameter can be added to the within-class metric,
allowing for the minimization of the generalization error. This regularizing parame-
ter can be learned using CV, yielding the method Regularized DA (RDA). Another
variant of LDA is given by Loog et al. [58], who introduced a weighted version of
the metrics in an attempt to downplay the roles of the class distributions that are
farthest apart. More formally, they noted that the above introduced Fisher criterion
for LDA can be written as $\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\, \omega_{ij}\, \text{tr}\!\left( (V^T S_W V)^{-1} V^T S_{ij} V \right)$, where
$S_{ij} = (\mu_i - \mu_j)(\mu_i - \mu_j)^T$ and the weights $\omega_{ij}$ are all equal to one. Loog
et al. suggest to make these weights inversely proportional to their pairwise accuracy
(defined as one minus the Bayes error). Similarly, we can define a weighted version
of the within-class scatter matrix $\tilde{S}_W = \sum_{c=1}^{C}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c} \omega_{ckl}\,(x_{ck} - x_{cl})(x_{ck} - x_{cl})^T$.
In LDA, the $\omega_{ckl}$ are all equal to one. In the weighted version, the $\omega_{ckl}$ are defined ac-
cording to the importance of each sample in classification. Using the same notation, a weighted
between-class scatter matrix can be defined. Note
that in these two definitions, the priors have been combined with the weights to
simplify the notation.
All the methods introduced in the preceding paragraphs assume the class distribu-
tions are unimodal Gaussians. To address this limitation, Subclass DA (SDA) [116]
redefines the between-class scatter matrix as the between-subclass scatter matrix
$$S_B = \sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\,(\mu_{ij} - \mu_{kl})(\mu_{ij} - \mu_{kl})^T, \qquad (2.1)$$
where $p_{ij} = n_{ij}/n$ is the prior of the $j$th subclass of class $i$, $n_{ij}$ is the number of
samples in that subclass, $\mu_{ij}$ is its sample mean, and $H_i$ is the number of subclasses of class $i$.
The algorithms summarized thus far assume the class (or subclass) distributions
are Gaussian. To relax this assumption on the underlying distribu-
tions, [57] defines a within-class similarity metric using the Chernoff distance, yielding
a heteroscedastic version of LDA. Another option is to
use an embedding approach such as Locality Preserving Projection (LPP) [43]. LPP
finds that subspace where the structure of the data is locally preserved, allowing for
nonlinear classifications. A different alternative is to employ a kernel
function which intrinsically maps the original data distributions to a space where
these adapt to the assumptions of the approach in use. KDA [67, 5] redefines the
within- and between-class scatter matrices in the kernel space to derive feature ex-
traction algorithms that are nonlinear in the original space but linear in the kernel
space. The sample covariance and between-class scatter matrices in the kernel space are given by
$\Sigma_X^\Phi = \frac{1}{n}\sum_{i=1}^{n}(\phi(x_i) - \mu^\Phi)(\phi(x_i) - \mu^\Phi)^T$ and $S_B^\Phi = \sum_{i=1}^{C} p_i\,(\mu_i^\Phi - \mu^\Phi)(\mu_i^\Phi - \mu^\Phi)^T$,
where $\mu^\Phi = \frac{1}{n}\sum_{i=1}^{n}\phi(x_i)$ is the kernel sample mean, and $\mu_i^\Phi = \frac{1}{n_i}\sum_{j=1}^{n_i}\phi(x_{ij})$ is the
kernel sample mean of class $i$. Since the explicit mapping may be unavailable or too high-dimensional,
one generally uses the kernel trick, which works as follows. Let $A^\Phi$ and $B^\Phi$ be
two metrics in the kernel space and $V$ the projection matrix obtained by $A^\Phi V = B^\Phi V \Lambda$. We know from the Representer's Theorem [96] that the resulting projection
matrix can be defined as a linear combination of the samples in the kernel space $\phi(X)$
with the coefficient matrix $\Gamma$, i.e., $V = \phi(X)\Gamma$. Hence, to calculate the projection
matrix, it suffices to solve $A\,\Gamma = B\,\Gamma\Lambda$, where
$A = \phi(X)^T A^\Phi \phi(X)$ and $B = \phi(X)^T B^\Phi \phi(X)$ are the two metrics that need to be
computed. For example, the metric for $\Sigma_X^\Phi$ is $B_{\Sigma_X^\Phi} = \frac{1}{n} K (I_n - P_n) K$,
where $K = \phi(X)^T\phi(X)$ is the kernel (Gram) matrix, $I_n$ the identity matrix, and $P_n$ is the $n \times n$ matrix with
all elements equal to $1/n$. Similarly, the metric $B_{S_W^\Phi}$ of the kernel within-class scatter matrix can be computed, where
$\Sigma_i^\Phi = \frac{1}{n_i}\sum_{j=1}^{n_i}(\phi(x_{ij}) - \mu_i^\Phi)(\phi(x_{ij}) - \mu_i^\Phi)^T$ is the kernel within-class covariance matrix of class
$i$, and $K_i = \phi(X)^T \phi(X_i)$ is the subset of the kernel matrix for the samples in class $i$.
The metric for $S_B^\Phi$ can be obtained as $A_{S_B^\Phi} = \sum_{i=1}^{C} p_i\,(K_i \mathbf{1}_{n_i} - K \mathbf{1}_n)(K_i \mathbf{1}_{n_i} - K \mathbf{1}_n)^T$,
where $\mathbf{1}_{n_i}$ is a vector with all elements equal to $1/n_i$. The coefficient matrix for KDA
is given by $B_{KDA}^{-1} A_{KDA}\, \Gamma_{KDA} = \Gamma_{KDA} \Lambda_{KDA}$, where $A_{KDA} = A_{S_B^\Phi}$, and $B_{KDA}$ can be either $B_{\Sigma_X^\Phi}$
or $B_{S_W^\Phi}$.
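The sketch below assembles these kernel metrics from a precomputed Gram matrix and solves for Γ; the pseudo-inverse and the choice of B = B_{Σ_X^Φ} are implementation conveniences under the formulas above, not prescriptions of the text.

```python
import numpy as np

def kda(K, y):
    # Kernel DA from the Gram matrix K (n x n) and labels y.
    n = len(y)
    Pn = np.full((n, n), 1.0 / n)
    B = K @ (np.eye(n) - Pn) @ K / n              # metric for the kernel covariance
    A = np.zeros((n, n))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        d = K[:, idx].mean(axis=1) - K.mean(axis=1)   # K_i 1_{n_i} - K 1_n
        A += (len(idx) / n) * np.outer(d, d)          # metric for S_B in the kernel space
    evals, Gamma = np.linalg.eig(np.linalg.pinv(B) @ A)
    order = np.argsort(-evals.real)
    return Gamma.real[:, order]                   # coefficient matrix, columns sorted
```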
We can similarly derive kernel approaches for the other methods introduced above.
For example, in Kernel NDA (KNDA), the metric $A$ is obtained by defining the
nonparametric between-class scatter matrix in the kernel space,
$$A_{KNDA} = \phi(X)^T S_b^\Phi \phi(X) = \sum_{i=1}^{C}\sum_{\substack{j=1 \\ j\neq i}}^{C}\sum_{l=1}^{n_i} \alpha_{ijl}\,(\mathbf{k}_{il} - M_{jil}\mathbf{1}_k)(\mathbf{k}_{il} - M_{jil}\mathbf{1}_k)^T,$$
where $\mathbf{k}_{il} = \phi(X)^T \phi(x_{il})$ is the kernel space representation of the sample $x_{il}$, $M_{jil} = \phi(X)^T \phi(X_{jil})$ is the kernel matrix of the $k$-nearest neighbors of $x_{il}$, $X_{jil}$ is a matrix
whose columns are the $k$-nearest neighbors of $x_{il}$, $\mathbf{1}_k$ is a $k \times 1$ vector with all elements equal to $1/k$, and $\alpha_{ijl}$ is the normalizing factor
of NDA. In Kernel SDA (KSDA), the metric $A$ is given by the kernel between-subclass scatter matrix
[12]. This matrix is given by replacing the subclass means of (2.1) with the kernel
subclass means $\mu_{ij}^\Phi = \frac{1}{n_{ij}}\sum_{k=1}^{n_{ij}}\phi(x_{ijk})$. Now, we can use the kernel trick to obtain
$$A_{KSDA} = \sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\,(K_{ij}\mathbf{1}_{ij} - K_{kl}\mathbf{1}_{kl})(K_{ij}\mathbf{1}_{ij} - K_{kl}\mathbf{1}_{kl})^T,$$
where $K_{ij} = \phi(X)^T \phi(X_{ij})$ is the kernel matrix of the samples in the $j$th subclass of
class $i$, and $\mathbf{1}_{ij}$ is a $n_{ij} \times 1$ vector with all elements equal to $1/n_{ij}$.
The most commonly employed approach to determine the parameters of the kernel is CV, where we divide the train-
ing data into k parts: (k − 1) of them for training the algorithm with distinct values
for the parameters of the kernel, and the remaining one for validating which of these
values results in higher (average) classification rates. This solution has three major
drawbacks. First, the kernel parameters are only optimized for the training data, not
for unseen data. Second, the procedure is computationally
very demanding for large data-sets. Third, not all the training data can be used to op-
timize the parameters of the kernel. To avoid these problems, [98] defines a criterion
to maximize the kernel between-class difference and minimize the kernel within-class
scatter as Fisher had originally proposed but now applied to the selection of the
kernel parameters. This method was shown to yield higher classification accuracies
than CV. In addition, [49] proposed a kernel version of RDA where the kernel is learned as a linear combination
of base kernels. However, none of these approaches guarantees that the class distributions become linearly separable
in the kernel space. This would be ideal, because it would guarantee that the Bayes
classifier (which is the one with the smallest error in that space) is linear.
2.3 Homoscedastic criterion

The goal of the first criterion is to find a kernel which maps the original class
distributions to homoscedastic ones while keeping them as far apart from each other
as possible. This criterion is related to the approach presented in [41], where the goal
is also to obtain homoscedastic representations of the data.
The criterion we derive here could be extended to work in the complex sphere as well. We
wish to define a criterion which is maximized when all class covariances are identical. The value of the criterion
should also decrease as the distributions become more different. We now present a
formal result deriving such a criterion.
Theorem 1. Let $\Sigma_i^\Phi$ and $\Sigma_j^\Phi$ be the kernel covariance matrices of two Normal dis-
tributions in the kernel space defined by the function $\phi(\cdot)$. Then,
$$Q_1 = \frac{\text{tr}(\Sigma_i^\Phi \Sigma_j^\Phi)}{\text{tr}(\Sigma_i^{\Phi\,2}) + \text{tr}(\Sigma_j^{\Phi\,2})}$$
attains its maximum value of $.5$ when the two distributions are homoscedastic, i.e., $\Sigma_i^\Phi = \Sigma_j^\Phi$.

Proof. $\Sigma_i^\Phi$ and $\Sigma_j^\Phi$ are two $p \times p$ positive semi-definite matrices with spectral decom-
positions $\Sigma_i^\Phi = V_i \Lambda_i V_i^T$, where $V_i = (\mathbf{v}_{i1}, \dots, \mathbf{v}_{ip})$ and $\Lambda_i = \text{diag}(\lambda_{i1}, \dots, \lambda_{ip})$
are the eigenvector and eigenvalue matrices in the space defined by the
kernel. For a fixed kernel (and fixed kernel parameters), the value of the denominator of $Q_1$ is constant regardless
of any rotational divergence between $\Sigma_i^\Phi$ and $\Sigma_j^\Phi$. Hence, $\text{tr}(\Sigma_i^{\Phi\,2}) + \text{tr}(\Sigma_j^{\Phi\,2}) = \text{tr}(\Lambda_i^2) + \text{tr}(\Lambda_j^2)$.
We also know that $\text{tr}(\Sigma_i^\Phi \Sigma_j^\Phi) \leq \text{tr}(\Lambda_i \Lambda_j)$, with the equality holding when $V_i^T V_j = I$,
i.e., when the eigenvectors of the two covariance matrices are the same and in the same order.
Now, let us define every eigenvalue of $\Sigma_i^\Phi$ as a multiple of those of $\Sigma_j^\Phi$, i.e.,
$\lambda_{im} = k_m \lambda_{jm}$, $k_m \geq 0$. Then,
$$Q_1 \leq \frac{\sum_{m=1}^{p} k_m \lambda_{jm}^2}{\sum_{m=1}^{p} (k_m^2 + 1)\lambda_{jm}^2}.$$
From the above equation, we see that $Q_1 \geq 0$, since all its variables are positive.
The maximum value of $Q_1$ will be attained when all $k_m = 1$, which yields $Q_1 = .5$.
We now note that having all $k_m = 1$ implies that the eigenvalues of the two covariance
matrices are the same. We also know that the maximum of $Q_1$ can only be reached
when the eigenvectors are the same and in the same order, as stated above. This
means that the two Normal distributions are homoscedastic in the kernel space defined
by $\phi(\cdot)$.
From the above result, we see that we can already detect when two distributions
are homoscedastic in a kernel space. This means that for a given kernel function,
we can find those kernel parameters which give us Q1 = .5. Note that the closer we
get to this maximum value, the more similar the two distributions ought to be, since
their eigenvalues will become closer to each other. To show this, we would now like
to prove that when the value of Q1 increases, then the divergence between the two
distributions decreases.
A classical measure of divergence be-
tween samples from convex sets is the Bregman divergence [8]. Formally, for a given
strictly convex function $G(\cdot)$, the Bregman divergence between $X$ and $Y$ is
$$B_G(X, Y) = G(X) - G(Y) - \text{tr}\!\left(\nabla G(Y)^T (X - Y)\right),$$
where $X, Y \in \{Z \mid Z \in \mathbb{R}^{p \times p},\ Z = Z^T\}$, and $\nabla$ is the gradient.
Note that the definition given above for the Bregman divergence is very general. In
fact, many other divergence measures (such as the Kullback-Leibler) as well as several
commonly employed distances (e.g. Mahalanobis and Frobenius) are a particular case
of Bregman's. Consider the case where $G(X) = \text{tr}(X^T X)$, which computes the
squared Frobenius norm. In this case, the Bregman divergence is
$B_G(\Sigma_1^\Phi, \Sigma_2^\Phi) = \text{tr}\!\left((\Sigma_1^\Phi - \Sigma_2^\Phi)^T(\Sigma_1^\Phi - \Sigma_2^\Phi)\right)$, where $\Sigma_1^\Phi$ and $\Sigma_2^\Phi$ are the kernel covariance
matrices of the two distributions that we wish to compare. We can also rewrite this as
$$B_G(\Sigma_1^\Phi, \Sigma_2^\Phi) = \text{tr}(\Sigma_1^{\Phi\,2}) + \text{tr}(\Sigma_2^{\Phi\,2}) - 2\,\text{tr}(\Sigma_1^\Phi \Sigma_2^\Phi).$$
Note that to decrease the divergence (i.e., the value of BG ), we need to minimize
$\text{tr}(\Sigma_1^{\Phi\,2}) + \text{tr}(\Sigma_2^{\Phi\,2})$ and/or maximize $\text{tr}(\Sigma_1^\Phi \Sigma_2^\Phi)$. The more we lower the former and
increase the latter, the smaller the Bregman divergence will be. Similarly, when we
decrease the value of $\text{tr}(\Sigma_1^{\Phi\,2}) + \text{tr}(\Sigma_2^{\Phi\,2})$ and/or increase that of $\text{tr}(\Sigma_1^\Phi \Sigma_2^\Phi)$, we make
the value of Q1 larger. Hence, as the value of our criterion Q1 increases, the Bregman
divergence between the two distributions decreases, i.e., the two distributions become
more alike. This result is illustrated in Fig. 2.1. We can formally summarize this
result as follows.

Theorem 2. Maximizing the criterion $Q_1$ minimizes the Bregman divergence
$B_G(\Sigma_1^\Phi, \Sigma_2^\Phi)$ between the two kernel covariance matrices $\Sigma_1^\Phi$ and $\Sigma_2^\Phi$, where $G(X) = \text{tr}(X^T X)$.
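The relation stated in Theorem 2 is easy to verify numerically. The sketch below reproduces the setting of Fig. 2.1 in two dimensions, rotating one covariance matrix against the other; the specific eigenvalues are an arbitrary choice for illustration.

```python
import numpy as np

def q1(S1, S2):
    return np.trace(S1 @ S2) / (np.trace(S1 @ S1) + np.trace(S2 @ S2))

def bregman(S1, S2):
    # B_G with G(X) = tr(X^T X), i.e., the squared Frobenius distance
    return np.sum((S1 - S2) ** 2)

S1 = np.diag([2.0, 0.5])
for theta in np.radians([0, 45, 90]):
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S2 = R @ S1 @ R.T                # equal covariance up to rotation
    print(q1(S1, S2), bregman(S1, S2))
# As theta grows, Q1 decreases while the Bregman divergence increases.
```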
Figure 2.1: Three examples of the use of the homoscedastic criterion, Q1. The examples
are for two Normal distributions with equal covariance matrix up to scale and rotation.
(a) The value of Q1 decreases as the angle θ increases. The 2D rotation between the two
distributions is on the x axis. The value of Q1 is on the y axis. (b) When θ = 0°, the
two distributions are homoscedastic, and Q1 takes its maximum value of .5. Note how for
distributions that are close to homoscedastic (i.e., θ ≈ 0°), the value of the criterion remains
high. (c) When θ = 45°, the value has decreased to about .4. (d) By θ = 90°, Q1 ≈ .3.
We have now shown that the criterion Q1 increases as any two distributions be-
come more similar to one another. We can readily extend this result to the multiple
distribution case,
$$Q_1(\phi) = \frac{2}{C(C-1)} \sum_{i=1}^{C-1}\sum_{k=i+1}^{C} \frac{\text{tr}(\Sigma_i^\Phi \Sigma_k^\Phi)}{\text{tr}(\Sigma_i^{\Phi\,2}) + \text{tr}(\Sigma_k^{\Phi\,2})}, \qquad (2.3)$$
where $\Sigma_i^\Phi$ is the sample covariance matrix of the $i$th class in the kernel space. This criterion measures
the average homoscedasticity of all pairwise class distributions.
This criterion can be directly used in KDA, KNDA and others. Moreover, the
subclass extension is
$$Q_1(\phi, H_1, \dots, H_C) = \frac{1}{h} \sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} \frac{\text{tr}(\Sigma_{ij}^\Phi \Sigma_{kl}^\Phi)}{\text{tr}(\Sigma_{ij}^{\Phi\,2}) + \text{tr}(\Sigma_{kl}^{\Phi\,2})},$$
where $\Sigma_{ij}^\Phi$ is the sample covariance matrix of the $j$th subclass of class $i$, and $h$ is the
number of summands.
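Under the kernel trick, each trace in these criteria can be computed from Gram matrices alone, using the identity tr(Σ_a^Φ Σ_b^Φ) = ||Φ̄_a^T Φ̄_b||_F² / (n_a n_b), where Φ̄ denotes the class data centered in the kernel space. The sketch below applies this identity with an RBF kernel; the identity is a routine derivation under these definitions, not a formula quoted from the text.

```python
import numpy as np

def rbf_gram(A, B, sigma):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / sigma)

def tr_cov_prod(Xa, Xb, sigma):
    # tr(Sigma_a Sigma_b) in the kernel space from the centered cross-Gram matrix
    na, nb = len(Xa), len(Xb)
    Ca = np.eye(na) - np.full((na, na), 1.0 / na)
    Cb = np.eye(nb) - np.full((nb, nb), 1.0 / nb)
    M = Ca @ rbf_gram(Xa, Xb, sigma) @ Cb
    return np.sum(M**2) / (na * nb)

def q1_phi(classes, sigma):
    # Q1(phi) of (2.3); `classes` is a list of per-class sample arrays
    C, total = len(classes), 0.0
    for i in range(C - 1):
        for k in range(i + 1, C):
            num = tr_cov_prod(classes[i], classes[k], sigma)
            den = tr_cov_prod(classes[i], classes[i], sigma) \
                + tr_cov_prod(classes[k], classes[k], sigma)
            total += num / den
    return 2.0 * total / (C * (C - 1))
```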
The reason we needed to derive the above criterion is because, in the multi-class
case, the addition of the Bregman divergences could cancel each other out. The normalization in (2.3) keeps each pairwise term within the same range.
It may now seem that the criterion Q1 is ideal for all kernel versions of DA. To
study this further, let us define a particular kernel function. An appropriate kernel
is the RBF (Radial Basis Function), because it is specifically tailored for Normal
the Bayes classifier is linear in this RBF kernel space, it does not guarantee that
the class distributions will be separable. In fact, it can be shown that Q1 may
favor a kernel map where all (sub)class distributions become the same, i.e., identical
covariance matrix and mean. Indeed a particular but useless case of homoscedasticity
in classification problems.
Theorem 3. Let the RBF kernel be $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma}\right)$, with scale parameter $\sigma$. In the two class problem, $C = 2$, let the pairwise between-class distances be $\{D_{11}, D_{12}, \ldots, D_{n_1 n_2}\}$, where $D_{ij} = \|x_i - x_j\|_2^2$ is the (squared) Euclidean distance calculated between two sample vectors, $x_i$ and $x_j$, of different classes, and $n_1$ and $n_2$ are the number of elements in each class. Similarly, let the pairwise within-class distances be $\{d_{111}, d_{112}, \ldots, d_{1 n_1 n_1}, d_{211}, d_{212}, \ldots, d_{2 n_2 n_2}\}$, where $d_{ckl} = \|x_{ck} - x_{cl}\|_2^2$ is the squared Euclidean distance between sample vectors of the same class $c$. And, let $S_W$ and $S_B$ denote the within- and between-class scatter matrices estimated with these distances. Then, if $\mathrm{tr}(S_B) > \mathrm{tr}(S_W)$, $Q_1(\cdot)$ monotonically increases with $\sigma$, i.e., $\frac{\partial Q_1}{\partial \sigma} \geq 0$.
Proof. Note that both the numerator and denominator of $Q_1$ can be written in the form $\sum_i \sum_j \exp\left(-2\|x_i - x_j\|_2^2/\sigma\right)$. The partial derivative of each such term with respect to $\sigma$ is $\sum_i \sum_j \frac{2\|x_i - x_j\|_2^2}{\sigma^2} \exp\left(-2\|x_i - x_j\|_2^2/\sigma\right)$. Substituting for $D_{ij}$ and $d_{ckl}$, and multiplying through by the (positive) squared denominator $\left[\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c} e^{-2 d_{ckl}/\sigma}\right]^2$, the condition $\partial Q_1/\partial\sigma \geq 0$ becomes
$$\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \frac{2 D_{ij}}{\sigma^2}\, e^{-\frac{2 D_{ij}}{\sigma}} \sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c} e^{-\frac{2 d_{ckl}}{\sigma}} \;\geq\; \sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c} \frac{2 d_{ckl}}{\sigma^2}\, e^{-\frac{2 d_{ckl}}{\sigma}} \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} e^{-\frac{2 D_{ij}}{\sigma}}.$$
The left hand side of this inequality is a weighted estimate of the between-class variance, while the right hand side is a weighted estimate of the within-class variance, since the $D_{ij}$ and $d_{ckl}$ are the pairwise distances defining $\mathrm{tr}(S_B)$ and $\mathrm{tr}(S_W)$. Hence, for these weights, we have $\partial Q_1/\partial\sigma \geq 0$ when $\mathrm{tr}(S_B) > \mathrm{tr}(S_W)$.
This latest theorem shows that when $\sigma$ approaches infinity, $\partial Q_1/\partial\sigma$ approaches zero and, hence, $Q_1$ tends to its maximum value of .5. Increasing $\sigma$ to infinity in the RBF kernel will result in a space where the two class distributions become identical. This will happen whenever $\mathrm{tr}(S_B) > \mathrm{tr}(S_W)$. This is a fundamental theorem of DA because it shows the relation between KDA, the weighted LDA version of [58] and the NDA method of [31]. Theorem 3 shows that these variants of DA are related to the idea of homoscedasticity introduced above and to the importance of the metrics in weighted LDA and NDA. In particular, the above result proves that if, after proper normalization, the between-class differences are larger than the within-class differences, then classification in the kernel space optimized with $Q_1$ alone will be as bad as random selection. One indeed wants the class distributions to become homoscedastic in the kernel space, but not at the cost of classification accuracy. To preserve the latter, class separability, given by the trace of the between-class (or -subclass) scatter matrix, must also be maximized, since this is a direct measure of class separation.
In KSDA, we optimize the number of subclasses and the kernel parameters jointly. Also, recall that in KSDA (as in SDA), we need to divide the data into subclasses. As stated above, we assume that the underlying class distribution can be approximated using a mixture of Gaussians. To define the subclass divisions, the samples of each class $\hat{X}_c$ are first sorted, where $x_1$ is a chosen seed sample among the feature vectors and $x_k$ is the $(k-1)$th feature vector closest to $x_1$. This ordering allows us to divide the set of samples into $H$ subgroups, by simply dividing $\hat{X}_c$ into $H$ parts. This approach has been shown to be appropriate for finding subclass divisions [116].
As a final note, it is worth emphasizing that, as opposed to CV, the derived criterion will use the whole data in the training set for estimating the data distributions, because there is no need for a verification set. With a limited number of training samples, this will generally yield better estimates of the unknown underlying distribution. The other advantage of the derived approach is that it can be optimized using gradient-based techniques; we employ a quasi-Newton method. An advantage of this method is that it has fast convergence and does not require the direct calculation of the Hessian matrix. Instead, the Hessian is updated by analyzing the gradient vectors. The derivation of the gradient of our criterion is shown in the section to follow. The initial value for the kernel parameter is set to be the mean of the pairwise distances between training samples.
2.3.2 Derivation of the Gradient
We take $k(\cdot, \cdot)$ to be the RBF function, $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, with $\sigma$ the parameter to be optimized. And, we consider the case where each class distribution is modeled by a single Gaussian distribution. The derivations for the subclass case are analogous. The gradient of our criterion $Q(\cdot)$ with respect to $\sigma$, when considering the RBF kernel, can be obtained term by term as follows.
Note that $\Sigma_i = \Phi(X_i)(I - 1_{n_i})\Phi(X_i)^T$, where $\Phi(X_i) = (\phi(x_{i1}), \ldots, \phi(x_{i n_i}))$ and $1_{n_i}$ is an $n_i \times n_i$ matrix with all elements equal to $1/n_i$. Hence, $\mathrm{tr}(\Sigma_i \Sigma_k) = \mathrm{tr}\left(\Phi(X_i)(I - 1_{n_i})\Phi(X_i)^T \Phi(X_k)(I - 1_{n_k})\Phi(X_k)^T\right) = \mathrm{tr}\left(K_{ki}(I - 1_{n_i}) K_{ik}(I - 1_{n_k})\right)$, where $K_{ik} = \Phi(X_i)^T \Phi(X_k)$. Let $\bar{K}_{ki} = K_{ki}(I - 1_{n_i})$ and $\bar{K}_{ik} = K_{ik}(I - 1_{n_k})$. We can rewrite this result as
$$\mathrm{tr}(\bar{K}_{ki}\bar{K}_{ik}) = \sum_p \sum_q \bar{K}^{pq}_{ki}\, \bar{K}^{qp}_{ik},$$
where, for the RBF function, $\frac{\partial K^{pq}}{\partial\sigma} = \frac{\|x_p - x_q\|^2}{\sigma^3}\exp\left(-\frac{\|x_p - x_q\|^2}{2\sigma^2}\right)$. Then,
$$\frac{\partial\,\mathrm{tr}(\Sigma_i \Sigma_k)}{\partial\sigma} = \frac{\partial\,\mathrm{tr}(\bar{K}_{ki}\bar{K}_{ik})}{\partial\sigma} = \sum_p \sum_q \left(\frac{\partial \bar{K}^{pq}_{ki}}{\partial\sigma}\,\bar{K}^{qp}_{ik} + \bar{K}^{pq}_{ki}\,\frac{\partial \bar{K}^{qp}_{ik}}{\partial\sigma}\right),$$
with $\frac{\partial \bar{K}_{ki}}{\partial\sigma} = \frac{\partial K_{ki}}{\partial\sigma}(I - 1_{n_i})$ and $\frac{\partial \bar{K}_{ik}}{\partial\sigma} = \frac{\partial K_{ik}}{\partial\sigma}(I - 1_{n_k})$.
Next, note that $Q_2(\sigma)$ can be written as
$$Q_2(\sigma) = \sum_{i=1}^{C-1}\sum_{k=i+1}^{C} p_i\, p_k\, d_{ik},$$
where $d_{ik} = (\mu_i - \mu_k)^T(\mu_i - \mu_k)$ is the squared distance between the means of classes $i$ and $k$ in the kernel space. To see that the solution found with such a gradient descent technique is an appropriate one, recall that Theorem 3 showed $Q_1$ monotonically increases if $\mathrm{tr}(S_B) > \mathrm{tr}(S_W)$. In most practical problems this condition is satisfied, since otherwise the classes mostly overlap and the classification problem is not solvable (i.e., there is a very large classification error in the original feature space). This means there is an identifiable
global maximum. We now note that the same applies to Q2 . That is, as long as the
class distributions do not overlap significantly, Q2 has a unique maximum for a sigma
value in between the averaged within class sample distances and the averaged between
class sample distances. To see this, note that for every Q2 calculated for a pair of
classes (i.e., classes 1 and 2), there are three main components: the sum of the kernel
matrix elements in class 1, in class 2, and between classes 1 and 2. Each of these
components monotonically increases with respect to sigma (starting with 1/n1 , 1/n2 ,
0, and converging to 1). The fastest increases occur for sigma around the averaged
distance in that component; e.g., for within class 1, this will be around the averaged
distance of the samples in that class. This means that the within class components
will converge earlier than the between-class components. Hence, the sum of the within-class elements minus two times the between-class elements (in the kernel matrix) will result in a maximum in between the averaged within-class sample distances and the averaged between-class sample distances.
In some applications where our conditions may not hold, it would be appropriate to test a few starting values to determine the best solution. We did not require this in any of our experiments.
2.3.3 Generalization
A good learning model should have a small generalization error, i.e., a small expected error on the unobserved data. The generalization error
mainly depends on the number of samples in our training set, training error and the
model (criterion) complexity [42]. Since the training set is usually fixed, we are left
to select a proper model. Smooth (close to linear) classifiers have a small model com-
plexity but large training error. On the other hand wiggly classifiers may have a small
training error but large model complexity. To have a small generalization error, we
need to select a model that has moderate training error and model complexity. Thus,
in general, the simpler the classifier, the smaller the generalization error. However, if
the classifier is too simple, the training error may be very large.
KDA is limited in terms of model complexity. This is mainly because KDA assumes that each class is mapped to a single Gaussian in the kernel space. If there is a multimodal structure in each class, KDA would select wiggly functions in order to minimize the
Figure 2.2: Here we show a two class classification problem with multi-modal class distributions. When $\sigma = 1$, both KDA (a) and KSDA (b) generate solutions that have small training error. (c) However, when the model complexity is small, $\sigma = 3$, KDA fails. (d) KSDA's solution resolves this problem with piecewise smooth, nonlinear classifiers.
classification error. To avoid this, the model complexity may be limited to smooth
solutions, which would generally result in large training errors and, hence, large gen-
eralization errors.
This limitation can be resolved with subclass representations, e.g., KSDA. While KDA can find wiggly functions to separate multimodal data, KSDA can find several functions which are smoother and carry smaller training errors. We can illustrate this theoretical advantage of KSDA with a simple 2-class classification example, Fig. 2.2. In this figure, each class consists of 2 nonlinearly separable subclasses. Fig. 2.2(a) shows the solution of KDA obtained with the RBF kernel with $\sigma = 1$. Fig. 2.2(b) shows the KSDA solution. KSDA can obtain a classification function that has the same training error with smaller model complexity. When $\sigma$ is increased to 3, KDA leads to a large training error, Fig. 2.2(c). This does not occur in KSDA, Fig. 2.2(d). A similar argument can be used to explain the problems faced with Maximum Likelihood (ML) classification when modeling the original data as a Mixture of Gaussians (MoG) in the original space. Unless one has access to a
Figure 2.3: The original data distributions are mapped to different kernel spaces via different mapping functions $\phi(\cdot)$. $\phi_2(\cdot)$ is better than $\phi_1(\cdot)$ in terms of the Bayes error.
sufficiently large training set (i.e., proportional to the number of dimensions of this original space), the ML estimates of the MoG will be inaccurate.
2.4 Kernel Bayes accuracy criterion

The second criterion we will define in this chapter is directly related to the concept of Bayes classification error. The idea is to learn the kernel parameters by finding a kernel representation where the Bayes classification error is minimized across all the mappings. This is illustrated in Figure 2.3. We start with an analysis of LDA. One of the drawbacks of LDA is that its solution is biased toward those classes that are furthest apart. To see this, note that LDA is based on a least-squares (i.e., eigenvalue decomposition) solution, and hence the LDA solution tends to over-weight the classes that were already well-separated in the original space. In order to downplay the roles of the class distributions that are already separated, a weighting function $\omega(\cdot)$ can be incorporated into the between-class scatter matrix,
$$S_B = \sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\Delta_{ij})\,(\mu_i - \mu_j)(\mu_i - \mu_j)^T, \qquad (2.6)$$
where $\Delta_{ij}^2 = (\mu_i - \mu_j)^T \Sigma_X^{-1} (\mu_i - \mu_j)$ is the Mahalanobis distance between classes $i$ and $j$, $\omega: \mathbb{R}_0^+ \to \mathbb{R}_0^+$ is a weighting function given by $\omega(\Delta_{ij}) = \frac{1}{2\Delta_{ij}^2}\,\mathrm{erf}\!\left(\frac{\Delta_{ij}}{2\sqrt{2}}\right)$, and $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\, dt$ is the error function.
One advantage of (2.6) is that it is related to the mean pairwise Bayes accuracy, where $S_{ij} = (\mu_i - \mu_j)(\mu_i - \mu_j)^T$ are the pairwise class distances and, for simplicity, we have assumed equal priors. This weighting can be employed to improve LDA [58]. We want to derive a similar function for its kernel counterpart.
Let $\phi(\cdot): \mathbb{R}^p \to \mathcal{F}$ be a function defining the kernel map. We also assume the data has already been whitened in the kernel space. Denote the data matrix in the kernel space $\Phi(X)$, where $\Phi(X) = (\phi(x_{11}), \ldots, \phi(x_{i n_i}), \ldots, \phi(x_{C n_C}))$. The kernel matrix is $K = \Phi(X)^T \Phi(X)$. Using this notation, the covariance matrix in the kernel space can be written as $\Sigma_X = \frac{1}{n}\Phi(X)(I_n - P_n)\Phi(X)^T$, where $I_n$ is the $n \times n$ identity matrix, and $P_n$ is an
$n \times n$ matrix with all elements equal to $1/n$. The whitened data matrix $\hat{\Phi}(X)$ is now given by $\hat{\Phi}(X) = \Lambda^{-\frac{1}{2}} V^T \Phi(X)$, where $\Lambda$ and $V$ are the eigenvalue and eigenvector matrices given by $\Sigma_X V = V \Lambda$. We know from the Representer's Theorem [96] that a projection vector lies in the span of the samples in the kernel space $\Phi(X)$, i.e., $V = \Phi(X)\Gamma$ for a coefficient matrix $\Gamma$. Hence,
$$\hat{\Phi}(X) = \Lambda^{-\frac{1}{2}} V^T \Phi(X) = \Lambda^{-\frac{1}{2}} \Gamma^T \Phi(X)^T \Phi(X) = \Lambda^{-\frac{1}{2}} \Gamma^T K,$$
a transformation that maps the covariance matrix $\Sigma_X$ into the identity matrix. The class means in the whitened kernel space are $\mu_i = \hat{\Phi}(X_i)\mathbf{1}_i$, where $\Phi(X_i) = (\phi(x_{i1}), \ldots, \phi(x_{i n_i}))$, and $\mathbf{1}_i$ is a $n_i \times 1$ vector with all elements equal to $1/n_i$. Let $\hat{K}_i = \hat{\Phi}(X)^T \hat{\Phi}(X_i)$ denote the subset of the whitened kernel matrix for the samples in class $i$.
Combining the above results, we can define the Bayes accuracy in the kernel space as
$$Q(\sigma) = \sum_{m=1}^{d}\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\Delta_{ij})\, e_m^T S_{ij}\, e_m, \qquad (2.9)$$
where $e_1, \ldots, e_d$ are the eigenvectors of the weighted kernel between-class scatter matrix
$$\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\Delta_{ij})\, S_{ij},$$
$S_{ij} = (\mu_i - \mu_j)(\mu_i - \mu_j)^T$, and the Mahalanobis distance $\Delta_{ij}$ in the whitened kernel space is
$$\Delta_{ij}^2 = (\mu_i - \mu_j)^T(\mu_i - \mu_j) = \mathbf{1}_i^T \hat{K}_{ii}\mathbf{1}_i - 2\,\mathbf{1}_i^T \hat{K}_{ij}\mathbf{1}_j + \mathbf{1}_j^T \hat{K}_{jj}\mathbf{1}_j,$$
where $K_{ij} = \Phi(X_i)^T \Phi(X_j)$ is the subset of the kernel matrix for the samples in classes $i$ and $j$.
From the Representer's Theorem [96], we know that $e_i = \hat{\Phi}(X) u_i$, where $u_i$ is a coefficient vector. Then, using (2.8) we have $e_m^T S_{ij} e_m = u_m^T \hat{S}_{ij} u_m$, where $\hat{S}_{ij} = (\hat{K}_i \mathbf{1}_i - \hat{K}_j \mathbf{1}_j)(\hat{K}_i \mathbf{1}_i - \hat{K}_j \mathbf{1}_j)^T$, and $u_1, \ldots, u_d$ are the eigenvectors of $\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\Delta_{ij})\,\hat{S}_{ij}$.
We will refer to the derived criterion given in (2.11) as the Kernel Bayes Accuracy (KBA) criterion.
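A compact way to evaluate (2.9) is to note that $Q(\sigma)$ equals the sum of the $d$ largest eigenvalues of the weighted between-class scatter built from the whitened class means. The sketch below is an assumed implementation along those lines (whitening via the eigendecomposition of the centered kernel matrix); it is an illustration, not a verbatim transcription of the derivation:

```python
import numpy as np
from scipy.special import erf

def kba_criterion(K, labels, d=None):
    # K: uncentered n x n kernel matrix; labels: 1-D integer array with the
    # class index of every sample. Assumes distinct class means (d_ij > 0).
    labels = np.asarray(labels)
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    Kc = H @ K @ H                            # doubly centered kernel matrix
    delta, U = np.linalg.eigh(Kc)
    keep = delta > 1e-10 * delta.max()
    Z = np.sqrt(n) * U[:, keep]               # whitened sample coordinates
    classes = np.unique(labels)
    p = np.array([(labels == c).mean() for c in classes])
    mu = np.stack([Z[labels == c].mean(axis=0) for c in classes])
    B = np.zeros((Z.shape[1], Z.shape[1]))
    for i in range(len(classes) - 1):
        for j in range(i + 1, len(classes)):
            diff = mu[i] - mu[j]
            dij = np.linalg.norm(diff)
            w = erf(dij / (2 * np.sqrt(2))) / (2 * dij ** 2)  # aPAC weighting
            B += p[i] * p[j] * w * np.outer(diff, diff)
    # Q = sum over the top-d eigenvectors e_m of e_m^T B e_m,
    # i.e., the sum of the d largest eigenvalues of B.
    d = len(classes) - 1 if d is None else d
    return np.linalg.eigvalsh(B)[::-1][:d].sum()
```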
The first application of the above derived criterion is in determining the value of the parameters of a kernel function. For example, if we are given the Radial Basis Function (RBF) kernel, $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, our goal is to determine an appropriate value for the scale parameter $\sigma$.
To determine our solution, we employ a quasi-Newton method with a Broyden-Fletcher-Goldfarb-Shanno (BFGS) Hessian update.
To compute the derivative of our criterion, note that (2.11) can be rewritten as
$$Q(\sigma) = \mathrm{tr}\left(\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\Delta_{ij})\,\hat{S}_{ij}\right) = \sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\Delta_{ij})\,\mathrm{tr}(\hat{S}_{ij}).$$
Taking the partial derivative with respect to $\sigma$ in the RBF kernel, we have
$$\frac{\partial Q(\sigma)}{\partial\sigma} = \sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\left[\frac{\partial\omega(\Delta_{ij})}{\partial\sigma}\,\mathrm{tr}(\hat{S}_{ij}) + \omega(\Delta_{ij})\,\frac{\partial\,\mathrm{tr}(\hat{S}_{ij})}{\partial\sigma}\right].$$
Denote the partial derivative of an $m \times n$ kernel matrix $K$ with respect to $\sigma$ as $\frac{\partial K}{\partial\sigma} = \left[\frac{\partial K_{ij}}{\partial\sigma}\right]_{i=1,\ldots,m,\, j=1,\ldots,n}$, with $\frac{\partial K_{ij}}{\partial\sigma} = \frac{\partial k(x_i, x_j)}{\partial\sigma} = \frac{\|x_i - x_j\|^2}{\sigma^3}\exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$. Then
$$\frac{\partial\omega(\Delta_{ij})}{\partial\sigma} = \left[-\frac{\mathrm{erf}\left(\Delta_{ij}/2\sqrt{2}\right)}{\Delta_{ij}^3} + \frac{\exp\left(-\Delta_{ij}^2/8\right)}{2\sqrt{2\pi}\,\Delta_{ij}^2}\right]\frac{\partial\Delta_{ij}}{\partial\sigma},$$
where
$$\frac{\partial\Delta_{ij}}{\partial\sigma} = \frac{1}{2\Delta_{ij}}\left(\mathbf{1}_i^T\frac{\partial \hat{K}_{ii}}{\partial\sigma}\mathbf{1}_i - 2\,\mathbf{1}_i^T\frac{\partial \hat{K}_{ij}}{\partial\sigma}\mathbf{1}_j + \mathbf{1}_j^T\frac{\partial \hat{K}_{jj}}{\partial\sigma}\mathbf{1}_j\right).$$
The second application of the derived criterion is in determining the number of subclasses in Subclass Discriminant Analysis (SDA) [116] and its kernel extension. KDA assumes that each class has a single Gaussian distribution in the kernel space. However, this may be too restrictive, since it is usually difficult to find a kernel representation where the class distributions are single Gaussians. In order to relax this assumption, we can describe each class using a mixture of Gaussians. Using this idea, the criterion becomes
$$Q(\sigma, H_1, \ldots, H_C) = \sum_{m=1}^{d}\sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\,\omega(\Delta_{ij,kl})\, u_m^T \hat{S}_{ij,kl}\, u_m, \qquad (2.12)$$
where $H_i$ is the number of subclasses in class $i$, $u_1, \ldots, u_d$ are the $d$ eigenvectors of
$$\sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\,\omega(\Delta_{ij,kl})\,\hat{S}_{ij,kl},$$
$\hat{S}_{ij,kl} = (M_{ij}\mathbf{1}_{ij} - M_{kl}\mathbf{1}_{kl})(M_{ij}\mathbf{1}_{ij} - M_{kl}\mathbf{1}_{kl})^T$, $M_{ij} = \hat{\Phi}(X)^T \hat{\Phi}(X_{ij})$, $\hat{\Phi}(X_{ij}) = (\hat{\phi}(x_{ij1}), \ldots, \hat{\phi}(x_{ij n_{ij}}))$, $x_{ijk}$ is the $k$th sample of subclass $j$ in class $i$, $\mathbf{1}_{ij}$ is a $n_{ij} \times 1$ vector with all elements equal to $1/n_{ij}$, and $n_{ij}$ the number of samples in the $j$th subclass of class $i$. Note that in the above equation, the whitened Mahalanobis distance is given by
$$\Delta^2_{ij,kl} = (\mu_{ij} - \mu_{kl})^T(\mu_{ij} - \mu_{kl}) = \mathbf{1}_{ij}^T \hat{K}_{ij,ij}\mathbf{1}_{ij} - 2\,\mathbf{1}_{ij}^T \hat{K}_{ij,kl}\mathbf{1}_{kl} + \mathbf{1}_{kl}^T \hat{K}_{kl,kl}\mathbf{1}_{kl},$$
where $\hat{K}_{ij,kl} = \hat{\Phi}(X_{ij})^T \hat{\Phi}(X_{kl})$. The optimal kernel parameter and subclass divisions are given by
$$(\sigma^*, H_1^*, \ldots, H_C^*) = \arg\max_{\sigma,\, H_1, \ldots, H_C} Q(\sigma, H_1, \ldots, H_C).$$
In KSDA, we are thus simultaneously optimizing the kernel parameter and the number of subclasses per class. This allows us to approximate the Bayes optimal solution when the classes need to be described with a mixture of Gaussians and, at the same time, to recover the underlying structure of the data. This last point is important in many applications.
In our case study, we generated a set of 120 samples for each of the two classes.
Each class was represented by a mixture of two Gaussians, with mean and diagonal
covariance randomly initialized. Then, (2.12) was employed to determine the ap-
propriate number of subclasses and parameter of the RBF kernel. This process was
repeated 100 times, each with a different random initialization of the means and co-
variances. The average of the maxima of (2.12) for each value of Hi (with H1 = H2 )
are shown in Fig. 2.4(a). We see that the derived criterion is on average higher for
the correct number of subclasses. We then repeated the process described in this
paragraph for the cases of 3, 4 and 5 subclasses per class. The results are in Fig.
2.4(b-d). Again, the maximum of (2.12) corresponds to the correct number of subclasses, showing that the criterion can discover the underlying structure of the data. For comparison, in Fig. 2.4(e-h) we
show the plots of the Fisher criterion described earlier. We see that this criterion does
not recover the correct number of subclasses and is generally monotonically increas-
ing, thus, tending to select large values for Hi . This is because the Fisher criterion
maximizes the between-subclass scatter and, generally, the larger Hi , the larger the
scatter.
As a more challenging case, we also consider the well-known XOR data classifi-
cation problem, Fig. 2.5(a). The values of (2.12) for different Hi are plotted in Fig.
2.5(b) and those of the Fisher criterion in (c). Once more, we see that the KBA criterion recovers the underlying structure of the data, while the Fisher criterion does not.
Figure 2.4: Comparative results between the (a-d) KBA and (e-h) Fisher criteria. The
true underlying number of subclasses per class are (a,e) 2, (b,f) 3, (c,g) 4, and (d,h) 5. The
x-axis specifies the number of subclasses Hi . The y-axis shows the value of the criterion
given in (2.12) in (a-d) and of the Fisher criterion in (e-h).
Figure 2.5: (a) The classical XOR classification problem. (b) Plot of the KBA criterion
versus Hi . (c) Plot of the Fisher criterion.
2.5 Experimental Results
In this section, we will use our homoscedastic criterion to optimize the kernel parameter of KDA, KNDA and KSDA. We will give comparative results with CV, the Fisher criterion of [98], the use of the Bregman divergence, other nonlinear methods (Kernel PCA (KPCA), HLDA and LPP), and related linear approaches (LDA, NDA, RDA, SDA, and aPAC). The dimensionality of the reduced space is taken to be the rank of the matrices used by the DA approach, and to keep 90% of the variance in PCA and KPCA. We also provide comparisons with Kernel Support Vector Machines (KSVM) [93] and the use of ML in MoG [65], two classical alternatives for nonlinear classification.
The first five data-sets are from the UCI repository [7]. The Monk problem is given
by a 6-dimensional feature space defining six joints of a robot and two classes. Three
different case scenarios are considered, denoted Monk 1, 2 and 3. The Ionosphere
set corresponds to satellite imaging for the detection of two classes (structure or not)
in the ground. And, in the NIH Pima set, the goal is to detect diabetes from eight
measurements.
We also use the ETH-80 [53] database. It includes a total of 3,280 images of the following 8 categories: apples, pears, cars, cows, horses, dogs, tomatoes and cups. Each category includes 10 objects (e.g., ten apples), Figure 2.6. Each of the 80 objects has been photographed from 41 orientations. We resized all the images to 25 × 30 pixels. The pixel values in their vector form ($x \in \mathbb{R}^{750}$) are used in
Figure 2.6: Shown here are (a) 8 categories in ETH-80 database and (b) 10 different objects
for the cow category.
the leave-one-object-out test. That is, the images of 79 objects are used for training,
those of the remaining object for testing. We test all options and calculate the average
recognition rate.
We also use 100 randomly selected subjects from the AR face database [61]. All images are first aligned with respect to their eyes, mouth and jaw line before cropping and resizing them to a standard size of 29 × 21 pixels. This database contains images of two different sessions, each taken two weeks apart. The images in the first and second session contain the same facial expressions and occlusions and were taken under the same illumination conditions. We use the images in the first session for training and those of the second session for testing.
We also use the Sitting Posture Distribution Maps data-set (SPDM) of [117]. Here, samples were collected using a chair equipped with a pressure sensor sheet located on the sit-pan and back-rest. The pressure maps provide a total of 1,280 measurements per sample. Each participant provided five samples of each of the ten different postures. Our goal is to classify each of the samples into one of the ten sitting postures. This task is made difficult by the nonparametric nature of the samples in each class [117]. We randomly selected 3 samples from each individual and posture for training, and used the rest for testing.
The MNIST database of [52] is a large collection of sets of handwritten digits (0-9). The training set consists of 60,000 samples. The test set has 10,000 samples. All the digits have been size-normalized and centered in a fixed-size image. We use a subset with 3,000 samples in each class. This is done to reduce the size of the Gram matrix.
As defined above, we employ the RBF kernel. The kernel parameter in KPCA is optimized with CV. CV is also used in KDA, KNDA and KSDA, denoted: KDA_CV, KNDA_CV and KSDA_CV. The kernel parameter is searched in the range [m − 2st, m + 2st], where m and st are the mean and standard deviation of the distances between all pairwise training samples. We use 10-fold cross validation in the UCI data-sets and 5-fold cross validation in the others. In KNDA and KSDA, the number of nearest neighbors and subclasses are also optimized. In KSDA, we test partitions from 1 to 10 subclasses per class. We also optimize the kernel with the Fisher criterion approach of [98], denoted: KDA_F, KNDA_F and KSDA_F. The two parameters of LPP (i.e., the number of nearest neighbors, and the heat kernel) are optimized with CV. The algorithms optimized with the proposed Homoscedastic criterion are denoted: KDA_H, KNDA_H and KSDA_H. The same algorithms optimized using the Bregman divergence are denoted: KDA_B, KNDA_B and KSDA_B.
Results
The algorithms summarized above are first employed to find the subspace where the feature vectors of different classes are most separated according to each algorithm's criterion. Then, a classifier is applied in the reduced space.
In our first experiment, we use the nearest mean (NM) classifier. The NM is an ideal classifier because it provides the Bayes optimal solution whenever the class distributions are homoscedastic Gaussians [32]. Thus, the results obtained with the NM will illustrate whether the derived criterion has achieved the desirable goal. The results are shown in Table 2.1. We see that the kernel algorithms optimized with the proposed Homoscedastic criterion generally outperform the alternatives. To further illustrate this point, the table includes a rank of the algorithms following the approach of [20]. As predicted by our theory, the additional flexibility of KSDA yields the best overall rank.
Our second choice of classifier is the classical nearest neighbor (NN) algorithm. Its classification error is known to be less than twice the Bayes error. This makes it appropriate for the cases where the class distributions are not homoscedastic. These results are in Table 2.2. A recently proposed classification algorithm [75] emphasizes the approximation of the nonlinear decision boundary using the sample points and Tikhonov regularization. Since our criterion is used to make the classifier in the kernel space as linear as possible, smooth (close to linear) classifiers are consistent with this goal and should generally lead to better results. We present the results obtained with this smooth nearest-neighbor classifier in Table 2.3.
Table 2.1: Recognition rates (in percentages) with nearest mean
Data set ksdaH ksdaF ksdaB ksdaCV kdaH kdaF kdaB kdaCV kndaH kndaF kndaB kndaCV
ETH-80 82.6* 73.5 61.7 77.4 82.6* 81.6 61.7 71.6 76.2 74.6 65.6 73.6
AR database 88.1* 78.2 65.5 84.2 87.5* 86.7 69.5 84.2 71.3 61.4 72.5 74.3
SPDM 84.6* 80.1 67.9 83.9* 84.6* 83.2 67.9 83.3 82.4 82.9 53.4 75.6
Monk1 88.2* 85.0 71.1 88.0* 84.0 89.6* 65.3 83.1 70.1 65.7 50.0 63.4
Monk2 76.6 82.2* 56.7 74.5 80.1 75.2 55.6 70.1 73.5 64.8 61.8 71.8
Monk3 96.3* 88.7 85.4 94.0 93.1 89.7 85.7 82.4 67.6 63.7 77.8 66.4
Ionosphere 93.4 84.8 88.1 96.0* 93.4 86.1 67.6 80.8 74.8 62.3 65.6 78.2
Pima 80.4* 77.4 70.2 80.4* 78.6 75.0 75.0 72.6 65.5 67.3 70.8 66.7
Mnist 98.0* 96.9 92.0 97.4 98.1* 96.6 92.0 97.2 94.6 94.3 93.1 96.4
Rank 1.9* 7.0 13.3 3.6 2.8 5.4 14.2 9.2 12.2 14.7 15.8 13.3
Data set mog ksvm kpca pca lda nda apac hlda lpp rda sda
ETH-80 69.2 81.8 56.9 56.5 63.3 64.9 64.0 58.2 65.9 71.6 70.9
AR database 75.5 86.7 42.2 24.0 79.3 69.7 24.2 67.4 46.2 78.6 79.3
SPDM 73.4 84.7* 62.6 66.4 44.5 52.5 65.3 68.0 54.7 59.5 69.3
Monk1 80.3 83.6 67.4 66.0 64.6 64.8 66.0 66.2 44.4 72.0 66.7
Monk2 75.9 82.6 53.7 53.5 55.1 60.0 53.5 53.5 48.6 60.0 55.1
Monk3 89.4 93.5 78.9 80.6 63.9 81.3 80.6 81.3 75.5 86.3 80.8
Ionosphere 82.1 96.0 89.4 62.3 57.0 92.1 62.3 90.1 55.0 82.8 90.1
Pima 75.0 79.2 50.0 56.0 61.3 74.4 56.0 77.4 67.9 66.7 61.3
Mnist 88.6 97.6* 80.6 82.2 86.7 85.9 82.2 85.5 80.1 87.0 88.2
Rank 9.8 2.7 18.0 19.1 18.3 14.4 18.4 14.1 19.9 12.4 12.8
Note that the results obtained with the Homoscedastic criterion are generally better than
those given by the Fisher, Bregman and CV criteria. The best of the three results in each
of the discriminant methods is bolded. The symbol * is used to indicate the top result
among all algorithms. Rank goes from smallest (best) to largest.
Finally, recall that the goal of the Homoscedastic criterion is to make the Bayes classifier in the kernel space linear. If this goal were achieved, one would expect a linear classifier such as linear Support Vector Machines (SVM) to yield good classification results in the corresponding subspace. We verified this hypothesis in our final experiment, Table 2.4.
As mentioned earlier, the advantage of the proposed criterion is not only that it achieves higher classification rates, but that it does so at a lower computational cost, Table 2.5. Note that the proposed approach generally reduces the running time relative to CV.
Table 2.2: Recognition rates (%) with nearest neighbor
Data set ksdaH ksdaF ksdaB ksdaCV kdaH kdaF kdaB kdaCV kndaH kndaF kndaB kndaCV
ETH-80 82.8* 73.6 62.3 76.8 82.8* 81.0 62.3 71.6 76.2 74.6 68.0 70.6
AR database 96.7* 78.3 66.9 84.2 88.3 87.5 71.3 84.2 69.2 64.2 70.6 70.2
SPDM 84.9* 80.1 68.2 83.7 84.9* 84.2 68.2 83.3 73.9 75.6 33.5 70.3
Monk1 89.1* 84.5 78.2 87.5 84.3 89.6* 72.5 83.1 78.2 77.1 74.5 72.2
Monk2 77.8 83.1 86.1 75.7 80.1 75.2 77.6 70.1 85.0* 81.0 79.9 78.5
Monk3 94.4* 87.7 81.5 89.8 93.5 88.0 89.4 82.4 82.1 81.3 77.6 80.3
Ionosphere 94.4 84.8 91.4 94.0 94.4 86.5 70.9 80.8 87.4 86.1 90.1 86.1
Pima 75.0 73.8 66.7 76.8 70.2 69.8 64.9 72.6 67.3 67.3 66.1 69.1
Mnist 97.8* 96.9 91.8 97.2 97.2 97.1 91.8 96.7 95.6 95.4 92.1 95.5
Rank 2.9* 8.0 13.6 5.3 3.7 7.7 15.4 10.8 11.3 12.7 15.7 14.1
Data set mog ksvm kpca pca lda nda apac hlda lpp rda sda
ETH-80 69.2 81.8 62.2 64.3 64.3 59.8 73.6 56.5 63.6 71.6 70.6
AR database 75.5 86.7 42.5 58.6 77.7 77.0 59.1 67.5 41.8 78.6 77.7
SPDM 73.4 84.7 75.0 81.5 66.5 48.8 81.1 65.3 54.1 59.5 66.1
Monk1 80.3 83.6 90.3* 81.3 69.0 68.3 81.0 84.2 61.6 72.0 75.7
Monk2 75.9 82.6 68.3 66.7 67.4 82.6 79.6 83.6 82.4 60.0 67.4
Monk3 89.4 93.5 87.8 87.3 70.6 83.6 88.4 84.5 80.6 86.3 85.9
Ionosphere 82.1 96.0* 89.4 92.1 74.8 88.8 92.1 88.7 68.2 82.8 93.4
Pima 75.0 79.2 56.0 64.3 57.7 69.1 62.5 68.5 66.8 66.7 57.7
Mnist 88.6 97.6 94.1 90.1 89.7 85.6 89.3 80.6 96.0 87.0 93.7
Rank 12.6 3.2 14.2 14.3 18.4 15.4 12.7 14.3 17.4 16.2 14.1
Table 2.3: Recognition rates (%) with the smooth nearest-neighbor classifier
Data set ksdaH ksdaF ksdaB ksdaCV kdaH kdaF kdaB kdaCV kndaH kndaF kndaB kndaCV
ETH-80 83.5* 73.9 62.3 76.4 83.5* 82.8 62.3 72.9 76.2 74.2 68.2 71.2
AR database 96.6* 78.5 66.9 85.1 90.6 86.7 71.3 85.1 70.9 63.2 70.6 72.6
SPDM 84.3* 75.3 68.2 83.9* 84.3* 83.4 68.2 82.6 75.6 77.9 35.6 71.5
Monk1 90.2* 76.6 71.5 82.9 89.6 87.7 72.2 88.7 65.2 62.0 61.4 62.3
Monk2 83.3* 77.5 60.6 75.7 80.6 82.9 73.8 78.5 74.1 64.8 62.3 56.9
Monk3 94.6* 83.3 86.1 86.3 93.5 92.4 89.4 91.2 68.5 64.8 85.4 66.2
Ionosphere 94.3 84.8 84.8 86.1 94.3 86.8 80.1 86.8 80.8 82.8 77.5 78.1
Pima 80.4* 76.8 79.2 76.2 78.6 73.0 64.9 69.0 72.0 67.9 69.0 67.9
Mnist 97.8* 96.9 91.8 97.3 97.2 97.2 91.8 96.7 95.6 95.4 92.1 95.6
Rank 1.2* 9.4 14.4 6.7 2.7 4.6 15 6.9 14.2 14.7 17.6 16.1
Data set mog ksvm kpca pca lda nda apac hlda lpp rda sda
ETH-80 69.2 81.8 60.3 67.1 64.3 63.5 71.2 59.1 64.3 71.6 72.3
AR database 75.5 86.7 49.5 44.5 70.9 77.3 60.2 67.5 35.5 78.6 70.9
SPDM 73.4 84.7 75.1 77.0 56.2 50.2 81.2 53.4 50.2 59.5 69.5
Monk1 80.3 83.6 77.3 78.2 67.4 77.8 69.4 71.5 59.0 72.0 79.2
Monk2 75.9 82.6 58.6 56.7 70.6 70.6 70.4 58.3 72.0 60.0 70.6
Monk3 89.4 93.5 91.2 89.7 70.8 91.9 89.6 93.8 87.0 86.3 90.5
Ionosphere 82.1 96.0* 82.1 82.1 74.8 83.4 91.1 94.0 62.9 82.8 89.4
Pima 75.0 79.2 60.7 70.2 57.7 70.2 63.8 72.6 66.1 66.7 57.7
Mnist 88.6 97.6 94.1 90.1 89.8 86.0 89.4 82.6 96.1 87.0 93.5
Rank 11.6 2.7 15.6 14.6 17.8 13.4 13.9 14.9 18.1 14.9 11.9
Table 2.4: Recognition rates (%) with linear SVM
Data set ksdaH ksdaF ksdaB ksdaCV kdaH kdaF kdaB kdaCV kndaH kndaF kndaB kndaCV
ETH-80 83.0* 73.6 61.9 77.4 83.0* 82.2 61.9 71.3 75.6 75.2 65.6 74.6
AR database 88.1* 79.6 65.5 83.1 87.5* 86.7 69.5 83.1 79.4 75.7 72.5 78.6
SPDM 82.1 84.6* 67.5 82.3 82.1 83.6 67.5 82.6 82.2 82.9 52.7 84.0
Monk1 89.1* 88.2 50.0 86.1 84.7 89.7* 52.1 86.1 69.9 62.5 50.0 63.4
Monk2 77.1 81.5 67.1 73.8 80.1 75.2 67.1 75.1 67.1 83.1* 67.1 67.1
Monk3 95.6* 91.9 47.2 94.4 92.8 89.1 47.2 81.5 81.7 81.7 47.2 81.0
Ionosphere 93.4 86.1 82.1 96.7* 93.4 86.1 82.1 82.1 82.1 82.1 82.1 82.1
Pima 79.8* 78.6 64.9 79.8* 78.0* 75.0 64.3 72.8 64.3 64.3 64.3 64.3
Mnist 97.9 96.9 92.0 97.3 98.1* 96.7 92.0 97.2 94.7 94.3 93.3 96.2
Rank 2.8* 5.6 17.8 4.3 4.1 5.8 17.7 9.5 11.9 11.6 17.3 13.0
Data set mog ksvm kpca pca lda nda apac hlda lpp rda sda
ETH-80 69.2 81.8 65.3 60.1 65.3 61.8 68.4 68.4 62.1 71.6 67.8
AR database 75.5 86.7 42.1 66.7 79.3 69.7 67.2 70.1 44.2 78.6 79.3
SPDM 73.4 84.7* 66.7 76.5 50.3 49.0 82.1 69.3 45.5 59.5 69.0
Monk1 80.3 83.6 88.4* 67.8 65.6 66.4 67.8 68.5 44.9 72.0 66.7
Monk2 75.9 82.6 50.0 67.1 67.1 67.5 65.6 67.1 67.1 60.0 67.1
Monk3 89.4 93.5 94.4 81.3 63.9 83.3 80.6 81.9 78.5 86.3 84.7
Ionosphere 82.1 96.0 82.1 84.8 84.8 88.1 93.4 93.4 82.1 82.8 90.1
Pima 75.0 79.2 64.3 68.6 64.9 76.8 77.4 76.2 76.2 66.7 64.9
Mnist 88.6 97.6 81.0 82.2 86.9 85.9 83.1 85.4 80.1 87.0 88.2
Rank 11.5 3.3 16.1 16.1 15.8 14.6 13.6 12.5 19.0 13.8 12.9
2.5.2 KBA criterion
We now present results using the KBA criterion. We use this criterion in KDA and KSDA, with the notation KDA_K and KSDA_K indicating that the KBA criterion was used to optimize the kernel parameters.
Table 2.6: Recognition rates (%) with nearest neighbor. Bold numbers specify the top
recognition obtained with the three criteria in KSDA and KDA. An asterisk specifies
a statistical significance on the highest recognition rate.
The linear and nonlinear feature extraction methods described earlier are used with the classical RBF kernel defined earlier. In this low-dimensional space, we provide classification results using three methods: the classical nearest neighbor
Table 2.7: Recognition rates (%) with the classification method of [75].
Data set ksdaK ksdaF ksdaCV kdaK kdaF kdaCV kpca
ETH-80 84.6* 73.9 76.4 84.6* 82.8 72.9 60.3
AR database 89.6* 78.5 85.1 87.5 86.7 85.1 49.5
SPDM 84.9* 75.3 83.9 84.9* 83.4 82.6 75.0
Monk1 88.0* 76.6 82.9 87.3 87.7 88.7* 77.3
Monk2 82.9* 77.5 75.7 82.9* 82.9* 78.5 58.6
Monk3 90.5 83.3 86.3 92.6 92.4 91.2 91.2
Ionosphere 92.8* 84.8 86.1 89.1 86.8 86.8 82.1
Pima 78.6* 76.8 76.2 76.2 73.0 69.0 60.7
Data set pca lda nda apac hlda rda sda
ETH-80 67.1 64.3 63.5 71.2 59.1 71.6 72.3
AR database 44.5 70.9 77.3 60.2 67.5 78.6 70.9
SPDM 77.0 56.2 50.2 81.2 53.4 59.5 69.5
Monk1 78.2 67.4 77.8 69.4 71.5 72.0 79.2
Monk2 56.7 70.6 70.6 70.4 58.3 60.0 70.6
Monk3 89.7 70.8 91.9 89.6 93.8* 86.3 90.5
Ionosphere 82.1 74.8 83.4 91.1 94.0* 82.8 89.4
Pima 70.2 57.7 70.2 63.8 72.6 66.7 57.7
(NN) classifier, the extension of K-NN defined in [75], and a linear Support Vector Machine (SVM). The results are in Tables 2.6, 2.7 and 2.8, respectively.
From these results, it is clear that, on average, the derived KBA criterion achieves
higher classification rates than the Fisher criterion and CV. As expected, KSDA
generally yields superior results than KDA. This is due to the added flexibility on
modeling the underlying class distributions in the kernel space provided by KSDA. To
illustrate the effectiveness of the proposed criterion in KSDA, we show the smooth-
ness of the function optimized by the criterion in Fig. 2.7 for four of the data-sets.
Note how these functions can be readily optimized using gradient ascent. It is also
interesting to note that the optimal value of remains relatively constant for different
Table 2.8: Recognition rates (%) with linear SVM.
Data set ksdaK ksdaF ksdaCV kdaK kdaF kdaCV kpca
ETH-80 84.2* 73.6 77.4 84.2* 82.2 71.3 65.3
AR database 86.7* 79.6 83.1 85.3 86.7* 83.1 42.1
SPDM 84.3* 84.6* 82.3 84.3* 83.6 82.6 66.7
Monk1 87.3 88.2 86.1 87.3 89.7* 86.1 88.4*
Monk2 82.9* 81.5 73.8 82.9* 75.2 75.1 50.0
Monk3 93.5 91.9 94.4* 91.9 89.1 81.5 94.4
Ionosphere 92.6 86.1 96.7* 89.1 86.1 82.1 82.1
Pima 79.8* 78.6 79.8* 77.4 75.0 72.8 64.3
Data set pca lda nda apac hlda rda sda
ETH-80 60.1 65.3 61.8 68.4 68.4 71.6 67.8
AR database 66.7 79.3 69.7 67.2 70.1 78.6 79.3
SPDM 76.5 50.3 49.0 82.1 69.3 59.5 69.0
Monk1 67.8 65.6 66.4 67.8 68.5 72.0 66.7
Monk2 67.1 67.1 67.5 65.6 67.1 60.0 67.1
Monk3 81.3 63.9 83.3 80.6 81.9 86.3 84.7
Ionosphere 84.8 84.8 88.1 93.4 93.4 82.8 90.1
Pima 68.6 64.9 76.8 77.4 76.2 66.7 64.9
values of $H_i$. This smoothness in the change of the criterion is what allows us to find the solution with the gradient-based approach described above.
2.6 Conclusions
In this chapter, we have presented two criteria for kernel optimization in discriminant analysis. The first approach optimizes the parameters of a kernel whose function is to map the original class distributions to a space where these are optimally (w.r.t. Bayes) separated with a hyperplane. We have achieved this by selecting the kernel parameters that make the class Normal distributions most homoscedastic while keeping them separable. Extensive experiments have demonstrated that this approach achieves higher recognition rates than most other
Figure 2.7: Plots of the value of the derived criterion as a function of the kernel parameter
and the number of subclasses. From left to right and top to bottom: AR, ETH-80, Monk
1, and Ionosphere databases.
methods defined to date. We have also shown that adding the subclass divisions to
the optimization process (KSDA) allows the DA algorithm to achieve better gener-
alizations. And, we have formally defined the relationship between KDA and other variants of DA.
The second approach we have defined is directly related to the Bayes error. We first derive a function which computes the Bayes accuracy, defined as one minus the Bayes error, in the kernel space. Thus, the goal is to find that kernel representation where this Bayes accuracy is maximized. An extensive evaluation over a number of databases shows that the derived approach yields superior classification results. Moreover, when used in KSDA, the proposed criterion can accurately recover the underlying structure of the data.
CHAPTER 3
3.1 Introduction
Regression analysis has been a very active topic in machine learning and pattern recognition. In the standard regression problem, a linear or nonlinear model is estimated from the data such that the functional relationship between the dependent variables and the independent variables can be established. Of late, regression with kernel methods [96, 85] has become popular. The success of the kernel methods in regression comes from the fact that they facilitate the estimation of nonlinear functions using well-defined and well-tested approaches in, for example, computer vision [99], signal processing [94], and bioinformatics [76].
In kernel-based regression, the goal is to find a kernel mapping that converts the original nonlinear problem (defined in the original space) into a linear one (in the kernel space) [84]. In practice, this mapping is done using a pre-determined nonlinear function. Given this function, the main challenge is to find those parameters of the function that convert a nonlinear problem into a linear one. Thus, the selection of these kernel parameters is a type of model selection. This is the problem we consider
in this chapter: to define a criterion for the selection of the appropriate parameters of the kernel. If the parameters were chosen to only minimize the fitting error, we would generally have an over-fitting to the training data. As a consequence, the regressed function would not be able to estimate the testing data correctly. A classical solution is to find a good fit while keeping the complexity of the function low, e.g., using a polynomial of lower order [42]. However, if the parameters are selected to keep the complexity too low, then we will under-fit the data. In both these cases, the regressed function will have a poor generalization, i.e., a high prediction error on the testing data. In general, the most commonly used technique for the selection of the kernel parameters is k-fold cross-validation (CV) [88]. In this approach, candidate models are trained and validated on complementary subsets within the training set. The model which produces the smallest validation error is selected. Unfortunately, this method has three known major drawbacks. First, it is computationally expensive. Second, only part of the training data is used to estimate the model parameters. When doing model selection, one wants to employ the largest possible number of training samples, since this is known to yield better generalizations [63]. Third, the value of k as a parameter plays a major role in the process. Note that the value of k affects the trade-off between the fitting error and the model complexity.
An alternative to CV is Generalized CV (GCV) [37, 96], an efficient approximation to the leave-one-out CV. GCV has been efficiently applied to some model selection problems [42, 116]. However, since it approximates the leave-one-out CV, the estimated result generally has a large variance, i.e., the regressed function is highly variable.
While a single kernel may not be sufficient to describe the data, multiple kernel learning (MKL) [51, 87] has attracted much attention recently as a potential alternative. In [76], MKL is applied to Support Vector Regression (SVR). The coefficients that determine the combination of kernels are learned using a constrained quadratic programming formulation over a pre-specified set of kernel functions. Unfortunately, the selection of the kernel functions and associated parameters remains an open problem. In another approach, the regression problem is first reformulated as a probabilistic model using Bayesian inference, then the kernel parameters are selected by maximizing the marginal data likelihood. This approach has been used to define the well-known Gaussian processes for regression [104]. It has been shown [80] that the marginal likelihood has the nice property of automatically incorporating a trade-off between model fit and model complexity. However, the Bayesian formulation requires assumptions on the underlying distributions, and the results are generally computationally expensive.
In this chapter, we address the model selection problem with a novel approach. In our proposed approach, the two measures of model fit and model complexity are combined within a multiobjective optimization
(MOP) framework through the study of Pareto-optimal solutions. MOP and Pareto-optimality are specifically defined to find the global minima of several combined criteria. To this end, we will first derive a new criterion for model complexity which can be computed in the kernel space. We then study how to combine the fit and complexity criteria within MOP and derive a new approach called the modified ε-constraint method. We show that this newly derived approach achieves the lowest mean square error. We provide extensive comparisons with the state of the art in kernel methods for regression and on approaches for model selection. The results show that the proposed framework generally leads to superior performance.
The rest of this chapter is organized as follows. In Section 3.2, we derive the two new measures of model fitness and model complexity. Then, in Section 3.3, we derive a new MOP approach to do model selection. In Section 3.4, the proposed framework is applied to two kernel-based regression methods. Experimental results are given in Section 3.5.
Given a set of training samples $(x_i, y_i)$, $i = 1, \ldots, n$, generated from a joint distribution $g(x, y)$, one wants to find the regression function minimizing the expected loss, where $\mathbf{f}(x)$ is the regression function, $\mathbf{f}(x) = (f_1(x), \ldots, f_q(x))^T$, $f_i(\cdot): \mathbb{R}^p \to \mathbb{R}$ is the $i$th regression function, and $L(y, \mathbf{f}(x))$ is a given loss function, for instance, the quadratic loss $L(y, \mathbf{f}(x)) = \frac{1}{2}\|y - \mathbf{f}(x)\|_2^2 = \frac{1}{2}\sum_{i=1}^{q}(y_i - f_i(x))^2$.
3.2.1 Generalization error
Holmstrom and Koistinen [44] show that by adding noise to the training samples (both x and y), the estimation of the generalization error is asymptotically consistent, i.e., as the number of training examples approaches infinity, the estimated generalization error is equivalent to the true one. The addition of noise can be interpreted as sampling from a smoothed estimate of the data distribution.
For convenience, denote the training set of n pairs of observation and prediction as $z_i = (x_i^T, y_i^T)^T$. Assume that the training samples $z_i$ are corrupted by noise $\epsilon$ with distribution $\psi(\epsilon)$. The noise distribution is generally chosen to have zero mean and uncorrelated entries, i.e., $E[\epsilon_i] = 0$ and $E[\epsilon_i \epsilon_j] = \lambda\,\delta_{ij}$, where $\lambda$ is the variance of the noise distribution, and $\delta_{ij}$ is the delta function, with $\delta_{ij} = 1$ if $i = j$ and 0 otherwise.
We consider the following steps for generating a new training sample $\tilde{z}$ by introducing additive noise (a minimal sketch of this procedure is given below):
1) Randomly select a training sample $z_i$.
2) Draw a noise vector $\epsilon_i$ from $\psi(\epsilon)$.
3) Set $\tilde{z} = z_i + \epsilon_i$.
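As an illustration, the three steps above can be implemented in a few lines (Gaussian noise is one common choice of $\psi$; the function name and use of a fixed seed are ours):

```python
import numpy as np

def jitter(Z, lam, m, seed=0):
    # Generate m noisy samples by the three steps above: pick a training
    # sample z_i at random, draw zero-mean Gaussian noise with variance
    # lam per entry, and add it. Z is an n x d array of stacked z_i rows.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(Z), size=m)
    return Z[idx] + rng.normal(scale=np.sqrt(lam), size=(m, Z.shape[1]))
```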
Thus, the distribution of a particular sample $\tilde{z}$ generated from the training sample $z_i$ is $\psi(\tilde{z} - z_i)$, and the distribution over the whole training set is
$$\tilde{g}(\tilde{z}) = \frac{1}{n}\sum_{i=1}^{n}\psi(\tilde{z} - z_i). \qquad (3.5)$$
The above result can be viewed as a kernel density estimator of the true distribution of the data $g(z)$ [44], where the distribution of the noise $\psi(\cdot)$ is the kernel function. Under this estimate, the generalization error can be written as
$$E = \frac{1}{n}\sum_{i=1}^{n}\int L(z_i + \epsilon_i)\,\psi(\epsilon_i)\, d\epsilon_i. \qquad (3.7)$$
Expanding the loss in a Taylor series about each training sample,
$$L(z + \epsilon) = L(z) + \sum_{i=1}^{m}\frac{\partial L(z)}{\partial z_i}\,\epsilon_i + \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\frac{\partial^2 L(z)}{\partial z_i\,\partial z_j}\,\epsilon_i\,\epsilon_j + O(\epsilon^3). \qquad (3.8)$$
Assuming that the noise amplitude is small, the higher order term $O(\epsilon^3)$ can be neglected. Since the noise has zero mean and uncorrelated entries, only the diagonal second-order terms survive the expectation in (3.7).
Let $L(z)$ be the quadratic loss, i.e., $L(z) = \frac{1}{2}\sum_{i=1}^{q}(y_i - f_i(x))^2$. Then,
$$\sum_{j=1}^{m}\frac{\partial^2 L(z_i)}{\partial z_j^2} = \frac{1}{2}\sum_{j=1}^{m}\sum_{k=1}^{q}\frac{\partial^2 (y_{ik} - f_k(x_i))^2}{\partial z_j^2} = \frac{1}{2}\sum_{k=1}^{q}\left(\sum_{j=1}^{p}\frac{\partial^2 (y_{ik} - f_k(x_i))^2}{\partial x_{ij}^2} + \sum_{j=1}^{q}\frac{\partial^2 (y_{ik} - f_k(x_i))^2}{\partial y_{ij}^2}\right)$$
$$= \sum_{k=1}^{q}\left[\sum_{j=1}^{p}\left(\frac{\partial f_k(x_i)}{\partial x_{ij}}\right)^2 + (f_k(x_i) - y_{ik})\sum_{j=1}^{p}\frac{\partial^2 f_k(x_i)}{\partial x_{ij}^2} + 1\right]$$
$$= \sum_{k=1}^{q}\sum_{j=1}^{p}\left[\left(\frac{\partial f_k(x_i)}{\partial x_{ij}}\right)^2 + (f_k(x_i) - y_{ik})\,\frac{\partial^2 f_k(x_i)}{\partial x_{ij}^2}\right] + q, \qquad (3.10)$$
where $y_{ij}$ is the $j$th entry of vector $y_i$ and $x_{ij}$ is the $j$th entry of vector $x_i$. Substituting (3.10) into the expectation of (3.8) yields
$$E = E_f + \lambda\, E_c, \qquad (3.11)$$
with
$$E_f = \frac{1}{2n}\sum_{i=1}^{n}\|y_i - \mathbf{f}(x_i)\|_2^2 \qquad (3.12)$$
and
$$E_c = \frac{1}{2n}\sum_{i=1}^{n}\sum_{k=1}^{q}\sum_{j=1}^{p}\left[\left(\frac{\partial f_k(x_i)}{\partial x_{ij}}\right)^2 + (f_k(x_i) - y_{ik})\,\frac{\partial^2 f_k(x_i)}{\partial x_{ij}^2}\right] + \mathrm{const}, \qquad (3.13)$$
where the constant term does not depend on $\mathbf{f}$ and can be ignored in the optimization.
Therefore, the generalization error consists of two terms. The first term Ef mea-
sures the discrepancy between the training data and the estimated model, i.e., the
model fit. The second term Ec measures the roughness of the estimated function pro-
vided by the first and second derivatives of the function, i.e., the model complexity. It
controls the smoothness of the function to prevent it from overfitting. The parameter $\lambda$ controls the trade-off between the model fit and model complexity.
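To make the decomposition concrete, the sketch below evaluates $E_f$ and $E_c$ for an arbitrary scalar-output model using central finite differences in place of the analytic derivatives, and drops the additive constant in (3.13). This is an illustrative approximation under those assumptions, not the closed-form kernel expressions derived next:

```python
import numpy as np

def fit_and_complexity(f, X, Y, eps=1e-4):
    # E_f of (3.12) and E_c of (3.13) (up to its constant) for a scalar
    # model f: R^p -> R, differentiating only in x as in (3.13).
    n, p = X.shape
    pred = np.array([f(x) for x in X])
    E_f = 0.5 * np.mean((Y - pred) ** 2)
    E_c = 0.0
    for i in range(n):
        for j in range(p):
            e = np.zeros(p); e[j] = eps
            fp, fm, f0 = f(X[i] + e), f(X[i] - e), pred[i]
            d1 = (fp - fm) / (2 * eps)           # df/dx_j
            d2 = (fp - 2 * f0 + fm) / eps ** 2   # d2f/dx_j^2
            E_c += d1 ** 2 + (f0 - Y[i]) * d2
    return E_f, E_c / (2 * n)
```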
In order to minimize the generalization error $E$, we need to minimize both $E_f$ and $E_c$. However, due to the bias and variance trade-off [42], a decrease in the model fit term may result in an increase in the model complexity and vice-versa. The regularization parameter $\lambda$ may achieve a balance between the model fit and complexity to some extent; however, there are two limitations to selecting $\lambda$ to do model selection. First, $\lambda$ would typically be selected with CV, which suffers from the drawbacks we discussed earlier. Second, note that our goal is to find those solutions for which we cannot further decrease one criterion without increasing the other. This means that even when an appropriate $\lambda$ is selected, minimizing $E$ is not directly related to our goal. We formalize and address this goal in Section 3.3. We first derive the kernel models for the model fit $E_f$ and model complexity $E_c$.
We can rewrite the above model as $f_i(x) = w_i^T x$, $i = 1, \ldots, q$. In kernel methods, the Representer's Theorem allows us to write $w_i = \Phi(X)\alpha_i$, where $\alpha_i$ is a coefficient vector. Hence, we get $f_i(x) = \alpha_i^T \mathbf{k}(x)$, where $\mathbf{k}(x) = \Phi(X)^T \phi(x) = (k(x_1, x), \ldots, k(x_n, x))^T$.
We now derive solutions of $E_c$ for two of the most used kernel functions, the Radial Basis Function (RBF) and the polynomial kernels.
For the RBF kernel, the first-derivative term can be written as $\sum_{j=1}^{p}\left(\frac{\partial f_l(x_i)}{\partial x_{ij}}\right)^2 = \alpha_l^T R_i \alpha_l$, where $R_i = \frac{1}{\sigma^4}\, W_i W_i^T$ and $W_i$ is an $n \times p$ matrix with the $j$th column equal to
$$\left(\exp\left(-\frac{\|x_1 - x_i\|^2}{2\sigma^2}\right)(x_{1j} - x_{ij}),\; \ldots,\; \exp\left(-\frac{\|x_n - x_i\|^2}{2\sigma^2}\right)(x_{nj} - x_{ij})\right)^T.$$
For the second-derivative term, let $p_{ij}$ be an $n \times 1$ vector whose $m$th ($m \neq i$) entry is
$$\frac{1}{\sigma^2}\exp\left(-\frac{\|x_m - x_i\|^2}{2\sigma^2}\right)\left(\frac{(x_{mj} - x_{ij})^2}{\sigma^2} - 1\right),$$
and whose $i$th entry is 0. Then $\sum_{j=1}^{p}\frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2} = \alpha_l^T p_i$, where $p_i = \sum_{j=1}^{p} p_{ij}$. Thus,
$$(f_l(x_i) - y_{il})\sum_{j=1}^{p}\frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2} = (\alpha_l^T k_i - y_{il})\,\alpha_l^T p_i = \alpha_l^T (k_i\, p_i^T)\,\alpha_l - y_{il}\, p_i^T \alpha_l,$$
where $k_i$ denotes the $i$th column of the kernel matrix $K$.
Using the above results, we can define the roughness penalty function in the RBF kernel space as
$$E_c = \sum_{l=1}^{q}\left(\alpha_l^T M \alpha_l - q_l^T \alpha_l\right) + \mathrm{const}, \qquad (3.16)$$
where $M = \frac{1}{2n}\sum_{i=1}^{n}\left(R_i + k_i\, p_i^T\right)$ and $q_l = \frac{1}{2n}\sum_{i=1}^{n} y_{il}\, p_i$.
Figure 3.1: The two plots in this figure show the contradiction between the RSS and the curvature measure with respect to: (a) the kernel parameter $\sigma$, and (b) the regularization parameter $\lambda$ in Kernel Ridge Regression. The Boston Housing data-set [7] is used in this example. Note that in both cases, while one criterion increases, the other decreases. Thus, a compromise between the two criteria ought to be determined.
For the polynomial kernel $k(x_i, x_j) = (x_i^T x_j + 1)^d$, the derivatives are
$$\frac{\partial f_l(x_i)}{\partial x_{ij}} = \frac{\partial \sum_{m=1}^{n}\alpha_{lm}\,(x_m^T x_i + 1)^d}{\partial x_{ij}} = \sum_{m=1,\, m\neq i}^{n}\alpha_{lm}\, d\,(x_m^T x_i + 1)^{d-1} x_{mj} + 2\,\alpha_{li}\, d\,(x_i^T x_i + 1)^{d-1} x_{ij}.$$
Hence $\sum_{j=1}^{p}\left(\frac{\partial f_l(x_i)}{\partial x_{ij}}\right)^2 = \alpha_l^T B_i \alpha_l$, where $B_i = d^2\, C_i C_i^T$, and $C_i$ is an $n \times p$ matrix with the $j$th column equal to
$$\left((x_1^T x_i + 1)^{d-1}x_{1j},\; \ldots,\; 2(x_i^T x_i + 1)^{d-1}x_{ij},\; \ldots,\; (x_n^T x_i + 1)^{d-1}x_{nj}\right)^T.$$
Similarly, $\frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2} = \alpha_l^T g_{ij}$, where $g_{ij}$ is an $n \times 1$ vector whose $m$th ($m \neq i$) entry is $d(d-1)(x_m^T x_i + 1)^{d-2} x_{mj}^2$, and whose $i$th entry is $d(x_i^T x_i + 1)^{d-2}\left[3(d-1)x_{ij}^2 + 2(x_i^T x_i + 1)\right]$. Then, $\sum_{j=1}^{p}\frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2} = \alpha_l^T g_i$, where $g_i = \sum_{j=1}^{p} g_{ij}$. Thus,
$$(f_l(x_i) - y_{il})\sum_{j=1}^{p}\frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2} = (\alpha_l^T k_i - y_{il})\,\alpha_l^T g_i = \alpha_l^T (k_i\, g_i^T)\,\alpha_l - y_{il}\, g_i^T \alpha_l.$$
Using the derivations above, the roughness function for the polynomial kernel can be written as
$$E_c = \sum_{l=1}^{q}\left(\alpha_l^T N \alpha_l - u_l^T \alpha_l\right) + \mathrm{const}, \qquad (3.17)$$
where $N = \frac{1}{2n}\sum_{i=1}^{n}\left(B_i + k_i\, g_i^T\right)$ and $u_l = \frac{1}{2n}\sum_{i=1}^{n} y_{il}\, g_i$.
The complexity measure derived above is defined with respect to the derivatives of the regressed function $\mathbf{f}(x)$. A commonly seen alternative in the literature is to use the norm of the regression function instead, the $L_2$ norm in the reproducing kernel Hilbert space being the most common choice. This section provides a theoretical comparison between the approach derived in this chapter and this classical $L_2$ norm alternative. In particular, we show that the $L_2$ norm does not penalize the high frequencies of the regression function, whereas the proposed measure does, thus favoring smoother functions.
To formally prove the above result, we write the generalized Fourier series of $f(x)$,
$$f(x) = \sum_{k=0}^{\infty} a_k\, \phi_k(x),$$
where $\{\phi_k(x)\}_{k=0}^{\infty}$ forms a complete orthonormal basis and the $a_k$ are the corresponding coefficients. A commonly used complete orthogonal basis is $\{\sin kx, \cos kx\}_{k=0}^{\infty}$ in $[-\pi, \pi]$, with $k$ the index of the frequency component. Using this basis set, $f(x)$ can be written as
$$f(x) = a_0 + \sum_{k=1}^{\infty}\left(a_k \sin kx + b_k \cos kx\right). \qquad (3.18)$$
Let $\|f\|_H$ be the function norm defining the reproducing kernel Hilbert space. Then the $L_2$ norm of $f$ is
$$\|f\|_H^2 = \int_{-\pi}^{\pi} |f(x)|^2\, dx = \int_{-\pi}^{\pi}\left(a_0 + \sum_{k=1}^{\infty}(a_k \sin kx + b_k \cos kx)\right)^2 dx = 2\pi\, a_0^2 + \pi\sum_{k=1}^{\infty}\left(a_k^2 + b_k^2\right). \qquad (3.19)$$
Note that in this case, all the coefficients are weighted equally, regardless of the frequency component.
The complexity measure derived in the present chapter and given in (3.13) can be reformulated in the continuous, one-dimensional case as
$$E_c = \int\left[\left(\frac{\partial f(x)}{\partial x}\right)^2 + (f(x) - y)\,\frac{\partial^2 f(x)}{\partial x^2}\right] dx. \qquad (3.20)$$
Moreover, remember from (3.11) that the generalization error $E$ can be expressed as $E = E_f + \lambda E_c$. Compared to the $L_2$ norm result shown in (3.19), the complexity measure (3.21) of the proposed approach penalizes the higher frequency components of the regressed function. This is due to the square of the index of the frequency component seen in (3.21). By emphasizing lower frequencies, the proposed criterion will generally select smoother functions. For completeness, we also derive the explicit equation of the $L_2$ norm of the regression function $\mathbf{f}$ in the kernel space. This is given by
$$\|\mathbf{f}\|_H^2 = \sum_{i=1}^{q}\|f_i\|_H^2 = \sum_{i=1}^{q}\sum_{j=1}^{n}\sum_{k=1}^{n}\alpha_{ij}\,\alpha_{ik}\,\langle \phi(x_j), \phi(x_k)\rangle = \sum_{i=1}^{q}\sum_{j=1}^{n}\sum_{k=1}^{n}\alpha_{ij}\,\alpha_{ik}\, k(x_j, x_k) = \sum_{i=1}^{q}\alpha_i^T K \alpha_i. \qquad (3.22)$$
3.3 Multiobjective Optimization
Our goal is to simultaneously minimize the two criteria, $E_f$ and $E_c$, derived in the preceding section. Of course, in general, the global minima of these two functions are not the same. For instance, a decrease in the fitting error may lead to an increase in the roughness of the function, and vice-versa. This trade-off is depicted in Figure 3.1. In the plots in this figure, we show the behavior of the two criteria with respect to their corresponding parameters, i.e., the kernel parameter $\sigma$ and the regularization parameter $\lambda$. As can be observed in the figure, the criteria do not share a common minimum. This motivates the use of a multiobjective optimization approach.
3.3.1 Pareto-Optimality
Multiobjective optimization is concerned with the simultaneous optimization of more than one objective function. More formally, MOP is defined over a feasible set $S$ of decision vectors. Denote the vector of objective functions by $z = u(\theta) = (u_1(\theta), u_2(\theta), \ldots, u_k(\theta))^T$, with $\theta \in S$ the decision vector. The goal of MOP is to find that $\theta$ which simultaneously minimizes all $u_j(\cdot)$. If all functions shared a common minimum, the problem would be trivial. In general, however, the objective functions contradict one another. This means that minimizing
one function can increase the value of the others. Hence, a compromise solution is needed to attain a maximal agreement of all the objective functions [66]. This compromise is formalized by the notion of Pareto-optimality: a Pareto-optimal solution is one where none of the components of the objective vector can be improved without deteriorating one or more of the others. In most problems, there will be many Pareto-optimal solutions.
One classical method to find the Pareto-optimal solutions is the ε-constraint approach [39]. In this case, one of the objective functions is optimized while the others are converted into constraints upper-bounded by values $\epsilon_j$,
Figure 3.2: Here we show a case of two objective functions. $u(S)$ represents the set of all the objective vectors, with the Pareto frontier colored in red. The Pareto-optimal solution can be determined by minimizing $u_1$ given that $u_2$ is upper-bounded by $\epsilon_2$.
as follows,
$$\arg\min_{\theta}\ u_l(\theta) \quad \text{subject to}\quad u_j(\theta) \leq \epsilon_j,\quad j = 1, \ldots, k,\ j \neq l, \qquad (3.24)$$
$\theta \in S$, where $l \in \{1, \ldots, k\}$.
Figure 3.2 demonstrates the idea behind this approach. In this figure, we show the case of two objective functions, where $u_1$ is minimized subject to an upper bound on $u_2$.
Definition 6. A decision vector $\theta^* \in S$ is weakly Pareto-optimal if there does not exist another decision vector $\theta \in S$ such that $u_i(\theta) < u_i(\theta^*)$ for all $i = 1, \ldots, k$.
From the above definition, we can see that the Pareto-optimal set is a subset of the weakly Pareto-optimal set and that a weakly Pareto-optimal solution need not be Pareto-optimal.
It has been shown [66] that the solution of the ε-constraint method defined in (3.24) is weakly Pareto-optimal. This means that the solution to (3.24) cannot be guaranteed to be Pareto-optimal.
In the following, we propose a modified version of this method and prove that its solutions are Pareto-optimal. The main idea of our approach is to reformulate the constraints in (3.24) as equalities. This can be achieved if the bounds $\epsilon_j$ are multiplied by scalars $h_j$ smaller than or equal to one. Let $h = (h_1, \ldots, h_{l-1}, h_{l+1}, \ldots, h_k)^T$. Then, the modified ε-constraint method is given by
$$\arg\min_{\theta,\, h}\ u_l(\theta) + s\sum_{j=1,\, j\neq l}^{k} h_j \quad \text{subject to}\quad u_j(\theta) = h_j\,\epsilon_j,\quad 0 \leq h_j \leq 1,\quad j \neq l, \qquad (3.25)$$
$\theta \in S$, where $s$ is a positive constant. We can now prove the Pareto-optimality of (3.25).
Theorem 7. Select a small scalar $s$ satisfying $s\sum_{j=1,\, j\neq l}^{k} h_j \leq u_l(\theta) - u_l(\theta^*)$, where $\theta^* \in S$ and $h$ are the solutions of (3.25). Then, $\theta^*$ is Pareto-optimal for any given $\epsilon$.
Proof. Let $\theta^* \in S$ and $h^*$ be a solution of (3.25). Since $s\sum_{j=1,\, j\neq l}^{k} h_j^* \leq u_l(\theta) - u_l(\theta^*)$, $\theta^*$ minimizes $u_l$ over the feasible set. Let us assume that $\theta^*$ is not Pareto-optimal. In this case, there exists a vector $\theta^o \in S$ such that $u_i(\theta^o) \leq u_i(\theta^*)$ for all $i = 1, \ldots, k$ and $u_j(\theta^o) < u_j(\theta^*)$ for at least one index $j$. Since $u_i(\theta) = h_i\,\epsilon_i$ at any feasible point, this yields the inequalities $h_i^o\,\epsilon_i \leq h_i^*\,\epsilon_i$, with at least one strict inequality $h_j^o\,\epsilon_j < h_j^*\,\epsilon_j$. Canceling out $\epsilon_i$ in each of the inequalities and taking their sum yields $\sum_{j=1,\, j\neq l}^{k} h_j^o < \sum_{j=1,\, j\neq l}^{k} h_j^*$. This contradicts the fact that the solution to (3.25) minimizes $\sum_{j=1,\, j\neq l}^{k} h_j$.
We can demonstrate the utility of this modified ε-constraint method in the following two examples. In our first example, the objective functions are given by
$$u_1(x) = \begin{cases} 1 & x \leq 1 \\ x^2 & \text{otherwise} \end{cases}$$
and $u_2(x) = (x - 5)^2$. In our second example, the two functions are given by $u_1(x) = 1 - e^{-(x+1)^2}$ and
$$u_2(x) = \begin{cases} 1 - e^{-(x-2)^2} & x \leq 0.5 \\ 1 - e^{-2.25} & \text{otherwise.} \end{cases}$$
In both these examples, we compare the performance of the proposed modified ε-constraint approach and the original ε-constraint method. This is illustrated in Figure 3.3. In
these figures, the blue stars denote the objective vectors and the red circles represent
the solution vectors given by each of the two methods. We see that in Figure 3.3a and
3.3c, the original ε-constraint method includes the weakly Pareto-optimal solutions,
whereas in Figure 3.3b and 3.3d the proposed modified approach provides the Pareto-
optimal solutions.
Using the solution defined above, we can formulate the parameter optimization problem as follows,
$$\arg\min_{\theta,\, h}\ E_f(\theta) + s\, h \quad \text{subject to}\quad E_c(\theta) = h\,\epsilon,\quad 0 \leq h \leq 1, \qquad (3.26)$$
where $s > 0$.
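For illustration, problem (3.26) can be handed to an off-the-shelf constrained solver. The sketch below uses SLSQP from scipy; the function names and the choice of solver are our assumptions, not prescribed by the text:

```python
import numpy as np
from scipy.optimize import minimize

def modified_eps_constraint(E_f, E_c, theta0, eps, s=1e-3):
    # Sketch of (3.26): minimize E_f(theta) + s*h subject to
    # E_c(theta) = h*eps and 0 <= h <= 1. E_f and E_c are callables
    # of the parameter vector theta; eps and s are given constants.
    x0 = np.append(theta0, 0.5)                      # decision vector (theta, h)
    obj = lambda x: E_f(x[:-1]) + s * x[-1]
    cons = {"type": "eq", "fun": lambda x: E_c(x[:-1]) - x[-1] * eps}
    bounds = [(None, None)] * len(theta0) + [(0.0, 1.0)]
    res = minimize(obj, x0, method="SLSQP", constraints=[cons], bounds=bounds)
    return res.x[:-1], res.x[-1]                     # optimal theta and h
```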
Note that given different values of $\epsilon$, we may have different Pareto-optimal solutions. Hence, our next goal is to define a mechanism to determine an appropriate value for $\epsilon$; this is done in (3.27), where $z_f$ and $z_c$ are the ideal (individually minimal) values of $E_f(\theta)$ and $E_c(\theta)$, respectively, and $w_f$, $w_c$ are the weights associated to each of the objective functions. The incorporation of these weights can drive the optimization to favor one objective function over the other. If $E_f(\theta^*)$ (or $E_c(\theta^*)$) is close to its ideal value $z_f$ ($z_c$), then $w_f$ ($w_c$) should be relatively small. But if $E_f(\theta^*)$ ($E_c(\theta^*)$) is far apart from its ideal value $z_f$ ($z_c$), then $w_f$ ($w_c$) should be relatively large.
Figure 3.3: Comparison between the proposed modified and the original ε-constraint methods. We have used * to indicate the objective vectors and o to specify the solution vectors. Solutions given by (a) the ε-constraint method and (b) the proposed modified ε-constraint approach on the first example, and (c) the ε-constraint method and (d) the modified ε-constraint approach on the second example. Note that the proposed approach identifies the Pareto-frontier, while the original algorithm identifies weakly Pareto-optimal solutions, since its solution vectors go beyond the Pareto-frontier.
Algorithm 3.1 Modified ε-constraint algorithm
Input: Training set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, initial values $\theta^0$, $h^0$, $\epsilon^0$, and $s$.
1. Calculate the ideal vector point $(z_f, z_c)$.
2. Specify the weights $w_f$ and $w_c$ using (3.28).
3. Obtain $\epsilon$ using (3.27).
4. Obtain $\theta$ using (3.26).
Return: The optimal model parameter $\theta$.
The weights are given by
$$w_f = |E_f(\theta^0) - z_f|^2, \qquad w_c = |E_c(\theta^0) - z_c|^2, \qquad (3.28)$$
where $\theta^0$ is the initial parameter value.
Thus far, we have derived a MOP approach for model selection based on Pareto-optimality. The most pressing question for us is to show that this derived solution yields lower prediction errors than simpler, more straightforward approaches. Two such criteria are the sum and the product of the two terms to be minimized [113], given by
$$E_s(\theta) = E_f(\theta) + \gamma\, E_c(\theta) \qquad (3.29)$$
and
$$E_p(\theta) = E_f(\theta)\, E_c(\theta)^{\delta}, \qquad (3.30)$$
where $\gamma$ and $\delta$ are regularization parameters that need to be selected. Note that minimizing (3.30) is equivalent to minimizing $\log E_f(\theta) + \delta\,\log E_c(\theta)$, which is the logarithm of (3.30). We could use cross-validation to select the regularization parameters, but this inherits the drawbacks of CV discussed earlier. A comparison of these simpler optimization approaches with the proposed approach will be given in the experiments section.
Let us now derive two kernel-based regression approaches using the kernels and MOP criteria derived above. In particular, we use our derived results in Kernel Ridge Regression (KRR) and Kernel Principal Component Regression (KPCR).
Ridge regression (RR) is a penalized version of the ordinary least squares (OLS) solution. More specifically, RR regularizes the OLS solution with a penalty on the norm of the weight vector. This regularization is used to avoid overfitting. Formally, RR is defined as
$$w_i = (XX^T + \lambda I_p)^{-1} X y_i, \quad i = 1, \ldots, q,$$
where $X = (x_1, \ldots, x_n)$, $I_p$ is the $p \times p$ identity matrix, $y_i = (y_{1i}, \ldots, y_{ni})^T$, and $\lambda$ is the regularization parameter.
We can now extend the above solution using the kernel trick. The resulting method, Kernel Ridge Regression (KRR), has the solution
$$\alpha_i = (K + \lambda I_n)^{-1} y_i, \quad i = 1, \ldots, q. \qquad (3.33)$$
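The closed form (3.33) is a one-liner in practice. A minimal sketch, assuming the kernel matrices have been precomputed:

```python
import numpy as np

def train_krr(K, Y, lam):
    # KRR coefficients of (3.33): alpha_i = (K + lam I)^{-1} y_i.
    # Y is n x q, so the returned columns are the alpha_i vectors.
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), Y)

def predict_krr(K_test_train, alpha):
    # Predictions for test samples, given their kernel values against
    # the training samples (an n_test x n matrix).
    return K_test_train @ alpha
```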
In KRR, there are two parameters to optimize: the kernel parameter (e.g., $\sigma$ in the RBF kernel) and the regularization parameter $\lambda$. In the following, we derive the gradients needed to optimize both within the proposed framework. Since both the residual sum of squares term $E_R$ and the curvature term $E_c$ are differentiable with respect to these parameters, a gradient-based technique can be employed. We start with the derivations for the RBF kernel. In this case, we have
$$\frac{\partial E_R}{\partial\sigma} = \frac{\partial \sum_{i=1}^{q}(y_i - K\alpha_i)^T(y_i - K\alpha_i)}{\partial\sigma} = -2\sum_{i=1}^{q}(y_i - K\alpha_i)^T\frac{\partial (K\alpha_i)}{\partial\sigma} = -2\sum_{i=1}^{q}(y_i - K\alpha_i)^T\left(\frac{\partial K}{\partial\sigma}\,\alpha_i + K\,\frac{\partial\alpha_i}{\partial\sigma}\right),$$
where $\frac{\partial K}{\partial\sigma} = \frac{1}{\sigma^3}\, K \circ D$, $\circ$ defines the Hadamard product of two matrices of the same dimensions, i.e., $(A \circ B)_{ij} = A_{ij} B_{ij}$, with $A_{ij}$ denoting the $(i,j)$th entry of matrix $A$, $D = \left[\|x_i - x_j\|^2\right]_{i,j=1,\ldots,n}$ is the matrix of pairwise squared sample distances, and
$$\frac{\partial\alpha_i}{\partial\sigma} = \frac{\partial (K + \lambda I_n)^{-1}}{\partial\sigma}\, y_i = -(K + \lambda I_n)^{-1}\,\frac{\partial K}{\partial\sigma}\,(K + \lambda I_n)^{-1} y_i = -(K + \lambda I_n)^{-1}\,\frac{\partial K}{\partial\sigma}\,\alpha_i.$$
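The Hadamard-product expression for $\partial K/\partial\sigma$ is easy to verify numerically. A small self-contained check (illustrative code, not from the original text):

```python
import numpy as np

def rbf_K_and_grad(X, sigma):
    # RBF kernel matrix K = exp(-D / (2 sigma^2)) and its derivative
    # dK/dsigma = (1/sigma^3) * K o D, with D the squared-distance matrix.
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D / (2 * sigma ** 2))
    return K, K * D / sigma ** 3

# Quick finite-difference check of the analytic derivative.
X = np.random.default_rng(0).normal(size=(5, 3))
K1, dK = rbf_K_and_grad(X, 1.0)
K2, _ = rbf_K_and_grad(X, 1.0 + 1e-6)
assert np.allclose((K2 - K1) / 1e-6, dK, atol=1e-4)
```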
And,
$$\frac{\partial E_c}{\partial\sigma} = \frac{\partial \sum_{l=1}^{q}\left(\alpha_l^T M \alpha_l - q_l^T \alpha_l\right)}{\partial\sigma} = \frac{1}{2n}\sum_{i=1}^{n}\sum_{l=1}^{q}\left[\frac{\partial (\alpha_l^T R_i \alpha_l)}{\partial\sigma} + \frac{\partial (k_i^T \alpha_l)}{\partial\sigma}\, p_i^T \alpha_l + (k_i^T \alpha_l - y_{il})\,\frac{\partial (p_i^T \alpha_l)}{\partial\sigma}\right],$$
where
$$\frac{\partial (\alpha_l^T R_i \alpha_l)}{\partial\sigma} = \frac{2\,\alpha_l^T W_i \frac{\partial W_i}{\partial\sigma}^T \alpha_l}{\sigma^4} - \frac{4\,\alpha_l^T R_i \alpha_l}{\sigma},$$
$\frac{\partial W_i}{\partial\sigma}$ is an $n \times p$ matrix whose $(j,k)$th entry is $\frac{\|x_j - x_i\|^2}{\sigma^3}\exp\left(-\frac{\|x_j - x_i\|^2}{2\sigma^2}\right)(x_{jk} - x_{ik})$, and
$$\frac{\partial (k_i^T \alpha_l)}{\partial\sigma} = \frac{\partial k_i^T}{\partial\sigma}\,\alpha_l + k_i^T\,\frac{\partial\alpha_l}{\partial\sigma}, \qquad \frac{\partial (p_i^T \alpha_l)}{\partial\sigma} = \frac{\partial p_i^T}{\partial\sigma}\,\alpha_l + p_i^T\,\frac{\partial\alpha_l}{\partial\sigma},$$
where $\frac{\partial k_i}{\partial\sigma}$ is the $i$th column of $\frac{\partial K}{\partial\sigma}$ and $\frac{\partial p_i}{\partial\sigma} = \sum_{j=1}^{p}\frac{\partial p_{ij}}{\partial\sigma}$.
Similarly, for the regularization parameter $\lambda$,
$$\frac{\partial E_R}{\partial\lambda} = -2\sum_{i=1}^{q}(y_i - K\alpha_i)^T K\,\frac{\partial\alpha_i}{\partial\lambda},$$
where $\frac{\partial\alpha_i}{\partial\lambda} = \frac{\partial (K + \lambda I_n)^{-1}}{\partial\lambda}\, y_i = -(K + \lambda I_n)^{-1}(K + \lambda I_n)^{-1} y_i = -(K + \lambda I_n)^{-1}\alpha_i$. And,
$$\frac{\partial E_c}{\partial\lambda} = \frac{\partial \sum_{l=1}^{q}\left(\alpha_l^T M \alpha_l - q_l^T \alpha_l\right)}{\partial\lambda} = \sum_{l=1}^{q}\left(2\,\alpha_l^T M - q_l^T\right)\frac{\partial\alpha_l}{\partial\lambda}.$$
When using the polynomial kernel, we cannot employ a gradient descent technique for finding the optimal value of $d$, because this parameter is discrete. Thus, we will have to try all possible discrete values of $d$ (within a given range) and select the degree yielding the smallest error. The derivations of $E_f$ with respect to $\lambda$ are the same for any kernel, and $\frac{\partial E_c}{\partial\lambda} = \sum_{l=1}^{q}\left(2\,\alpha_l^T N - u_l^T\right)\frac{\partial\alpha_l}{\partial\lambda}$.
We now turn to principal component regression. The regression problem is well studied when there are no collinearities (i.e., close to linear relationships among variables), but special algorithms are needed to deal with collinearities in the explanatory variables. Instead of using the original predictors, principal component regression (PCR) regresses on the principal components of the data. By discarding the principal components with small variances, a more stable estimate of the coefficients $\{w_i\}_{i=1,\ldots,q}$ can be obtained. In this way, the large variances of $\{w_i\}_{i=1,\ldots,q}$ caused by collinearities are avoided. Formally,
$$w_i = \sum_{j=1}^{m}\frac{1}{l_j}\, a_j a_j^T X y_i, \quad i = 1, \ldots, q, \qquad (3.34)$$
where $a_j$ is the eigenvector of the covariance matrix associated to the $j$th largest eigenvalue $l_j$.
The above formulation can once again be computed in the kernel space as
$$\alpha_i = \sum_{j=1}^{m}\frac{1}{\lambda_j}\, v_j v_j^T\, y_i, \quad i = 1, \ldots, q, \qquad (3.35)$$
where $v_j$ is the eigenvector of the kernel matrix $K$ associated to its $j$th largest eigenvalue $\lambda_j$. This defines Kernel Principal Component Regression (KPCR).
In KPCR, we need to optimize two parameters: the kernel parameter and the number of principal components $m$. Unfortunately, the criterion with respect to $m$ is non-differentiable, and testing all possible values for $m$ is computationally expensive, because the range of $m$ is dependent on the size of the training set. Here, we present an alternative approach to select the optimal subset. The basic idea is to use the percentage of the variance $r$ to determine the number of principal components, $r = \frac{\sum_{i=1}^{m}\lambda_i}{\sum_{i=1}^{t}\lambda_i}$, where $t$ is the rank of $K$. Note that $r$ can now change continuously.
87
Since KPCR differs from KRR in the solution vectors $\{\boldsymbol{\alpha}_i\}_{i=1,\dots,q}$, we need to derive the corresponding derivatives,
$$\frac{\partial \boldsymbol{\alpha}_i}{\partial \sigma} = \frac{\partial \sum_{j=1}^{m}\frac{1}{\lambda_j}\mathbf{v}_j\mathbf{v}_j^T}{\partial \sigma}\,\mathbf{y}_i = \sum_{j=1}^{m}\left(-\frac{1}{\lambda_j^2}\frac{\partial \lambda_j}{\partial \sigma}\,\mathbf{v}_j\mathbf{v}_j^T + \frac{1}{\lambda_j}\frac{\partial \mathbf{v}_j}{\partial \sigma}\mathbf{v}_j^T + \frac{1}{\lambda_j}\mathbf{v}_j\frac{\partial \mathbf{v}_j^T}{\partial \sigma}\right)\mathbf{y}_i,$$
where $\frac{\partial \lambda_j}{\partial \sigma} = \mathbf{v}_j^T \frac{\partial K}{\partial \sigma}\mathbf{v}_j$, $\frac{\partial \mathbf{v}_j}{\partial \sigma} = (K - \lambda_j I_d)^{+}\frac{\partial K}{\partial \sigma}\mathbf{v}_j$ [59], and $A^{+}$ is the pseudoinverse of the matrix $A$.
The derivative with respect to $r$ does not have a closed form, so we approximate it numerically. Expanding $\boldsymbol{\alpha}_i(r)$ around $r$ gives
$$\boldsymbol{\alpha}_i(r + \Delta r) = \boldsymbol{\alpha}_i(r) + \Delta r\,\boldsymbol{\alpha}_i'(r) + \frac{\Delta r^2}{2!}\boldsymbol{\alpha}_i''(r) + \frac{\Delta r^3}{3!}\boldsymbol{\alpha}_i'''(r) + O(\Delta r^4),$$
$$\boldsymbol{\alpha}_i(r - \Delta r) = \boldsymbol{\alpha}_i(r) - \Delta r\,\boldsymbol{\alpha}_i'(r) + \frac{\Delta r^2}{2!}\boldsymbol{\alpha}_i''(r) - \frac{\Delta r^3}{3!}\boldsymbol{\alpha}_i'''(r) + O(\Delta r^4).$$
Subtracting the second expansion from the first yields the central-difference approximation
$$\boldsymbol{\alpha}_i'(r) = \frac{\boldsymbol{\alpha}_i(r + \Delta r) - \boldsymbol{\alpha}_i(r - \Delta r)}{2\Delta r} + O(\Delta r^2).$$
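In code, this approximation is a one-liner for any black-box function $\boldsymbol{\alpha}_i(r)$; the step size dr is a free parameter:

def central_difference(f, r, dr=1e-3):
    # O(dr^2)-accurate first derivative: the even-order Taylor terms cancel.
    return (f(r + dr) - f(r - dr)) / (2 * dr)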
Table 3.1: Results for KRR. Mean RMSE and standard deviation (in parentheses).
Table 3.2: Results for KPCR. Mean RMSE and standard deviation (in parentheses).
3.5 Experimental results
In this section, we will use the Pareto-optimal criterion derived in this chapter to
select the appropriate kernel parameters of KRR and KPCR. Comparisons with the
state of the art, as well as with the alternative criteria (i.e., sum and product) defined above, are provided.
We select fifteen data-sets from the UCI machine learning databases [7] and the DELVE collections [29]. Specifically, these databases include the following sets: housing, mpg, slump, price, diabetes, wdbc, servo, puma-8nm, puma-8nh, puma-8fm, puma-8fh, kin-8nm, kin-8nh, kin-8fm, and kin-8fh (8192/9). The Boston housing data-set was collected by the U.S. Census Service and describes the housing information in Boston, MA. The task is to predict the median value of a home. The auto mpg set details fuel consumption predicted in terms of 3 discrete and 4 continuous attributes. In the slump set, the concrete slump is to be predicted from the ingredients of the concrete. In the price set, the task is to predict the price of a car based on 15 attributes. In the diabetes set, the goal is to predict the level of the serum C-peptide. In the Wisconsin Diagnostic Breast Cancer (wdbc) set, the regression target describes the condition of the patients. The servo set concerns a robot control problem. The rise time of a servomechanism is predicted based on two gain settings and two choices of mechanical linkages. The task in the Pumadyn sets is to predict angular acceleration from a simulation of the dynamics of a robot arm. And, the Kin sets require us to predict the distance of the end-effector from a target in a simulation of the forward dynamics of an 8 link all-revolute robot arm. There are different scenarios in both the Pumadyn and Kin families, combining nonlinear (n) or fairly linear (f) dynamics with moderate (m) or high (h) noise.
To test our approach, for each data-set, we generate five random permutations and
conduct 10-fold cross-validation on each one. The mean and the standard deviations
are reported. In the experiments, we use the root mean squared error (RMSE) as our
measure of the deviation between the true response $y_i$ and the predicted response $\hat{y}_i$, i.e., $\mathrm{RMSE} = \left[n^{-1}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\right]^{1/2}$.
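For reference, a minimal numpy implementation of this error measure:

import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error between the true and predicted responses.
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))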
We compare our approaches to the two typical criteria used in the literature, cross-validation (CV) and generalized cross-validation (GCV) [47]. Recall that in our proposed modified $\epsilon$-constraint criterion, we also need to employ a 10-fold CV. The kernel parameter $\sigma$ in the RBF is searched in the range $[\bar{d} - 2s_d,\, \bar{d} + 2s_d]$, where $\bar{d}$ and $s_d$ are the mean and standard deviation of the distances between all pairwise training samples. In the polynomial kernel, its degree is tested in the range of 1 to 6. The regularization parameter $\lambda$ in KRR is selected among the set $\{10^{-5}, \dots, 10^{4}\}$, and the percentage of variance $r$ in KPCR is searched in the range $[0.8, 1]$. Moreover, we compare our modified $\epsilon$-constraint approach with the original $\epsilon$-constraint method.
Table 3.1 shows the regression results of KRR using both the RBF and the poly-
nomial kernels. A two-sided paired Wilcoxon signed rank test is used to check sta-
tistical significance. The error in bold is significantly smaller than the others at
significance level 0.05. We see that regardless of the kernel used, the proposed modified $\epsilon$-constraint approach consistently provides the smallest RMSE. We also note that the modified $\epsilon$-constraint approach obtains smaller RMSE than the original $\epsilon$-constraint method.
Table 3.2 shows the regression results of KPCR using the RBF and polynomial
kernels. Once more, the proposed approach generally outperforms the others. Ad-
ditionally, as in KRR, the modified $\epsilon$-constraint approach generally yields the best
results.
A major advantage of the proposed approach over CV is that it uses all the
training data for training. In contrast, CV needs to use part of the training data for
verification purposes. This limits the amount of training data used to fit the function
to the data.
We now provide a comparison with the methods available in the literature and
typically employed in the above databases. Specifically, we compare our results with
Support Vector Regression (SVR) [93] with the RBF and polynomial kernels, Multiple
Kernel Learning in SVR (MKL-SVR) [76], and Gaussian Processes for Regression
(GPR) [104]. In SVR, the parameters are selected using CV. In MKL-SVR, we
employ three kernel functions: the RBF, the polynomial, and the Laplacian, defined as $k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|}{\sigma}\right)$. The RBF kernel parameter is set to be the mean of the
Table 3.3: Mean and standard deviation of RMSE of different methods.
Data set/Method Modified ε-constraint SVRrbf SVRpol MKL-SVR GPR
Housing 2.89(0.77) 3.45(1.04) 5.66(1.88) 3.34(0.70) 3.05(0.82)
Mpg 2.51(0.52) 2.69(0.60) 4.03(0.96) 2.67(0.61) 2.64(0.50)
Slump 6.62(1.49) 6.77(1.90) 8.37(2.86) 6.90(1.41) 6.88(1.51)
Price 2.21(0.90) 2.40(0.84) 3.72(1.55) 2.51(0.91) 11.2(2.26)
Diabetes 0.55(0.23) 0.68(0.31) 0.78(0.39) 0.65(0.35) 0.59(0.20)
Wdbc 31.46(1.59) 32.08(4.76) 44.1(9.87) 32.20(4.65) 31.60(4.3)
Servo 0.51(0.29) 0.61(0.35) 1.37(0.41) 0.60(0.36) 0.57(0.30)
Puma-8nm 1.44(0.02) 1.44(0.03) 3.35(0.11) 1.51(0.02) 1.47(0.03)
Puma-8nh 3.65(0.03) 3.67(0.06) 4.55(0.07) 3.78(0.05) 3.65(0.03)
Puma-8fm 1.13(0.01) 1.17(0.02) 2.04(0.05) 1.21(0.03) 1.17(0.02)
Puma-8fh 3.23(0.01) 3.24(0.02) 3.84(0.06) 3.35(0.05) 3.23(0.01)
Kin-8nm 0.11(0.002) 0.12(0.002) 0.21(0.003) 0.16(0.03) 0.12(0.002)
Kin-8nh 0.18(0.001) 0.19(0.003) 0.23(0.01) 0.20(0.002) 0.18(0.002)
Kin-8fm 0.016(0.002) 0.043(0.002) 0.048(0.001) 0.045(0.002) 0.013(0.00009)
Kin-8fh 0.07(0.002) 0.047(0.0009) 0.06(0.006) 0.05(0.001) 0.043(0.0007)
Table 3.4: Comparison of our results with the state of the art.
Housing Mpg Slump Price Diabetes servo Puma-8nm
Best 3.46(0.93) 2.67(0.61) 6.79(1.89) 2.62(0.87) 0.68(0.25) 0.59(0.30) 1.47(0.03)
Ours 2.89(0.77) 2.51(0.50) 6.62(1.49) 2.21(0.90) 0.55(0.23) 0.51(0.29) 1.44(0.02)
Puma-8nh Puma-8fm Puma-8fh Kin-8nm Kin-8nh Kin-8fm Kin-8fh
Best 3.65(0.03) 1.17(0.02) 3.23(0.01) 0.12(0.002) 0.18(0.002) 0.013(0.00009) 0.043(0.0007)
Ours 3.65(0.03) 1.13(0.01) 3.23(0.01) 0.11(0.002) 0.18(0.002) 0.016(0.002) 0.07(0.002)
distances between all pairwise training samples; the degree of the polynomial kernel is
set to 2; and $\sigma$ in the Laplacian kernel is set as $\sigma = \frac{2}{n(n+1)}\sum_{i=1}^{n}\sum_{j=i}^{n}\|\mathbf{x}_i - \mathbf{x}_j\|$, where $n$ is the number of training samples. MOSEK [3] is used to solve the quadratically constrained quadratic program defined over these kernel matrices. In GPR, the hyperparameters of the mean and covariance functions are learned by maximizing the marginal likelihood.
We compare the results given by the above algorithms with those obtained by our
approach applied to KRR and using the RBF kernel, because this method tends to
Table 3.5: Regression performance with alternative optimization criteria.
Method KRRR KRRP PCRR PCRP
Data set Ours Sum Product Ours Sum Product Ours Sum Product Ours Sum Product
Housing 2.89(0.77) 3.06(0.78) 3.30(0.85) 3.71(0.87) 4.75(0.89) 4.66(1.75) 4.04(0.88) 9.46(5.73) 6.48(3.85) 8.45(1.72) 5.56(6.77) 4.97(3.98)
Mpg 2.51(0.52) 2.63(0.50) 2.64(0.49) 2.82(0.45) 4.34(2.51) 18.04(2.59) 3.00(0.58) 4.56(0.83) 4.25(0.69) 7.30(0.81) 5.48(6.04) 4.29(3.21)
Slump 6.62(1.49) 6.87(1.51) 6.85(1.51) 7.09(1.22) 8.03(1.78) 13.17(2.84) 6.39(1.53) 7.65(1.86) 7.64(1.87) 7.68(1.88) 14.70(9.04) 16.79(9.10)
Price 2.21(0.90) 2.73(1.45) 2.76(1.46) 3.08(1.20) 2.72(1.11) 3.10(1.11) 3.90(2.16) 4.17(2.85) 4.17(2.86) 6.06(1.93) 8.76(1.99) 13.43(20.9)
Diabetes 0.55(0.23) 0.66(0.29) 0.75(0.33) 0.52(0.17) 0.60(0.21) 3.47(2.15) 0.76(0.33) 0.86(0.45) 0.86(0.43) 1.01(1.47) 1.09(1.56) 0.65(0.22)
Wdbc 31.46(1.59) 47.99(7.01) 48.60(8.98) 34.11(4.23) 51.31(16.98) 64.29(73.02) 30.66(4.71) 34.01(4.86) 38.91(5.31) 34.47(10.27) 56.61(19.90) 56.58(32.64)
Servo 0.51(0.29) 0.60(0.31) 0.57(0.30) 0.70(0.25) 0.94(0.36) 1.33(0.40) 0.71(0.30) 0.83(0.50) 0.86(0.49) 1.13(0.25) 0.70(0.27) 0.91(0.33)
Puma-8nm 1.44(0.02) 1.50(0.03) 1.50(0.03) 1.42(0.02) 3.40(0.59) 3.89(0.04) 3.69(0.02) 2.25(0.38) 2.37(0.53) 3.71(0.32) 8.50(0.87) 6.96(3.00)
Puma-8nh 3.65(0.03) 3.80(0.03) 3.86(0.04) 5.08(1.26) 4.36(0.29) 4.54(0.03) 4.39(0.04) 3.90(0.64) 3.97(0.65) 4.58(0.29) 13.82(3.81) 11.27(5.41)
Puma-8fm 1.13(0.01) 1.18(0.01) 1.18(0.01) 1.27(0.01) 1.52(0.48) 2.79(0.12) 1.28(0.05) 2.58(0.97) 2.09(0.84) 1.29(0.005) 6.98(1.23) 3.88(2.81)
Puma-8fh 3.23(0.01) 3.28(0.02) 3.28(0.02) 3.78(0.16) 3.81(0.38) 3.79(0.08) 3.22(0.01) 3.40(0.38) 3.30(0.12) 3.75(0.24) 9.78(5.64) 7.82(5.44)
Kin-8nm 0.11(0.002) 0.12(0.008) 0.13(0.01) 0.18(0.0008) 0.21(0.02) 0.68(0.35) 0.19(0.01) 0.21(0.02) 0.22(0.02) 0.22(0.04) 0.27(0.24) 0.26(0.25)
Kin-8nh 0.18(0.001) 0.19(0.002) 0.19(0.002) 0.20(0.002) 0.22(0.005) 0.42(0.27) 0.21(0.007) 0.22(0.002) 0.23(0.01) 0.25(0.005) 0.50(0.47) 0.29(0.01)
Kin-8fm 0.016(0.002) 0.020(0.0005) 0.020(0.0005) 0.013(0.0001) 0.020(0.0001) 0.57(0.22) 0.05(0.01) 0.07(0.03) 0.06(0.01) 0.02(0.0001) 0.11(0.29) 0.03(0.01)
Kin-8fh 0.07(0.002) 0.05(0.0007) 0.046(0.0005) 0.046(0.0002) 0.05(0.0001) 0.75(0.30) 0.06(0.01) 0.08(0.03) 0.08(0.02) 0.07(0.07) 0.13(0.25) 0.06(0.02)
yield more favorable results. The comparisons are shown in Table 3.3. Note that our approach yields the lowest errors in most data-sets. We also provide a comparison between our results and the best results found in the literature. For the Boston
housing data-set, [91] reports the best fits with Relevance Vector Machine (RVM);
for the Auto mpg data-set, the best result is obtained by MKL-SVR [76]; for the
Slump data, [22] proposes a k nearest neighbor based regression method and shows
its superiority over others; for the price data-set, [100] reports the best result with
pace regression; the Diabetes data-set is used in [24] and the best result is obtained
using Least Angle Regression; for the servo data-set, [26] shows that regression with
random forests obtains the best results; and for the last eight data-sets, Gaussian processes are known to provide state-of-the-art results [103]. The comparison across all the data-sets is
given in Table 3.4. We see that our approaches provide better or comparable results
to the top results described in the literature, but with the main advantage that all the training data can be used for training.
Table 3.6: Comparison with L2 norm.
Method KRRR KRRP PCRR PCRP
Data set Ours L2 norm Ours L2 norm Ours L2 norm Ours L2 norm
Housing 2.89(0.77) 3.45(0.95) 3.71(0.87) 4.96(0.92) 4.04(0.88) 4.36(0.96) 8.45(1.72) 7.40(1.72)
Mpg 2.51(0.52) 3.09(0.51) 2.82(0.45) 4.19(2.23) 3.00(0.58) 3.45(0.75) 7.30(0.81) 7.42(1.29)
Slump 6.62(1.49) 6.98(1.48) 7.09(1.22) 14.97(2.23) 6.39(1.53) 6.43(1.47) 7.68(1.88) 8.12(2.08)
Price 2.21(0.90) 2.81(1.21) 3.08(1.20) 2.45(3.77) 2.35(1.04) 2.73(1.31) 6.06(1.93) 5.88(1.73)
Diabetes 0.55(0.23) 0.68(0.25) 0.52(0.17) 0.78(0.20) 0.76(0.33) 0.87(0.43) 1.01(1.47) 0.94(1.40)
Wdbc 31.46(1.59) 32.10(4.56) 34.11(4.23) 42.69(13.41) 30.66(4.71) 30.69(4.66) 34.47(10.27) 45.79(15.69)
Servo 0.51(0.29) 0.90(0.31) 0.70(0.25) 0.96(0.34) 0.71(0.30) 0.73(0.31) 1.13(0.25) 1.03(0.25)
Puma-8nm 1.44(0.02) 1.47(0.03) 1.42(0.02) 3.84(0.04) 3.69(0.02) 3.37(0.04) 3.71(0.32) 4.21(0.16)
Puma-8nh 3.65(0.03) 3.75(0.03) 5.08(1.26) 4.66(0.06) 4.39(0.04) 4.19(0.14) 4.58(0.29) 4.61(0.31)
Puma-8fm 1.13(0.01) 1.23(0.01) 1.27(0.01) 1.63(0.49) 1.28(0.05) 1.26(0.003) 1.29(0.005) 1.58(0.64)
Puma-8fh 3.23(0.01) 3.23(0.01) 3.78(0.16) 4.06(0.03) 3.22(0.01) 3.30(0.12) 3.75(0.24) 3.97(0.52)
Kin-8nm 0.11(0.002) 0.17(0.001) 0.18(0.0008) 0.21(0.03) 0.19(0.01) 0.16(0.03) 0.22(0.04) 0.22(0.03)
Kin-8nh 0.18(0.001) 0.20(0.001) 0.20(0.002) 0.26(0.007) 0.21(0.007) 0.21(0.002) 0.25(0.005) 0.29(0.09)
Kin-8fm 0.016(0.002) 0.020(0.0003) 0.013(0.0001) 0.024(0.0005) 0.05(0.01) 0.03(0.003) 0.02(0.0001) 0.05(0.08)
Kin-8fh 0.07(0.002) 0.06(0.0007) 0.046(0.0002) 0.067(0.0005) 0.06(0.01) 0.06(0.004) 0.07(0.07) 0.05(0.003)
Recall that the literature defines two classical criteria to combine the two objective functions: the sum and the product criteria. Here we provide a comparison of these criteria and the approach derived in this chapter. In particular, we combine model fit $E_f$ and model complexity $E_c$ via the summation and product in KRR and KPCR. The regularization parameters $\eta$ in (3.29) and $\mu$ in (3.31) are selected by 5-fold CV. Table
3.5 shows the corresponding regression results. In this table, AR and AP denote the
method A with a RBF and a polynomial kernel, respectively. We see that these two
alternative criteria generally perform worse than the Pareto-optimal based approach.
We give a comparison between our complexity measure $E_c$ and the commonly used $L_2$ norm. The results are shown in Table 3.6. We see that the proposed complexity measure generally yields better results than the $L_2$-norm penalty.
3.5.5 Age estimation
In the last two sections we want to test the derived approach on two classical real-world applications: age estimation and weather prediction.
The process of aging can cause significant changes in human facial appearances.
We used the FG-NET aging database described in [2] to model these changes. This
data-set contains 1,002 face images of 82 subjects at different ages. The age ranges
from 0 to 69. Face images include changes in illumination, pose, expression and
occlusion (e.g., glasses and beards). We warp all the images to a standard size, with a constant position for the mouth and eyes, as in [60]. Sample images of one individual are shown in Figure 3.4. We represent each image as a
vector concatenating all the pixels of the image, i.e., the appearance-based feature
representation.
We generate five random divisions of the data, each with 800 images for training
and 202 for testing. The mean absolute errors (MAE) are in Table 3.7. We can see that the modified $\epsilon$-constraint method outperforms the other algorithms. In [115],
the authors represent the images using a set of highly redundant Haar-like features
and select relevant features using a boosting method. We implemented this method
using the same five divisions of the data. Our approach is slightly better while using a much simpler image representation.
Figure 3.4: Sample images showing the same person at different ages.
Table 3.7: MAE of the proposed approach and the state of the art in age estimation.
Modified ε-constraint CV GCV SVRrbf SVRpol MKL-SVR GPR [115]
MAE 5.85 6.59 13.83 6.46 6.95 7.18 15.46 5.97
The weather data of the University of Cambridge [102] is used in this experiment. In this data-set, several weather parameters are measured every hour during the day. These parameters include pressure, humidity, dew point (i.e., the temperature at which a parcel of humid air must be cooled for it to
condense), wind knots, sunshine hours and rainfall. We use the data in a period of
five years (2005-2009) for training and the data between January and July of the year
2010 for testing. This corresponds to 1,701 training samples and 210 testing samples.
The results are in Table 3.8. In [77], the authors employed support vector regression
and report state of the art results. Our experiment shows that our approach performs
better than their algorithm. The predictions obtained from the modified $\epsilon$-constraint
Table 3.8: RMSE of several approaches applied to weather prediction.
Modified ε-constraint CV GCV SVRrbf SVRpol MKL-SVR GPR
RMSE 0.81 0.83 0.90 0.87 0.95 1.07 2.53
approach are also plotted in Figure 3.5. We observe that our approach can provide accurate estimates of the maximum daily temperature.
3.6 Conclusions
Several model selection criteria have been proposed for linear regressions, but their non-linear extensions are known to be problematic, because the complexity of the estimated function (e.g., the degree of the polynomial describing the function) could increase rapidly, yielding poor generalizations on the unseen testing set [74]. To resolve this prob-
lem, we have derived a roughness penalty that measures the degree of change (of the
regressed function) in the kernel space. This measure can then be used to obtain esti-
mates that (in general) generalize better to the unseen testing set. However, to achieve
this, the newly derived objective function needs to be combined with the classical one
measuring its fitness (i.e., how well the function estimates the sample vectors). Clas-
sical solutions would be to use the sum or product of the two objective functions [113].
However, we have shown that these solutions do not generally yield desirable results
Figure 3.5: This figure plots the estimated (lighter dashed curve) and actual (darker dashed
curve) maximum daily temperature for a period of more than 200 days. The estimated
results are given by the algorithm proposed in this chapter.
To this end, we have proposed a multi-objective optimization approach based on the idea of Pareto-optimality. In this MOP framework, we have derived a novel method: the modified $\epsilon$-constraint approach. While the classical $\epsilon$-constraint method does not guarantee Pareto-optimality, we have proven that the derived modified version does. Extensive evaluations with a large variety of databases have shown that this proposed modified $\epsilon$-constraint approach generally outperforms the alternatives.
The other major contribution of the chapter has been to show how we can use the
derived approach for optimizing the kernel parameters. In any kernel method, one
always has to optimize the parameters of the kernel mapping function. The classical
approach for this task is CV. This technique suffers from two main problems. First, it is computationally expensive. Second, it cannot use the entire sample set for training, because part of it is employed as a validation set. But, we know that (in general) the larger the training set, the better. Our proposed
MOP framework is ideal for optimizing the kernel parameters, because it yields objective functions that can be minimized with standard gradient descent techniques. Our experiments show that the proposed approach outperforms CV and GCV and the other state-of-the-art techniques in kernel methods in regression.
We have also compared our results to those obtained with the sum and product
criteria. And, we have compared our results to the best fits found in the literature for
each of the databases. In all cases, these comparisons demonstrate that the proposed
approach yields fits that generalize better to the unseen testing sets.
CHAPTER 4
4.1 Introduction
The performance of the kernel methods greatly depends on the selection of the kernel function. An appropriately chosen kernel can lead to a significant improvement in the generalization ability of the learning approaches [69, 105, 10, 19]. Ideally, the choice of the kernel function is based on the prior knowledge of the problem domain. Unfortunately, in general, we do not have prior knowledge on the data, and the kernel has to be selected through other means.
One of the most commonly used kernels in the literature is the Radial Basis Function (RBF), defined as $k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$, where $\sigma$ is a kernel parameter.
In this kernel, data sample evaluation is equivalent to the likelihood calculation based
on Parzen windows [73, 25], which is a non-parametric density estimator. The Parzen
window size (i.e., the kernel parameter $\sigma$) significantly affects the algorithm's performance. This parameter controls the size of the neighborhood centered at the point that is being evaluated. Estimates with too large a $\sigma$ will suffer from oversmoothing (where the real underlying structure is obscured), while a too small $\sigma$ will lead to a wiggly estimate (which has too much statistical variability). It is important to note that an assumption associated with the use of a fixed $\sigma$ is that the same window is adequate over the entire feature space.
Figure 4.1: A two class example. Each class is represented by a mixture of two Gaussians
with different covariance matrices. The RBF and the proposed Local-density Adaptive (LA)
kernels are evaluated on the four marked points. (a) Density estimation in the RBF
kernel uses a fixed window, illustrated by black circles. Note that this fixed window cannot
capture different local densities. (b) Density estimation with the proposed LA kernel.
This means that the use of a fixed-shape kernel is only reasonable for evenly distributed data.
However, in practice, the data is usually drawn from a complex distribution where
the local regions have distinct densities. In such cases, a kernel with a fixed shape
such as the RBF kernel will not perform well because it cannot adapt to local changes.
This problem is illustrated in Figure 4.1(a). In this figure, we see that the RBF kernel
parameter would fit some local regions well, but would not be appropriate for other
local regions with distinct densities. In these cases, the well-known overfitting and underfitting problems arise.
A solution to this problem is to vary the kernel bandwidth of the Parzen density
estimate based on local densities. Some approaches have been proposed in the density
estimation literature. One well-known method is the k-nearest neighbor estimate [56],
where the density is estimated by varying the window size to accommodate the k nearest neighbors of the point being evaluated. A related class of approaches is called adaptive kernel estimates [9, 89, 48], which explicitly modify the window size according to the local data distributions. These estimators can thus adapt to regions of distinct density.
However, these methods could not be directly used in most kernel-based approaches
for classification, because the resulting kernel is not guaranteed to be a Mercer kernel
[84], i.e., the corresponding kernel matrix is not positive semi-definite. This will
indeed lead to several significant problems. First, a kernel function which is not
positive semi-definite will not induce a reproducing kernel Hilbert space [84]. If the
inner product is not well defined, then the kernel trick cannot be used. Second, in
Support Vector Machines (SVM) [92], the geometric interpretation (i.e., maximizing
the margin) is only available in the case of positive semi-definite and conditionally
positive semi-definite functions [82]. Also, in such cases, the solution is unique since the underlying optimization problem is convex.
This chapter proposes a new class of kernels called Local-density Adaptive (LA)
kernels, which are guaranteed to be Mercer kernels. Thus, our kernels can be directly
used in any kernel-based approaches such as Kernel Discriminant Analysis (KDA) [67,
5], Kernel Principal Component Analysis (KPCA) [83, 71] and Kernel SVM (KSVM)
for nonlinear feature extraction and classification. The similarity of the pairwise
samples defined by LA kernels is constrained by the local density information, which
is calculated based on a weighted local variance measure. Thus, our kernels can
adaptively fit the local shape of the data while evaluating the sample similarities.
4.2.1 Motivation
In any kernel method, a kernel function must be selected. The Radial Basis Function (RBF) kernel is a popular choice. The
kernel parameter in this kernel is fixed for the entire data. Instead of using a
single for the estimate, it is also possible to represent the distribution using a
diagonal matrix with each diagonal entry measuring the variance of each dimension,
i.e., $k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\sum_{l=1}^{p}\frac{(x_{li} - x_{lj})^2}{\sigma_l^2}\right)$, where $x_{li}$ is the $l$th dimension of sample $\mathbf{x}_i$ and $p$ is the dimension of the input space. Alternatively, we can use a full covariance matrix, which yields the Mahalanobis kernel.
It is important to note that the evaluation in the above kernels assumes the data
is Gaussian distributed with fixed variance over the entire feature space. The key
idea of this chapter is to build a kernel which can automatically vary its shape (i.e., its window size) according to the local density of the data.
A possible approach would be to adopt the local covariance matrix, which char-
acterizes the local structure of the data. Thus, a possible kernel function can be
$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-(\mathbf{x}_i - \mathbf{x}_j)^T \Sigma_{ij}^{-1}(\mathbf{x}_i - \mathbf{x}_j)\right), \qquad (4.1)$$
where $\Sigma_{ij} = (\Sigma_i + \Sigma_j)/2$, and $\Sigma_i$ and $\Sigma_j$ are the local covariance matrices centered at $\mathbf{x}_i$ and $\mathbf{x}_j$, respectively. The matrix $\Sigma_{ij}$ characterizes the local density information in the neighborhoods of $\mathbf{x}_i$ and $\mathbf{x}_j$. Each local covariance matrix $\Sigma_i$ is estimated from the k-nearest neighbors of $\mathbf{x}_i$.
Eq. (4.1) seems a reasonable kernel function, since the likelihood calculation is
now given by the local distribution. However, this function is not a Mercer kernel.
Note that if a kernel function k(xi , xj ) is a Mercer kernel, there exists a mapping
function $\phi(.) : \mathbb{R}^p \to \mathcal{F}$ such that $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$. The kernel function in (4.1) can be rewritten as $k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-(\mathbf{z}_i - \mathbf{z}_j)^T(\mathbf{z}_i - \mathbf{z}_j)\right)$, (4.2) where $\Sigma_{ij}^{-1} = A_{ij}^T A_{ij}$ and $\mathbf{z}_i = A_{ij}\mathbf{x}_i$. Since (4.2) is an RBF kernel w.r.t. $\mathbf{z}$, there exists a mapping $\phi$ such that $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{z}_i)^T\phi(\mathbf{z}_j) = \phi(A_{ij}\mathbf{x}_i)^T\phi(A_{ij}\mathbf{x}_j) = \phi_{ij}(\mathbf{x}_i)^T\phi_{ij}(\mathbf{x}_j)$, where $\phi_{ij}(\mathbf{x}) = \phi(A_{ij}\mathbf{x})$. Since $\phi_{ij}(.)$ is dependent on the samples in the input space, there does not exist a unique mapping for the kernel function in (4.1). This implies that (4.1) is not a Mercer kernel.
Our goal is to derive a Mercer kernel which calculates the likelihood from the local distributions. To this end, we define our kernel as the product of two kernel functions $k_1(\mathbf{x}_i, \mathbf{x}_j)$ and $k_2(\mathbf{x}_i, \mathbf{x}_j)$, i.e., $k(\mathbf{x}_i, \mathbf{x}_j) = k_1(\mathbf{x}_i, \mathbf{x}_j)\, k_2(\mathbf{x}_i, \mathbf{x}_j)$. If $k_1$ and $k_2$ are both Mercer kernels, then $k$ is also a Mercer kernel [84]. $k_1$ can be any fixed-shape Mercer kernel, such as the RBF. Then we need to build $k_2$, which measures the local density. To derive $k_2$, let us start with the following observation: $k(\mathbf{x}_i, \mathbf{x}_j) = q(\mathbf{x}_i)q(\mathbf{x}_j)$ is a Mercer kernel for any non-negative function $q$ on $\mathbf{x}$.
Proof. Let $\mathbf{q} = (q(\mathbf{x}_1), q(\mathbf{x}_2), \dots, q(\mathbf{x}_n))^T$ be an $n \times 1$ vector, with $n$ the number of samples. Then the kernel matrix $K$ can be written as $K = \mathbf{q}\mathbf{q}^T$. Thus, for any $\boldsymbol{\gamma} \in \mathbb{R}^n$,
$$\boldsymbol{\gamma}^T K \boldsymbol{\gamma} = \boldsymbol{\gamma}^T \mathbf{q}\mathbf{q}^T \boldsymbol{\gamma} = (\boldsymbol{\gamma}^T\mathbf{q})^2 \geq 0.$$
This means that the kernel matrix $K$ is positive semi-definite and, hence, that the kernel function is a Mercer kernel.
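The argument is easy to verify numerically: for any non-negative $\mathbf{q}$, the matrix $\mathbf{q}\mathbf{q}^T$ has no eigenvalue below zero (up to rounding). A small check assuming numpy:

import numpy as np

rng = np.random.default_rng(1)
q = np.abs(rng.standard_normal(50))            # arbitrary non-negative values q(x_i)
K2 = np.outer(q, q)                            # k2(x_i, x_j) = q(x_i) q(x_j)
print(np.linalg.eigvalsh(K2).min() >= -1e-10)  # True: K2 is positive semi-definite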
We could thus define $k_2$ as $k_2(\mathbf{x}_i, \mathbf{x}_j) = \rho(\mathbf{x}_i)\rho(\mathbf{x}_j)$, where $\rho(\mathbf{x}) \geq 0$ for all $\mathbf{x}$. Here, $\rho(\mathbf{x})$ should capture the local density of the data around $\mathbf{x}$. A natural way to achieve this is to measure the variance of the data in the neighborhood of $\mathbf{x}$. Formally,
$$\rho(\mathbf{x}) = \frac{1}{k}\sum_{i=1}^{k}\|\mathbf{x}_i - \mathbf{x}\|^2, \qquad (4.5)$$
where $\mathbf{x}_1, \dots, \mathbf{x}_k$ are the k-nearest neighbors of $\mathbf{x}$. This means the local variance information is calculated
only from the $k$ samples which are closest to $\mathbf{x}$; the influence of the other samples is discarded. More generally, we can weight the contribution of every sample with a kernel $h_{\mathbf{x}}(.)$, i.e.,
$$\rho(\mathbf{x}) = \sum_{i=1}^{n} h_{\mathbf{x}}(\mathbf{x}_i)\,\|\mathbf{x}_i - \mathbf{x}\|^2. \qquad (4.6)$$
Note that (4.5) is a special case of (4.6). To see this, we first denote $N_k(\mathbf{x})$ as the set of samples that are the k nearest neighbors of $\mathbf{x}$. Then, a uniform kernel $h_{\mathbf{x}}(.)$ is defined as
$$h_{\mathbf{x}}(\mathbf{x}_i) = \begin{cases} \frac{1}{m}, & \mathbf{x}_i \in N_k(\mathbf{x}) \\ 0, & \text{otherwise}, \end{cases}$$
where $m$ is a normalizing factor that ensures the kernel integrates to 1. This makes (4.6) reduce to (4.5).
Alternatively, we can incorporate the influences of all the samples in the input
space, as the soft neighborhood used in kernel regression [70, 42]. The weight of
each sample xi is calculated based on its distance from x. In this chapter, we adopt
the Gaussian kernel, $h_{\mathbf{x}}(\mathbf{x}_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}\|^2}{2\sigma^2}\right)$, where $\sigma$ is a scaling parameter. With the weights normalized to sum to one, (4.6) becomes
$$\rho(\mathbf{x}_i) = \frac{\sum_{j=1}^{n}\exp\left(-\frac{\|\mathbf{x}_j - \mathbf{x}_i\|^2}{2\sigma^2}\right)\|\mathbf{x}_j - \mathbf{x}_i\|^2}{\sum_{j=1}^{n}\exp\left(-\frac{\|\mathbf{x}_j - \mathbf{x}_i\|^2}{2\sigma^2}\right)}. \qquad (4.7)$$
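A minimal numpy sketch of (4.7); the function name local_variance is ours:

import numpy as np

def local_variance(X, sigma):
    # Eq. (4.7): Gaussian-weighted variance of the data around each sample.
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.exp(-D2 / (2 * sigma ** 2))                   # Gaussian weights h_x(x_j)
    return (W * D2).sum(axis=1) / W.sum(axis=1)          # rho(x_i)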
Note that (4.7) can be rewritten as $\rho(\mathbf{x}_i) = \mathrm{tr}(\Sigma_{\mathbf{x}_i})$, where $\mathrm{tr}(.)$ is the trace of a matrix and $\Sigma_{\mathbf{x}_i}$ is the weighted local covariance matrix at $\mathbf{x}_i$.
The equation above shows the relationship of (4.7) and the local covariance ma-
trices, which encode the information of local distributions. To demonstrate that (4.7)
Figure 4.2: This figure illustrates how the local variance measurement given by (4.7) is
used. The axis represents the magnitude of the variance around each sample.
can appropriately measure the local density information, we calculate the local vari-
ances of the data in Figure 4.1 using (4.7). The results are shown in Figure 4.2. The
axis represents the local variance around each sample. We see that this local variance
measure effectively captures the local density information. The local variances are
smaller for the samples in the high density regions, and larger for the samples in the
It now seems that (4.7) can be readily used in our LA kernel approach. However,
a limitation of (4.7) is that it is dependent on the scale of the data, since it is related to the pairwise sample distances. For data of large scale, the resulting kernel matrix could have very large values in each entry, which can cause numerical problems. Hence, a normalization step should be added. One way to solve this is to normalize (4.7) with the average of the local variances over all samples,
$$\rho_s(\mathbf{x}_i) = \frac{\rho(\mathbf{x}_i)}{\frac{1}{n}\sum_{i=1}^{n}\rho(\mathbf{x}_i)}. \qquad (4.8)$$
Combining the above results, we can define our proposed LA kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$ as
$$k(\mathbf{x}_i, \mathbf{x}_j) = \rho_s(\mathbf{x}_i)\,\rho_s(\mathbf{x}_j)\,k_1(\mathbf{x}_i, \mathbf{x}_j). \qquad (4.9)$$
Recall that $k_1(\mathbf{x}_i, \mathbf{x}_j)$ can be any likelihood evaluation kernel function with a fixed shape, such as the RBF kernel.
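Combining (4.7)-(4.9), the LA kernel matrix of a training set can be sketched as follows, reusing the local_variance helper from the sketch above; the multiplicative form of (4.9) follows our reconstruction, with an RBF playing the role of k1:

import numpy as np

def la_kernel(X, sigma_k1, sigma_rho):
    # Eq. (4.8): normalize the local variances to make them scale invariant.
    rho = local_variance(X, sigma_rho)
    rho_s = rho / rho.mean()
    # k1: a fixed-shape RBF kernel.
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K1 = np.exp(-D2 / (2 * sigma_k1 ** 2))
    # Eq. (4.9): conformal product rho_s(x_i) rho_s(x_j) k1(x_i, x_j).
    return K1 * np.outer(rho_s, rho_s)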
Note that the kernel defined in (4.9) falls into the class of conformal kernels [84],
which define a conformal transformation preserving the angles in the kernel space.
Wu and Amari [107] use a conformal kernel to increase the influence of the samples near the classification boundary in SVM. This conformal kernel is modified in [106] to adaptively address the class imbalance
problem in SVM. Later, Gonen and Alpaydin [38] extend conformal kernels to multiple
kernel learning. In the present work, we have derived a completely different conformal
function $\rho_s(\mathbf{x})$ which encodes the local density information such that the kernel can adapt to the local distributions of the data.
Our kernel function in (4.9) calculates the similarity of pairwise samples based
on the likelihood of the local densities. This is equivalent to evaluating the local
likelihood using windows of different sizes. A large-size window is used for the regions
where samples are distributed sparsely, while a small-size window is applied to the
regions where the data density is high. An advantage of the proposed kernel function
is that it can achieve this goal without changing the window size explicitly. To see
this, consider the case where the neighborhood around sample xi is sparse. That
means the local variance of $\mathbf{x}_i$ is very large, yielding a large $\rho_s(\mathbf{x}_i)$. When a kernel function such as the RBF is multiplied by $\rho_s(\mathbf{x}_i)$, the resultant likelihood becomes large, which is equivalent to using a large-size window (a large $\sigma$ in the RBF case).
The case for a high density region can be similarly observed. Therefore, our kernel
can adaptively change the window size of neighborhoods with different densities in
an implicit way.
Moreover, note that a fixed-shape kernel function such as the RBF kernel is a
special case of our kernel, where the local variance measure $\rho_s(\mathbf{x}_i)$ is a constant for
every sample xi . Thus the function does not need to incorporate information of the
local density.
We provide a case study with the purpose of demonstrating the utility and advan-
tages of the newly derived kernel. We employ the RBF function in k1 (xi , xj ), since
we want to make a comparison between the proposed kernel and the RBF.
We generated a set of 500 samples for each of the two classes in the XOR problem,
Figure 4.3 (a). Each class is represented by a mixture of two Gaussians, i.e., two
subclasses per class. The means of these 4 subclasses are designed so that the data is
distributed in a XOR fashion, Figure 4.3 (a). In each class, the covariance matrices
of each subclass have different scales, controlled by a factor $c$ such that $S_{i2} = cS_{i1}$, where $S_{ij}$ denotes the covariance matrix of the $j$th subclass in class $i$. The larger $c$ is,
Figure 4.3: (a) A case study with synthetic data simulating the classical XOR problem.
(b) classification accuracies of the proposed LA and RBF kernels under different covariance
factors c. The proposed kernel obtains higher classification accuracies than the RBF as c
increases.
the more different the two covariance matrices become. Thus, this data-set provides a performance evaluation of the kernels under different conditions where the local densities differ.
We let KSVM be our classifier. The kernel parameters in the RBF kernel and
the proposed LA kernel are tuned using 5-fold cross-validation (CV). We then calcu-
late the classification accuracies using an independent test set drawn from the same
distributions.
The classification accuracies are plotted in Figure 4.3 (b). We see that as $c$ increases, the RBF results degrade rapidly, whereas those of the LA kernel do not. This is because as the local regions become more diverse in density, a fixed-shape kernel can no longer fit all of them well. We further demonstrate the utility of the proposed LA kernel using a variety of data-sets in Section 4.4.
4.3 Kernel Parameter Selection
Now that we have derived the LA kernel, the next question to answer is how to
select the kernel parameters. Given the kernel function, the success of the kernel
approach greatly depends on the selection of its kernel parameters. Next, we present the two parameter selection criteria used in this chapter. The first is k-fold cross-validation (CV) [42]. The training data is first partitioned into k parts: k-1 parts are used for
training, the other for validation. This process is repeated k times for each possible
value of the parameters. The parameters leading to the largest average validation accuracy are selected.
A first major problem with CV is its complexity. The training process has to be repeated k times and the parameters are selected based on an exhaustive grid search. This high computational cost limits its use in practice. The second major problem of CV is that only part of the
training data is used to estimate the model parameters. In general, one wishes to use
the largest possible number of training samples in search of better generalizations [63].
4.4 Experimental Results
In our experiments, we set $k_1(\mathbf{x}_i, \mathbf{x}_j)$ in (4.9) equal to the RBF and the Mahalanobis kernels, which we denote LA_R and LA_M, and compare these kernels with the classical use of the RBF and the Mahalanobis (denoted MA). The LA kernels are tested in KSVM, KDA and KSDA [113]. The parameters of these kernels, as well as the number of
subclasses of each class in KSDA are selected using the two criteria defined above:
5-fold CV and KBA. The regularization parameter in KSVM is also selected using
CV. In KDA and KSDA, we employ the nearest mean (NM) and the nearest neighbor
We apply the derived LA kernels to seven benchmark data-sets from the UCI
repository [7]. In the Monks problem, the goal is to discriminate two distinct postures
of a robot. Monk 1, 2, and 3 denote three different cases in this task. The NIH Pima
data-set is used for the detection of diabetes from eight measurements. In the BUPA
set liver disorders are detected from a blood test. The task in the Breast Cancer
data-set is to distinguish two classes: no-recurrence and recurrence. And, the goal of
the image segmentation data-set is to classify seven outdoor object categories from a set of image features.
The classification results of these data-sets using CV are presented in Table 4.1.
In KDA and KSDA, the proposed kernels generally outperform the RBF and the
Mahalanobis kernels, regardless of the classifiers used in the reduced space. A similar
Table 4.1: Recognition rates (%) with CV in UCI data-sets.
Method KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
Data set LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
Monk 1 87.7 83.6 90.3 83.1 90.3 83.1 88.2 88.0 88.4 87.5
Monk 2 85.2 82.6 72.5 70.1 74.5 70.1 73.8 74.5 74.1 75.7
Monk 3 97.0 93.5 91.7 82.4 92.1 82.4 94.0 94.0 94.2 89.8
Pima 78.6 79.2 80.4 72.6 72.6 72.6 79.2 77.4 77.4 74.4
Liver 71.0 68.1 66.7 69.6 63.8 69.6 69.6 66.7 65.2 65.2
B. Cancer 72.8 70.1 68.8 67.5 68.8 66.2 68.8 59.7 66.2 71.4
Image-seg 93.3 91.2 93.1 90.7 94.1 93.0 93.1 90.7 94.1 93.0
Data set LAM MA LAM MA LAM MA LAM MA LAM MA
Monk 1 89.6 82.6 84.5 81.0 85.2 81.9 85.0 81.0 85.0 81.9
Monk 2 83.6 82.4 71.5 73.8 75.5 77.8 79.6 78.9 81.3 81.3
Monk 3 94.2 93.1 94.4 93.1 92.6 91.7 94.4 93.1 92.6 94.0
Pima 76.2 76.2 79.8 78.6 76.2 75.0 78.6 76.8 76.2 72.6
Liver 73.9 72.5 71.1 68.1 71.1 68.1 71.1 66.7 71.1 63.8
B. Cancer 74.0 68.8 66.2 72.7 68.8 62.2 66.2 63.6 68.8 65.0
Image-seg 92.2 93.4 92.1 91.5 92.4 91.8 91.1 90.9 91.1 90.6
The higher classification accuracies are bolded.
observation can be made in KSVM, where the proposed LA kernels provide higher
classification accuracies. The results with the KBA criterion are shown in Table 4.2.
We see that although the kernel parameters are now selected using a different criterion,
the proposed kernels still outperform classical kernels in most of the data-sets.
To further evaluate the derived kernels in real-world applications, we apply them to two image databases. The first database we will use
is the ETH-80 [53]. This database is described in the previous chapters. We adopt
the typical leave-one-object-out test, i.e., the 41 images of one of the 80 objects are
Table 4.2: Recognition rates (%) with KBA criterion in UCI data-sets.
Method KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
Data set LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
Monk 1 94.7 94.0 86.8 87.3 87.5 87.3 87.0 88.0 88.1 89.4
Monk 2 82.2 79.9 78.7 78.2 79.2 82.9 79.2 78.2 80.6 82.9
Monk 3 96.5 95.1 94.9 92.6 93.1 92.6 96.3 94.0 95.1 94.7
Pima 77.4 81.5 81.6 76.2 78.6 76.2 81.6 78.6 78.6 73.2
Liver 72.5 68.1 65.2 65.2 60.9 63.7 69.6 65.2 66.7 59.4
B. Cancer 72.7 70.1 67.5 66.2 62.3 61.0 67.5 63.6 66.2 64.9
Image-seg 91.3 91.3 88.0 92.0 93.0 94.1 90.2 92.0 93.2 92.9
Data set LAM MA LAM MA LAM MA LAM MA LAM MA
Monk 1 85.2 84.7 82.9 85.2 83.1 85.2 85.0 84.0 83.8 83.6
Monk 2 83.1 83.8 82.2 83.3 82.6 83.6 80.6 79.0 83.1 81.5
Monk 3 94.0 92.8 93.3 91.4 92.1 91.4 94.7 93.1 94.7 93.3
Pima 82.1 76.8 81.0 76.2 74.4 70.8 78.6 78.0 72.6 76.8
Liver 73.9 71.0 69.6 68.1 69.6 68.1 69.6 68.1 69.6 68.1
B. Cancer 70.1 68.8 63.6 62.3 66.2 62.3 66.2 67.5 67.5 61.0
Image-seg 91.5 89.0 92.1 90.0 92.7 90.1 91.5 90.0 92.3 90.1
used for testing and the images of the rest of the objects are used for training. This process is repeated for each of the 80 objects.
The results are shown in Tables 4.3 and 4.4. We see that our kernels generally outperform the RBF and the Mahalanobis kernels. Note that there is a big improvement of the LA kernels over the classical ones in KDA and KSDA.
We also use the CMU PIE face database [86]. This database contains 68 subjects
with a total of 41,368 images. The face images were obtained under varying pose,
illumination and expression. We select the five near-frontal poses (C05, C07, C09,
C27, C29) and use all the images under different illuminations and expressions for these poses.
Table 4.3: Recognition rates (%) with CV in ETH-80.
KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
83.6 81.8 80.4 71.6 80.2 71.6 80.4 71.6 80.2 71.6
LAM MA LAM MA LAM MA LAM MA LAM MA
77.0 74.6 76.6 70.2 77.0 70.4 76.6 70.2 77.0 70.4
Figure 4.4: Shown here are sample images from PIE data-set.
Table 4.5: Recognition rates (%) with CV in PIE database.
Method KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
N LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
5 72.5 69.7 75.3 72.6 75.9 73.1 75.3 72.6 75.9 73.1
20 89.2 87.6 93.4 92.1 94.2 92.4 93.3 84.8 94.3 84.8
40 94.7 93.2 96.3 94.6 96.7 94.9 96.5 92.6 96.8 92.6
60 96.6 95.5 97.3 96.1 97.4 96.5 97.3 96.4 97.4 96.4
80 98.4 98.0 97.7 96.8 98.0 97.2 97.7 96.5 98.0 96.4
N LAM MA LAM MA LAM MA LAM MA LAM MA
5 73.6 71.2 66.9 61.0 66.8 61.0 66.9 61.0 66.8 61.0
20 89.6 88.5 89.8 85.8 89.7 85.7 88.0 83.3 87.9 83.1
40 93.4 92.7 91.2 89.6 91.4 89.6 92.5 90.7 92.3 90.6
60 95.8 94.9 93.3 91.2 93.3 91.5 93.3 91.2 93.3 91.5
80 97.8 96.7 95.7 93.4 95.9 93.2 95.7 93.4 95.9 93.2
All the face images were aligned, cropped and resized to a standard size of 32 × 32
pixels. Some sample images are shown in Figure 4.4. For each individual, we randomly
selected N (N=5, 20, 40, 60, 80) images for training and used the rest for testing.
The comparative results obtained from KSVM, KDA and KSDA are shown in Table
4.5 and 4.6. The LA kernel consistently achieves better recognition performance than
the RBF and the Mahalanobis kernels. Again, this illustrates the effectiveness of the
proposed approach.
4.5 Conclusions
The choice of the kernel greatly affects the performance of the learning approach. This chapter proposes a class of density adaptive Mercer
kernels which evaluate the sample similarity by taking into account the local data
Table 4.6: Recognition rates (%) with KBA criterion in PIE database.
Method KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
N LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
5 73.6 70.1 75.2 72.6 75.8 73.0 75.2 72.6 75.8 73.0
20 88.7 81.7 91.8 84.8 92.0 84.8 92.0 72.1 92.1 72.1
40 94.7 91.4 95.2 93.2 95.9 93.3 96.1 93.2 96.2 93.3
60 96.7 95.0 97.0 95.5 96.9 95.5 97.3 95.5 97.4 95.5
80 97.6 96.9 97.5 96.6 97.7 96.6 97.7 96.6 98.0 96.6
N LAM MA LAM MA LAM MA LAM MA LAM MA
5 65.2 60.6 61.5 57.1 61.4 57.1 64.9 59.6 64.8 59.6
20 91.2 80.9 88.3 84.0 88.2 84.0 84.7 79.3 84.7 79.3
40 95.3 92.5 93.3 91.6 93.2 91.5 93.3 91.6 93.2 91.5
60 96.4 93.9 94.5 92.8 94.5 92.8 94.5 92.8 94.5 92.8
80 98.0 96.5 96.6 93.4 96.7 93.5 96.6 93.4 96.7 93.5
density. While the commonly used kernels such as the RBF and the Mahalanobis
kernels evaluate the entire data using a fixed window, the kernels derived in this
chapter can automatically adjust their window size to adapt to local regions with
different densities. This enables them to effectively handle data with multiple distri-
bution forms. The proposed LA kernel approach was successfully applied to KSVM,
KDA and KSDA and shown to yield higher classification accuracies than classical kernels.
CHAPTER 5
5.1 Introduction
Thus far, we have focused on the model selection problem where the kernel pa-
rameters are learned given a known kernel function. In many applications, however,
we do not have prior knowledge of the data. Thus, we do not know which kernel
function may perform better. A major open problem in kernel learning is to define
algorithms that find the kernel mapping function best suited to most problems. Ideally, this should be done without assuming prior knowledge of the data.
Instead of learning the kernel parameters of a given kernel function, one could
try to directly learn the kernel matrix. Multiple kernel learning attempts to do just
that by combining a set of known kernel maps. For example, Cristianini et al. [18] propose to learn a linear combination of base kernel matrices. The coefficients determining how to combine the kernels are learned by aligning the
matrices with a target label matrix. Other authors, [51, 4, 49, 111] employ convex
optimization techniques within the context of Support Vector Machines (SVM) and
Kernel Discriminant Analysis (KDA). [50] proposes to learn the kernel matrix by
penalizing an Lp norm of the combination coefficients, leading to a more general
framework of multiple kernel learning. And, Crammer et al. [15] propose a boosting
approach based on the exponential and logarithmic loss. Finally, several nonlinear combinations of kernels have also been studied.
The multiple kernel learning approach just described, however, suffers from two major drawbacks. First, some kernels perform well in certain settings, while others outperform them in different settings. Second, the kernel matrix can only be
searched within the space defined by these pre-specified functions. If the kernels and
their parameters are not appropriately specified, the learned kernel matrix will not perform well either.
In this chapter, we derive an approach that overcomes these difficulties. Our ap-
proach borrows ideas from Genetic Algorithms (GA) to modify a large set (population)
of randomly initialized kernel matrices to optimize the metric induced by the kernel
mapping without the need to know the underlying kernel function. By doing so, we
also avoid the need to combine or optimize over several possible (or known) kernel
matrices.
Key to our approach is the definition of several novel operators in GA. The two
classical operators used in the literature are crossover and mutation. The former combines two individuals of the current population to generate the individuals of the next generation (called offspring).
mutations to existing individuals. These two procedures are however not sufficient to
efficiently search vast spaces [68], such as the one defining all possible kernel matrices.
In the present work, we derive three additional GA operators to facilitate this search.
One of the new operators emulates gene transposition. Consider the genome of a
species. Transposons are chunks of DNA that can move from one part of this genome
to another. This process was first described by Nobel Laureate Barbara McClintock
[64], when she noticed that the color changing pattern seen in corn is not random.
This effect was originally referred to as jumping genes. A typical gene transposition
is given by the cut-and-paste transposon. Here, enzymes cut a section of the DNA
and then insert it elsewhere. In our case, each genome describes a kernel matrix. A transposition thus moves part of the learned similarity measure to a new location, i.e., the classification function seen in one area of the feature space will now be applied to another section of the space. If this results in a lower classification error rate, the modified individual will be kept in the population.
Lacking a reproductive system, viruses need to insert their genome into that of the
invaded cell for replication. By doing so, gene coding and non-coding sections of the
host genome can vary. In our case, the insertion of a new section in the matrix could modify the learned norm in a beneficial way.
Our third operator is deletion. In living organisms, sections of the genome may
be deleted during meiosis [81]. In our case, deletion of a section of the matrix could
rearrange the classifiers (i.e., norm defined by the kernel) in a positive way.
The GA operators defined above facilitate the search through a vast domain, thus
addressing the problem of multiple kernel learning listed above. After the matrices
of the current population have been modified to create the offspring, we eliminate
those yielding the worst sample classification accuracies. The process is iterated until
convergence.
A problem with approaches that directly learn the kernel matrix (with no known
associated kernel function) is that they lack the capacity to map the test samples to the learned kernel space. A typical solution is to treat the problem transductively. Here, the testing data is used in combination with the training samples to resolve
the problem. Each time the testing data changes, the algorithm will compute a new
kernel matrix which can be used for both the training and testing sets. This approach is computationally expensive and may not generalize well to the test data, because the kernel matrix has not been optimized for them.
To address this issue, we derive a regression approach that estimates the kernel values encoding the similarity between the training and testing samples, allowing us to map any new test sample. This eliminates the need of having to relearn the kernel matrix each time a new test sample is to be classified. Our solution is to learn the underlying function represented by the learned kernel matrix. We show that this approach yields superior results to transductive learning, since it directly represents the learned function rather than the training samples alone.
The rest of the chapter is organized as follows. Section 5.2 introduces the nuts
and bolts of the proposed genetic algorithm search. Section 5.3 derives the non-
linear regression learning of the underlying function defined by the kernel solution
for its application in classification. Section 5.4 does the same for regression. Section 5.5 presents the experimental results.
5.2 Learning with Genetic Algorithms
Genetic Algorithms (GA) constitute a set of tools that are well suited for solving
mathematical optimization problems in large spaces where there are multiple local
minima and no clear indication of how to find them [36]. This is especially practical
when the search space is so vast that, despite computational improvements, one would
require years (and potentially centuries) to solve the problem if a reasonable area of the search space were to be explored exhaustively. In GA, the candidate solutions are called individuals, and the set of individuals is called the population. The first key step in GA is to define an
appropriate coding of the problem data as a genome. The most typical coding is
a feature vector with each element defining one of the parameters (or features or
variables) that play a role in our optimization problem. In this representation, each
entry in the feature vector codes for a directly relevant variable in the optimization problem, Fig. 5.1(a). In the present work, we additionally include non-coding segments in the feature vector (i.e., genome). As in biological systems, the coding and non-coding segments alternate with one another, Fig. 5.1(b). The coding segments will be referred to as genes (because they code for the kernel matrix
K which is our end result or outcome). This emulates the coding seen in actual
Figure 5.1: (a) The classical feature representation. Each entry in the feature vector codes
for a relevant variable in the optimization problem. (b) The proposed feature representation.
Each individual in the population is represented as a feature vector with coding and non-
coding segments. The lower case letters represent the coding (or gene) sequence used for
the calculation of the fitness function. Consecutive N labels indicate non-coding DNA.
biological systems. The elements defining the gene sequences are obtained from the entries of the kernel matrix, while the non-coding DNA sequences are generated at random. Each gene is preceded by a fixed sequence (or gene marker). This specifies where each of the genes starts in the genome. To read a genome, we proceed as follows. First we identify the positions of the gene markers, indicating where each coding DNA sequence starts. Since the genes are of a specified length, they can be easily read, and the kernel matrix can be reconstructed.
The genome representation defined in this section will allow us to derive novel
operators, such as transposition, deletion and insertion. This is so because we can now
make use of the non-coding sections of the genome to address some of the limitations of the classical operators.
Most GA use two major operators: crossover and mutation. In crossover, two individuals, $\mathbf{u}_i^{[t]}$ and $\mathbf{u}_j^{[t]}$, of the current population (i.e., two kernel matrices in our case) are selected at random. Here, $\mathbf{u}_i^{[t]} = (u_{i1}, \dots, u_{iq})^T \in \mathbb{R}^q$, $t$ specifies the iteration or population cycle, and $i, j \in \{1, \dots, p\}$, with $p$ the number of individuals in the population. An integer $r \in [1, q]$ is selected at random. Two offspring $\mathbf{u}_i^{[t+1]}$ and $\mathbf{u}_j^{[t+1]}$ (i.e., two individuals of the new generation) are obtained as
$$\mathbf{u}_i^{[t+1]} = (u_{i1}, \dots, u_{ir}, u_{jr+1}, \dots, u_{jq})^T$$
$$\mathbf{u}_j^{[t+1]} = (u_{j1}, \dots, u_{jr}, u_{ir+1}, \dots, u_{iq})^T. \qquad (5.1)$$
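A minimal sketch of (5.1) in Python/numpy, for genomes of equal length:

import numpy as np

def crossover(u_i, u_j, rng):
    # Eq. (5.1): swap the tails of two genomes after a random cut point r.
    r = int(rng.integers(1, len(u_i)))
    child_a = np.concatenate([u_i[:r], u_j[r:]])
    child_b = np.concatenate([u_j[:r], u_i[r:]])
    return child_a, child_b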
The combination of two well-fitted individuals can move the search to a distant area of the search space which could yield even higher classification rates. While one of the matrices (say, $\mathbf{u}_i$) helps classify samples in a region of the feature space, the other matrix could be instrumental in the classification of the samples of a different region.
The mutation procedure is meant to add random jumps within the search space
which are unlikely to occur with crossover or gradient descent techniques. Some
mutations will add small changes, with the aim to jump over a local minimum. Other
mutations will add large changes, moving the search to a completely different region
of the search space. The mutation operation works as follows. An individual from the
Figure 5.2: This figure illustrates the copy-and-paste transposition.
current population is selected at random, $\mathbf{u}_k^{[t]}$. A number $s$ of its entries are randomly selected, with $s = \lceil q\, p_m \rceil$, $p_m$ the mutation rate. Each of these entries $u_k^{[t]}(l_i)$ is substituted by a random value $b_i$,
$$u_k^{[t]}(l_i) = b_i, \quad l_i \in M, \; i = 1, \dots, s, \qquad (5.2)$$
where $M$ is the set containing the indices of the $s$ selected entries. The mutation value $b_i$ used in the above equation is bounded by the minimum and maximum of all the entries of $\mathbf{u}_k^{[t]}$.
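And a matching sketch of (5.2); drawing the replacement values uniformly within the genome's [min, max] range is one reasonable reading of the bounded mutation value:

import numpy as np

def mutate(u, p_m, rng):
    # Eq. (5.2): overwrite s = ceil(q p_m) randomly chosen entries, with the new
    # values bounded by the minimum and maximum of the genome's entries.
    u = u.copy()
    s = max(1, int(np.ceil(len(u) * p_m)))
    idx = rng.choice(len(u), size=s, replace=False)
    u[idx] = rng.uniform(u.min(), u.max(), size=s)
    return u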
5.2.3 Transposition
While crossover and mutation are typically used in GA, nature makes use of a richer set of mechanisms. One of them is transposition, where a sequence of DNA moves from one location to another within the genome. In our search space, transposition would
apply a local norm (or classifier) to a different region of the feature space. A norm
that does not work well in one area of the space, may be what is needed in another.
We model two major transposition mechanisms. The first one is called copy-and-
paste. Here, a short sequence of DNA is copied to RNA by transcription, and then
copied back into (inserted as) DNA by reverse transcription at a new position. This is
illustrated in Figure 5.2. Due to transcription noise, the copied sequence may diverge
slightly from its former self. To model this, let $\mathbf{v}$ be a transposon, $\mathbf{v} = (v_1, \dots, v_{L_t})^T$, where $L_t$ is its length. And, assume each entry of $\mathbf{v}$ is perturbed by a small Gaussian noise,
$$\hat{v}_i = v_i + s\, z_i, \quad i \in P, \qquad (5.3)$$
where $\hat{v}_i$ is the entry after perturbation, $s$ is the scale of the Gaussian noise, $z_i \sim N(0, 1)$, and $P$ is the set containing the indices of the perturbed entries. Suppose a genome $\mathbf{u}$ is selected and the insertion position is $t$; after copy-and-paste, the new individual is $\hat{\mathbf{u}} = (u_1, \dots, u_t, \hat{v}_1, \dots, \hat{v}_{L_t}, u_{t+1}, \dots, u_q)^T$. The second mechanism we model is cut-and-paste. In this case, a sequence of DNA is cut from its original position and inserted into a new position of the same genome, Figure 5.3. Since this process does not involve an RNA intermediate, it is not affected by noise. Formally, denote the cut position $t_0$ (with $t_0 < t$). Using the same notation as above, we define the new individual $\hat{\mathbf{u}}$ as
$$\hat{\mathbf{u}} = (u_1, \dots, u_{t_0-1}, u_{t_0+L_t}, \dots, u_t, v_{i_1}, \dots, v_{i_{L_t}}, u_{t+1}, \dots, u_q)^T. \qquad (5.4)$$
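Both mechanisms admit a direct implementation; a sketch assuming numpy, with the transposon length L_t supplied by the caller:

import numpy as np

def copy_and_paste(u, L_t, noise_scale, rng):
    # Eq. (5.3): copy a transposon, perturb it with Gaussian noise, insert it.
    start = int(rng.integers(0, len(u) - L_t))
    v = u[start:start + L_t] + noise_scale * rng.standard_normal(L_t)
    t = int(rng.integers(0, len(u)))
    return np.concatenate([u[:t], v, u[t:]])          # the genome grows by L_t

def cut_and_paste(u, L_t, rng):
    # Eq. (5.4): move a transposon to a new position, without transcription noise.
    start = int(rng.integers(0, len(u) - L_t))
    v = u[start:start + L_t].copy()
    rest = np.concatenate([u[:start], u[start + L_t:]])
    t = int(rng.integers(0, len(rest)))
    return np.concatenate([rest[:t], v, rest[t:]])    # length is unchanged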
The two transposition procedures described above work as follows. First, an individual of the current population is selected at random. Then, a transposon is selected from a random location in the genome and used in either copy-and-paste or cut-and-paste (at 50% each). Finally the transposon is inserted into a randomly cho-
sen position. Note that in the copy-and-paste mechanism, the length of the genome
Figure 5.3: This figure illustrates the cut-and-paste transposition.
is increased. This would not be admissible if we were using the classical feature representation, but it is not an issue when we employ the coding/non-coding model defined above.
In the deletion operator, a number of nucleotides can be deleted, from a single base pair to an entire piece of a genome. In nature, deletion is generally harmful, but, in some occasions, can lead to advantageous variations. In our algorithm, a sequence of length $L_d$, starting at a random position $t$, is chosen for deletion. More formally, denote $\mathbf{u} = (u_1, \dots, u_q)^T$ the selected genome and $\mathbf{v} = (v_1, \dots, v_{L_d})^T$ the deleted segment; the new individual is
$$\hat{\mathbf{u}} = (u_1, \dots, u_{t-1}, u_{t+L_d}, \dots, u_q)^T. \qquad (5.5)$$
Figure 5.4: This figure illustrates gene deletion operation for two cases. (a) Only a non-
coding sequence is deleted. (b) A part of gene is deleted and a new gene is formed.
Note that the length of the genome is hence decreased. Since the position of the deleted sequence is selected at random, it may happen that only a non-coding sequence is deleted, Figure 5.4(a). It is also possible to delete a coding part. In this latter case, the non-coding DNA right after the deleted segment becomes the coding segment,
Figure 5.4(b).
Deletion can eliminate a local norm (or classifier) that was causing problems and
substitute this for a randomly initialized alternative that can be improved with the
other optimization tools. This procedure can be especially useful for leaving large, unpromising regions of the search space.
Our last operator is insertion. A classical example of insertions is viral infections, where viruses integrate their genome into that of the host cell. The effect of insertion depends greatly on the location within the host's genome.
In our algorithm, we maintain a separate population of viral sequences, where each virus is a $L_v \times 1$ vector and $r$ is the size of this viral population. The virus population is allowed to evolve along with the main population.
The operators described above are used to generate $d > p$ individuals. The number $d$ is chosen to be larger than $p$ so that a fitness criterion can be used to determine the best fitted $p$ individuals that will survive and thus become the population of the next generation.
To initialize the algorithm, the first population of kernel matrices is generated from pre-specified kernel functions. This process combines the characteristics of different kernel functions and introduces much needed randomness to the initial population. The initial population set is formally defined as $\{K_1^{[0]}, \dots, K_p^{[0]}\}$.
A selection criterion is then used to determine the most fitted individuals that
are to survive to the next iteration. Since our goal is classification accuracy, we
employ the Bayes accuracy criterion of [114], which is one minus the Bayes error estimated in the kernel space.
More formally, let X = {x11 , . . . , x1n1 , . . . , xCnC } be a given training set, where
xij is the j th sample in class i, ni is the number of samples in class i and C is the total
number of classes. Let $\phi(.) : \mathbb{R}^l \to \mathcal{F}$ be a function defining the kernel map, where $l$
is the dimension of the input space. We assume that data has been whitened in the
kernel space, and denote K as the whitened kernel matrix for the training samples,
i.e., $K = \Phi(X)^T\Phi(X)$, where $\Phi(X) = (\phi(\mathbf{x}_{11}), \dots, \phi(\mathbf{x}_{1n_1}), \dots, \phi(\mathbf{x}_{Cn_C}))$. Then, the criterion $J(K)$ of (5.7) sums the estimated Bayes accuracies over all pairs of classes $i$ and $j$. Here, $\Phi(X_i) = (\phi(\mathbf{x}_{i1}), \dots, \phi(\mathbf{x}_{in_i}))$ is the subset of the mapped samples in class $i$, $\mathbf{1}_i$ is a $n_i \times 1$ vector with all elements equal to $1/n_i$, $w(.)$ is a weighting function, with $w(\Delta_{ij}) = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left(\frac{\Delta_{ij}}{2\sqrt{2}}\right)$, $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$ is the error function, $\Delta_{ij}$ and $S_{ij}$ are the distance between the means and the common variance of the pairwise class distributions of classes $i$ and $j$, and $K_i = \Phi(X)^T\Phi(X_i)$ is the subset of the kernel matrix for the samples in class $i$.
The kernel Bayes accuracy criterion defined in (5.7) is used to evaluate the fitness
of these $d$ offspring, i.e., $g_i = J(K_i)$, $i = 1, \dots, d$, where $g_i$ is the fitness value of the
ith genome. Then, the individuals that will form the new population are selected as
follows. First, an elitist selection strategy is applied. This means that the pf best
fitted individuals are kept. Another set of pn is randomly selected from the bottom
131
10%, i.e., the less fitted individuals. The values of pf and pn are selected to be
second group is used to maintain diversity in the population, which may help us jump
away from local minima in the future. The rest of the individuals p pf pn are
selected at random using a roulette wheel rule [36]. In the roulette wheel rule, the
gi
pi = Pdpf pn , (5.9)
i=1 gi
is given by
[t+1] [t]
|gm gm | < , (5.10)
[t]
where gm is the maximum fitness value of the population at iteration t, and > 0 is
small.
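The complete selection step can be sketched as follows (assuming non-negative fitness values, as given by the Bayes accuracy criterion; the function signature is ours):

import numpy as np

def select_survivors(pop, fitness, p, p_f, p_n, rng):
    fitness = np.asarray(fitness)
    order = np.argsort(fitness)[::-1]               # best fitted first
    keep = list(order[:p_f])                        # elitist step
    bottom = order[-max(1, len(pop) // 10):]        # bottom 10% for diversity
    keep += list(rng.choice(bottom, size=p_n, replace=False))
    # Roulette wheel (5.9) over the remaining candidates.
    rest = np.array([i for i in order if i not in keep])
    probs = fitness[rest] / fitness[rest].sum()
    keep += list(rng.choice(rest, size=p - p_f - p_n, replace=False, p=probs))
    return [pop[i] for i in keep]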
Since GA may converge to a local solution, we run the algorithm multiple times with different initial populations, then keep the solution with the best fitted individual of the final populations. The proposed kernel matrix learning procedure is summarized in Algorithm 5.1.
Once we learn the optimal kernel matrix of the training data from Algorithm 5.1, we can use it in any kernel-based approach such as KDA [67, 5], Kernel Subclass DA (KSDA) [113] and SVM [92]. However, the only information we have is the kernel matrix for the training data; we do not know the corresponding explicit
Algorithm 5.1 Kernel Matrix Learning with GA
Input: Training set $\mathbf{x}_1, \dots, \mathbf{x}_n$
Output: Kernel matrix $K^*$
for $i = 1$ to $a$ do
  Generate the initial population $K_1, \dots, K_p$.
  repeat
    1. Generate new individuals with the operators (5.1) to (5.6).
    2. Calculate the fitness values $g_i$ of the new individuals using (5.7).
    3. Select the survivors using (5.9).
  until $|g_m^{[t+1]} - g_m^{[t]}| < \epsilon$
  Store the most fitted individual, $K(i)$.
end for
$K^* = \arg\max_{K(i)} J(K(i))$
Return: $K^*$
kernel function to construct the kernel values which measure the similarity between the training and the test samples. A typical solution to this problem is to cast the classification problem as a transductive one. Given the labeled training and unlabeled test samples, one generates a common kernel matrix including the two sets. The kernel matrix is learned using an available approach, such as the one defined in this chapter. This means we need to relearn the kernel matrix each time a new testing sample becomes available. One could say that the learned mapping does not generalize to the unseen test data.
In the present section, we propose a novel solution to the above defined problem.
The idea is to estimate the underlying function represented by the learned kernel
matrix using regression. Formally, given $X$, i.e., a set of $n$ training samples, with $k_{ij}$ the $(i,j)$th entry in the learned kernel matrix, we want to find the function $\hat{f}(.)$ providing the best estimate of the true (but unknown) underlying function, where $\hat{f}(\mathbf{x}) = (\hat{f}_1(\mathbf{x}), \dots, \hat{f}_n(\mathbf{x}))^T$, and $\hat{f}_i(.) : \mathbb{R}^l \to \mathbb{R}$ is the $i$th regression function.
Let $f_i(x) = k_{x_i}(x) = \langle x_i, x \rangle$. To learn this underlying function, we need to use a non-linear approach. Kernel Ridge Regression (KRR) [42] provides the necessary flexibility and computational efficiency for this task. KRR minimizes the cost function
$$L(W) = \frac{1}{n}\sum_{i=1}^{n} \left\| y_i - W^T \phi(x_i) \right\|_2^2 + \lambda \|W\|_F^2, \qquad (5.11)$$
where $Y = (y_1, \ldots, y_n)$ is an $n \times n$ predictor matrix, $G$ is the Gram matrix with its $(i, j)$th entry defined as $G_{ij} = g(x_i, x_j)$ for some known kernel function $g$, $I_n$ is the $n \times n$ identity matrix, and $\lambda > 0$ is the regularization parameter.
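To make the out-of-sample extension concrete, the following sketch fits the $n$ regression functions jointly and predicts the kernel values of a new sample. It assumes the learned kernel matrix K is used as the predictor matrix Y, an RBF Gram matrix for g, and the standard closed-form KRR solution; the names and defaults are illustrative, not the verbatim implementation.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    """Gram matrix G_ij = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def fit_kernel_map(X, K, gamma=1.0, lam=1e-3):
    """Fit f(x) = (f_1(x), ..., f_n(x))^T with KRR so that f(x_j)
    approximates the j-th column of the learned kernel matrix K."""
    n = X.shape[0]
    G = rbf_gram(X, X, gamma)
    # dual coefficients: minimizing (5.11) gives A = (G + n*lam*I)^{-1} K
    A = np.linalg.solve(G + n * lam * np.eye(n), K)
    return lambda x_new: rbf_gram(np.atleast_2d(x_new), X, gamma) @ A

# usage: k_test = fit_kernel_map(X, K)(x_test) returns the estimated
# kernel values between x_test and each training sample
```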
Our kernel matrix learning approach is generic, since the learned kernel matrix can be plugged into any kernel-based method in settings such as classification, regression and clustering, provided that an appropriate fitness criterion is defined.
For illustration we employ KRR [42], since it is commonly used in many applications. There are two types of parameters in KRR to be learned: the kernel matrix $K$ and the regularization parameter $\lambda$. We use the proposed GA-based approach defined in the present work to jointly learn $K$ and $\lambda$. The generalized cross-validation (GCV) criterion is used for selecting $\lambda$ in ridge regression, and can be formally written as
$$\mathrm{GCV}(\lambda) = \frac{n \left\| \left( I_n - H(\lambda) \right) y \right\|_2^2}{\left( \mathrm{tr}\,\left( I_n - H(\lambda) \right) \right)^2},$$
where $H(\lambda)$ is the hat matrix, which projects the labels $y$ to the corresponding predicted labels $\hat{y}$, i.e., $\hat{y} = H(\lambda) y$. In KRR, the predicted labels $\hat{y}$ for the training data can be obtained as $\hat{y} = G \left( G + n\lambda I_n \right)^{-1} y$, so that $H(\lambda) = G \left( G + n\lambda I_n \right)^{-1}$.
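The GCV score above can be evaluated directly from the Gram matrix; a minimal sketch, assuming the KRR hat matrix just derived:

```python
import numpy as np

def gcv(G, y, lam):
    """Generalized cross-validation score for KRR with Gram matrix G,
    labels y, and regularization lam, with H(lam) = G (G + n*lam*I)^{-1}."""
    n = len(y)
    H = G @ np.linalg.inv(G + n * lam * np.eye(n))
    r = (np.eye(n) - H) @ y                      # residuals (I - H) y
    return n * (r @ r) / np.trace(np.eye(n) - H) ** 2

# e.g., pick lam on a log grid:
# lam_best = min(np.logspace(-6, 2, 30), key=lambda l: gcv(G, y, l))
```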
In order to jointly learn both $K$ and $\lambda$, the value of $\lambda$ is added at the end of each genome (i.e., as a new allele). In this way, the GA operations and selection do not require any modification.
Figure 5.5: (a) The XOR data classification problem. Samples shown as red triangles form one class and samples shown as blue circles form the other. (b) Classification accuracy as a function of the number of generations.
We first present a toy example to illustrate how the kernel matrix evolves and improves over the generations using our genetic-based algorithm. We consider the XOR data classification problem of Fig. 5.5(a). The data set contains two classes, each consisting of two clusters. A separate testing set from the same class distributions is generated to test the proposed approach. Fig. 5.5(b) plots the classification accuracy as the number of generations increases. We see that the classification accuracy gradually improves during the evolution.
We then illustrate how the learned kernel matrix evolves in Figure 5.6(a)-(f). In the beginning, we observe that sample similarity varies considerably within each class,
Figure 5.6: In this figure we show how the kernel matrix evolves. (a)-(f) illustrate the
kernel matrix in different generations.
which is due to the fact that the distance between the two clusters of the same class
is much larger than that within the same cluster. This means that the Euclidean
distance measure in the original space cannot capture the underlying sample similarity
within each class well. A good kernel matrix should indicate that the within-class
similarity is much larger than the between-class similarity. We can see in Figure 5.6
that using our algorithm, the within-class similarity gradually increases as the kernel
matrix evolves. This implies that our learned kernel matrix could induce a kernel
space where samples in the same class are as close as possible whereas samples in different classes are as far apart from each other as possible, leading to a much easier
classification problem.
To further evaluate how the kernel matrix is optimized over the generations, we adopt the kernel alignment [18] to measure how close the learned kernel matrix is to an ideal one, $K_0$, defined by the labels as $K_0(x_i, x_j) = 1$ if $y_i = y_j$ and $0$ otherwise, where $y_i$ is the class label of $x_i$. The kernel alignment between two kernel matrices $K$ and $K_0$ is defined as
$$A(K, K_0) = \frac{\langle K, K_0 \rangle_F}{\sqrt{\langle K, K \rangle_F \, \langle K_0, K_0 \rangle_F}},$$
where $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product between two matrices, defined as $\langle K_1, K_2 \rangle_F = \sum_i \sum_j K_1(x_i, x_j) K_2(x_i, x_j)$. The higher the kernel alignment is, the more similar the
two kernel matrices are. The kernel alignment between the learned kernel matrix and the ideal one is shown in Figure 5.7. We see that the learned kernel matrix gets closer to the ideal one as the generations proceed.
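Both the alignment and the ideal kernel matrix are straightforward to compute; a short sketch, with the ideal kernel built from the class labels as described above:

```python
import numpy as np

def kernel_alignment(K, K0):
    """A(K, K0) = <K, K0>_F / sqrt(<K, K>_F <K0, K0>_F)."""
    return np.sum(K * K0) / np.sqrt(np.sum(K * K) * np.sum(K0 * K0))

def ideal_kernel(labels):
    """K0_ij = 1 if samples i and j share a class label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)
```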
Figure 5.7: This plot shows the kernel alignment between the learned kernel matrix and
the ideal one over the generations.
We employed the derived approach to learn the kernel mapping of two popular kernel methods, KDA and SVM, on several data-sets.
In KDA, the results are compared to kernel selection with CV, the Fisher criterion of [108], and KBA [114]. These are denoted KDACV, KDAF, and KDAK, respectively. The nearest mean classifier is used in each of the corresponding subspaces. We choose this classifier because it is Bayes optimal if the data in the kernel space is linearly separable.
In SVM, our results are compared to those obtained with CV, transductive learning, and the traditional GA, denoted SVMCV, SVMT and SVMTR. We also provide a comparison with the multiple kernel learning algorithm of [4]. This algorithm applies sequential optimization to learn a convex combination of base kernels and is known as the Support Kernel Machine (SKM). We also use this learned kernel matrix in KDA and denote it KDAS. As a baseline, we also compare to the algorithm where the kernel matrix is given by a uniform combination of the base kernels; we denote this algorithm KDAU and SVMU. For all the algorithms using a single kernel function, the RBF kernel is used. For those algorithms where the parameters need to be selected, CV is employed. We further compare against a semi-supervised kernel matrix learning approach called kernel propagation (KP) [45]. In this approach, the full kernel matrix is constructed from a seed-kernel matrix by maximizing the smoothness of the mapping over the data graph. The parameter of the heat kernel used in calculating the affinity matrix is set as the averaged Euclidean distance from each data point to its ten nearest neighbors [45]. We denote this method
KDAKP and SVMKP. A typical range is given for the parameters of each kernel. The parameter for the RBF kernel is in $[m_1 - 2t_1, m_1 + 2t_1]$, where $m_1$ and $t_1$ are the mean and standard deviation of the pairwise sample squared distances; the parameter for the Laplacian kernel is in
Table 5.1: The parameters used in the experiments
Parameter  Value  Description
pc         0.8    crossover rate
pm         0.05   mutation rate
pf         0.05   percentage of the best fitted individuals kept
pn         0.03   percentage of the least fitted individuals kept
Lc         4      length of each gene
Lnc        10     length of each non-coding sequence
Lt         3      length of transposon
pt         0.02   transposition rate
s          0.01   scale of the Gaussian noise in transposition
pv         0.01   perturbation rate for each entry in transposon
pd         0.01   deletion rate
Ld         3      mean of the deletion length
L2d        4      variance of the deletion length
pi         0.1    insertion rate
Lv         5      length of each virus
r          6      size of the virus population
$[m_2 - 2t_2, m_2 + 2t_2]$, where $m_2$ and $t_2$ are the mean and standard deviation of the pairwise sample distances; in the polynomial kernel, the degree is in $[1, 5]$ and the kernel is normalized as
$$\langle x_i, x_j \rangle' = \frac{\langle x_i, x_j \rangle}{\sqrt{\langle x_i, x_i \rangle \langle x_j, x_j \rangle}}. \qquad (5.15)$$
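The normalization in (5.15) can also be applied directly to a precomputed kernel matrix; a short sketch:

```python
import numpy as np

def normalize_kernel(K):
    """Normalized kernel (5.15): K'_ij = K_ij / sqrt(K_ii * K_jj)."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```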
The parameter setup in our experiments is shown in Table 5.1. KRR is used to
train the embedding function, and the RBF kernel is used in KRR.
Table 5.2: KDA Recognition rates (in percentages) in the UCI data-sets.
Data-set KDAGA KDAT R KDAKP KDAT KDAU KDAS KDAK KDAF KDACV
Breast C. 76.4(2.9) 72.3(2.4) 70.4(3.9) 69.1(4.1) 65.8(3.7) 62.6(4.6) 68.4(3.3) 64.4(2.1) 66.6(5.2)
Ionosphere 93.4(1.3) 92.3(2.4) 87.2(2.3) 85.1(2.4) 94.6(2.0) 74.6(5.9) 80.6(4.1) 80.6(4.1) 86.6(1.2)
Liver 80.6(3.8) 76.8(3.1) 66.0(6.1) 66.4(5.8) 66.4(5.7) 69.9(7.1) 65.5(1.6) 65.8(4.4) 73.3(5.4)
Monk 1 94.0(3.7) 90.0(4.1) 86.2(4.0) 77.3(8.0) 84.7(4.5) 82.0(5.1) 85.3(5.1) 86.7(4.7) 84.0(6.0)
Monk 2 96.0(5.5) 95.3(5.1) 90.6(2.4) 92.7(4.4) 94.0(4.4) 93.3(5.3) 93.3(5.3) 90.7(4.4) 90.3(5.3)
Pima 78.4(1.6) 76.4(1.4) 70.6(4.9) 69.5(5.0) 71.6(3.5) 70.7(4.3) 71.2(3.5) 70.4(3.5) 72.5(2.5)
5.5.3 UCI Repository
We apply the kernel learning approaches defined in this section to six data-sets
from the UCI repository [7]. In the Breast Cancer data-set, the task is to discriminate
two classes: no-recurrence and recurrence. The Ionosphere set is for the radar-based detection of two classes (the presence or absence of structure) in the ionosphere.
In the BUPA liver disorders set, blood test measurements are used to detect liver dysfunction. The goal of the Monk problem is to distinguish two distinct postures of a robot; Monk 1 and 2 denote two alternative scenarios. Finally, the NIH Pima data-set is used to detect diabetes.
For each data-set, we created five random partitions of the data, each with 80% of the samples for training and the remaining 20% for testing. The successful classification rates on the above data-sets are shown in Tables 5.2 and 5.3. Both the mean and standard deviation (in parentheses) are reported. A paired t-test is used to check statistical significance. The classification rate in bold is significantly higher than the others at significance level 0.05. The proposed approach outperforms the other kernel learning algorithms. A comparison of the proposed regression-based extension versus the typical transductive alternative is also favorable to the proposed approach. In addition, our approach does not need to re-estimate the kernel matrix every time a new testing sample arrives. Moreover, the approach described in the present paper defines a smaller kernel matrix, with smaller memory requirements.
We also report the training time in kernel matrix learning for each algorithm in Table 5.4. Since no training is needed in the algorithm of uniform kernel combination, we do not include this algorithm in the comparison.

Table 5.4: Average training time (in seconds) of each algorithm in the UCI data-sets.

From Table 5.4, we first observe that all the algorithms with multiple kernels need more training time than those with a single kernel. Transductive learning is computationally expensive and slower than our algorithm. Yet, our algorithm is slower than SKM in these binary classification data-sets. However, we will see later that SKM becomes slower as the number of classes grows. Another relevant question is how fast the proposed algorithm converges. This is, of course, problem specific. Figure 5.8(a) and (b) plot the classification accuracy as a function of the number of generations for two of the databases used above. To obtain these plots, we executed our approach 50 times. The figures show the mean accuracy at each generation.
Another interesting question is how well the proposed optimization approach com-
pares to the traditional GA algorithm with crossover and mutation only. Moreover,
how do the proposed advanced GA operators help to improve the kernel matrix? To
see this, we present additional plots for the traditional GA algorithm and for each of the proposed operators in isolation. First, in Figure 5.8(b) and (g), we see that the traditional GA algorithm can improve the classification accuracy as the kernel matrix
Figure 5.8: Plots of the classification accuracy (y-axis) versus the number of generations (x-axis). Plots (a)-(e) were obtained with different optimization approaches applied to KDA on the Monk 1 database, and plots (f)-(j) were obtained with different optimization approaches applied to SVM on the Breast Cancer database. (a) and (f) show the proposed genetic-based optimization approach. (b) and (g) show the traditional GA algorithm with crossover and mutation only. (c) and (h) show the GA algorithm with the transposition operator only. (d) and (i) show the GA algorithm with the deletion operator only. (e) and (j) show the GA algorithm with the insertion operator only.
Table 5.5: KDA Recognition rates (%) for large data-sets.
evolves. However, the final accuracies it obtains are lower than those obtained by the
proposed optimization approach. This means that the proposed additional operators
could further facilitate the optimization process and improve the classification per-
formance. From Figure 5.8(c)-(e) and (h)-(j), we see that each of the proposed new
operators can help optimizing the kernel matrix to improve the classification accu-
racy to some extent. For the same data-set, one operator may perform better than another, e.g., Figure 5.8(g) and (d). Some operators work better in one data-set than in another, e.g., Figure 5.8(e) and (j). It is the combination of all of these operators that yields the largest improvements.
Our next experiment is on the PIE data-set of face images [86]. Here, the task is to classify faces according to the identity of the individual shown in the image. All face images were aligned with regard to the main facial features and resized to a standard size of 32 × 32 pixels, as in [60]. The results are in Tables 5.5 and 5.6. In these tables, N specifies the number of images per class used to train the kernel matrix.
Table 5.6: SVM Recognition rates (%) for large data-sets.
Table 5.7: Average training time (in seconds) of each algorithm in large data-sets.
The results are averaged over five random trials. As above, the proposed approach outperforms the alternatives.
We also used the Sitting Posture Distribution Maps (SPDM) data-set of [117]. In this data-set, samples were collected using a chair equipped with a pressure sensor sheet located on the seat-pan and back-rest of the chair. A total of 1,280 pressure values from 50 individuals are provided from the pressure maps. There are five samples of each of the ten different postures per individual. The goal is to classify each of the samples into one of the ten sitting postures. We randomly selected 3 samples of each posture and each individual for training, and used the rest for testing. The results are then averaged over five trials and shown in Tables 5.5 and 5.6. The proposed approach again outperforms the others.
We report the average training time of each algorithm in Table 5.7. We again see that our algorithm is faster than transductive learning. Moreover, in this case, our algorithm is also faster than SKM. This is because SKM can only learn a kernel matrix for two classes; when there are multiple classes, a one-versus-one mechanism has to be used to extend it to the multi-class case. Thus, the training time greatly depends on the total number of classes: the more classes there are, the more training time it takes. Our algorithm, in contrast, can directly deal with the multi-class case and is thus more efficient.
In this section, we give a detailed discussion of how the genetic operators help to optimize the kernel matrix. First, note that each genome $u$ on which the genetic operators are directly applied is formed by concatenating all the entries of a matrix $L$, where $K = L^T L$, and $K$ is the kernel matrix to be learned. Denote $L = (l_1, l_2, \ldots, l_n)$, where $l_i$ is the $i$th column of $L$. This means that $K(x_i, x_j) = l_i^T l_j$. Thus, changes in the genome $u$ result in corresponding changes in the entries of the kernel matrix $K$. With this interpretation, we can discuss how each genetic operator works to improve the kernel matrix.
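This interpretation is easy to verify numerically: decoding a genome into $L$ and perturbing a single column $l_q$ changes exactly the $q$th row and column of $K = L^T L$, which is the mechanism exploited by the operators discussed next. A toy sketch (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
u = rng.standard_normal(n * n)       # genome: concatenated entries of L
L = u.reshape(n, n)                  # decode genome into L = (l_1, ..., l_n)
K = L.T @ L                          # kernel matrix, K(x_i, x_j) = l_i^T l_j

q = 2
L2 = L.copy()
L2[:, q] += rng.standard_normal(n)   # a local change of the vector l_q
K2 = L2.T @ L2

changed = np.argwhere(~np.isclose(K, K2))
# all changed entries lie in row q or column q, as in (5.16)
assert np.all((changed[:, 0] == q) | (changed[:, 1] == q))
```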
Crossover exchanges a portion of the genome, and hence a submatrix of the kernel matrix, between two individuals (note that since the kernel matrix is symmetric, only the lower off-diagonal entries need to be encoded). If the exchanged submatrix defines a better local similarity, then after crossover the offspring $K_i^{[t+1]}$ can improve the classification in the region represented by this submatrix in $K_j^{[t]}$.
The insertion operator inserts a new sequence at a random position of a genome. Note that our representation incorporates the non-coding sequences into the genome, allowing a flexible length of the genome. Thus, the insertion of a sequence corresponds to a local change of the kernel matrix. More formally, suppose that the insertion of the sequence causes a change of a vector $l_q$ in $L_i^{[t]}$; then this will result in a change of the corresponding row and column in the kernel matrix $K_i^{[t+1]}$, i.e.,
$$K_i^{[t+1]} = \begin{pmatrix} & & l_1^T l_q & & \\ & & \vdots & & \\ l_q^T l_1 & \cdots & l_q^T l_q & \cdots & l_q^T l_n \\ & & \vdots & & \\ & & l_n^T l_q & & \end{pmatrix}. \qquad (5.16)$$
This change will affect the similarity between the $q$th sample and all the other samples in the data. As a result, the local classification function is changed by insertion, which could help resolve misclassifications in a local region of the feature space.
The deletion operator removes a subsequence of the genome, which leads to a corresponding local change of the kernel matrix. If we again suppose that the deletion causes a change of a vector $l_q$, then, similarly to the insertion operator, this results in a change of the corresponding row and column of the kernel matrix, such that the classification in a section of the feature space could be improved.
In copy-and-paste transposition, a transposon is copied from one position to a new position in the same genome. Suppose the transposon is copied from the region encoding $l_p$ to the region encoding $l_q$. This will cause a change of the $q$th column and row of the kernel matrix. This
implies that a local classification function with good performance is now also applied to a new region of the feature space. If this improves classification, then the new kernel matrix will be selected over the old one. In cut-and-paste transposition, a transposon is moved to a new position in the same genome. Again, suppose the transposon is moved from $l_p$ to $l_q$. This will cause a change of both the $p$th and the $q$th column and row of the kernel matrix. This implies that a local classification function that does not work well in one section of the feature space will now be applied to a new section of the feature space. If this improves classification, then the new kernel matrix will be selected over the old one.
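At the genome level, the two transposition variants amount to copying or moving a short slice of the genome. The toy sketch below ignores the non-coding sequences and the Gaussian noise used by the actual operators, and simply overwrites the region encoding $l_q$:

```python
import numpy as np

def transpose_copy(u, p, q, Lt):
    """Copy-and-paste transposition: the transposon u[p:p+Lt] is copied
    over position q, changing the region that encodes l_q."""
    v = u.copy()
    v[q:q + Lt] = u[p:p + Lt]
    return v

def transpose_move(u, p, q, Lt):
    """Cut-and-paste transposition: the transposon is moved, so the
    regions encoding both l_p and l_q change."""
    v = u.copy()
    v[q:q + Lt] = u[p:p + Lt]
    v[p:p + Lt] = 0.0            # vacated source region (one simple choice)
    return v
```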
We select 7 data-sets from the UCI machine learning repository [7] and the DELVE collection [29]. In the Boston Housing data-set, the task is to predict the median value of a home. In the Auto MPG set, fuel consumption is predicted in terms of 3 discrete and 4 continuous attributes. In the Normtemp set, the goal is to predict the heart rate based on the gender and body temperature of 130 people. The Airport set requires prediction of the enplaned revenue in tons of mail. The task in the Puma-8nm and Puma-8nh sets is to predict the angular acceleration of a robot arm, under moderate and high amounts of noise, respectively. The Kin problem requires us to predict the distance of the end-effector of a robot arm from a target. Two cases with moderate and high amounts of noise are considered, denoted Kin-8nm and Kin-8nh.
For the first four data-sets, we randomly select 90% of the samples for training, and use the rest for testing. This is repeated 10 times and the mean and standard deviation of the errors are reported. The remaining databases have a larger number of samples, allowing a random split into disjoint subsets. The first 1,024 samples in each subset are used for training, while the others form the testing set. Again, we report the mean and standard deviation of the errors over four splits. We use the root mean squared error (RMSE) as our measure of the deviation between the true response $y_i$ and the predicted response $\hat{y}_i$, i.e.,
$$\mathrm{RMSE} = \left[ n^{-1} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \right]^{1/2}.$$
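For reference, the RMSE can be computed as in the short sketch below:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted responses."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```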
We compare our approach to KRR and Support Vector Regression (SVR). In KRR, the parameters are selected by CV or GCV; the parameters in SVR are selected by CV. A recent work [87] introduces the use of multiple kernels into SVR, allowing a multiple kernel learning approach for regression by using semi-infinite linear programming. Later, [76] showed how regression with multiple kernel learning can be solved more efficiently. Here, we compare to the approach in [76], and denote it MKL-SVR. Another work [13] applies multiple kernel learning to KRR; we denote it MKL-KRR. We also provide comparative results with transductive learning, the traditional GA, and uniform kernel combination, denoted KRRT, KRRTR and KRRU. The results of the proposed approach are denoted KRRGA.

The regression performances of all algorithms are shown in Table 5.8. The proposed approach yields the lowest RMSE in most data-sets.
We also show the training time in Table 5.9. We see that our algorithm takes training time comparable to the other two multiple kernel learning algorithms, i.e.,
Table 5.8: Mean and standard deviation of the RMSE.
Data-set KRRGA KRRT R KRRKP KRRT KRRCV KRRGCV SVRCV MKL-SVR KRRU MKL-KRR
Housing 2.75(0.77) 2.66(0.46) 5.73(2.54) 2.93(0.90) 3.27(0.79) 3.35(1.08) 3.35(1.30) 3.11(1.09) 2.52(0.77) 2.53(0.84)
Mpg 2.24(0.26) 2.73(0.60) 2.96(0.50) 2.50(0.45) 2.62(0.28) 2.96(0.60) 3.01(0.66) 2.82(0.73) 2.76(0.35) 2.70(0.35)
Normtemp 5.56(1.15) 6.79(1.08) 7.24(1.40) 6.45(1.15) 7.00(0.79) 7.44(0.85) 7.35(1.30) 7.58(1.60) 7.85(0.80) 8.32(1.42)
Puma-8nm 1.40(0.02) 1.44(0.02) 3.11(0.02) 1.51(0.03) 1.62(0.02) 1.60(0.03) 1.44(0.03) 2.27(0.42) 1.70(0.04) 1.77(0.04)
Puma-8nh 3.52(0.06) 3.52(0.11) 4.18(0.11) 3.61(0.07) 3.56(0.08) 3.54(0.09) 3.46(0.13) 3.68(0.08) 3.72(0.07) 3.66(0.07)
Kin-8nm 0.10(0.003) 0.11(0.003) 0.16(0.002) 0.12(0.001) 0.14(0.002) 0.13(0.004) 0.11(0.002) 0.12(0.01) 0.10(0.003) 0.11(0.002)
Kin-8nh 0.18(0.003) 0.18(0.003) 0.21(0.004) 0.20(0.003) 0.19(0.004) 0.19(0.003) 0.19(0.005) 0.19(0.009) 0.18(0.004) 0.18(0.004)
MKL-SVR and MKL-KRR, with the advantage that better predictions are achieved.
To conclude, we apply our approach to age estimation from face images. The aging process induces significant changes in human facial appearance, which are generally detectable in images. We used the FG-NET aging database of [2] to model these changes. This database includes 1,002 face images of 82 subjects at different ages. The ages range from 0 to 69. Face images include changes in illumination, pose, expression and occlusion (e.g., glasses and beards). All images are warped to a
Table 5.10: MAE of the proposed approach and the state-of-the-art in age estimation.
Data-set KRRGA KRRT R KRRKP KRRT KRRCV KRRGCV SVRCV MKL-SVR KRRU MKL-KRR
MAE 5.87(0.22) 5.95(0.31) 12.89(0.65) 6.31(0.30) 6.59(0.31) 13.83(0.79) 6.46(0.35) 7.18(0.46) 27.2(19.7) 8.05(0.40)
standard size of 60 × 60 pixels with all major facial features properly aligned, as in [60]. We represent each image as a vector concatenating all the pixels of the image. We generate five random partitions of the data, each with 800 images for training and 202 for testing. The mean absolute errors (MAE) are in Table 5.10. Again, we can see that the proposed approach outperforms the other algorithms in predicting age.
5.6 Conclusions
In this chapter, we have presented a kernel matrix learning approach based on genetic algorithms with the advanced operators of transposition, insertion and deletion. These include viral infections that result in DNA changes and yield an efficient search strategy within the vast space of all possible kernel matrices. Regression is then used to estimate the underlying mapping function given by the resulting kernel matrix, resolving the complexity issues of transductive learning. We also extended the proposed kernel matrix learning framework to work in regression. Comparative results against classical kernel methods demonstrate the superiority of the proposed approach. We have also shown that the algorithm converges quickly in practice.
CHAPTER 6
6.1 Conclusions
Kernel methods have been extensively used in machine learning and shown to have good performance in many applications. A fundamental problem in these methods is how to determine the mapping model that leads to better learning and improved generalization. In this dissertation, we have addressed several model selection problems for kernel methods in pattern recognition and machine learning. Many criteria have been proposed for kernel optimization in the literature, but these are not directly related to the idea of the Bayes optimal classifier in the kernel space, which is the classifier with the smallest possible classification error. Our approaches are inspired by Bayes optimality and we fully exploit this idea. In the first approach, we want to achieve the original goal of the kernel mapping: making the class distributions in the kernel space linearly separable. To this end, we derive a criterion that measures the degree of homoscedasticity of the class distributions. Then, the kernel parameters are optimized to simultaneously maximize the homoscedasticity and the separability between the pairwise class distributions. This optimization enforces the linear
separability of the classes to the largest extent. To relax the single Gaussian distri-
bution assumption for each class, we use a mixture of Gaussians to define each class
and show that our criterion can be easily modified to adapt to this new modeling. We
also show how our approach can be efficiently employed using a quasi-Newton based
optimization technique.
In the second approach, we directly minimize the Bayes classification error in the kernel space over all the kernel mappings to optimize the kernel parameters. This is plausible because different kernel representations result in different Bayes errors. We first derive an effective measure which approximates the Bayes accuracy (defined as one minus the Bayes error) in the kernel space, and then maximize this measure to find the optimal kernel parameters. We further show how to employ our criterion to discover the underlying subclass divisions of the data. Extensive experiments demonstrate both the effectiveness and efficiency of our methods over the state of the art.
We have also studied model selection in kernel-based regression approaches. Model selection in linear regression has been largely studied. The key is to achieve a good balance between the model fit and model complexity in a regression model. In general, one cannot simultaneously reduce both of them: if one is reduced, the other increases, and vice versa. We first derive measures for model fit and model complexity from a decomposition of the generalization error of the learned function and show that balancing the two measures leads to good model selection. We employ this framework in kernel ridge regression and kernel principal component regression, which are two popularly used kernel regression methods. Experimental results show that the proposed approach performs generally better than other model selection criteria.
In the kernel methods literature, the Gaussian RBF kernel is one of the most popular and successful kernels. In this kernel, the sample similarity is evaluated using a fixed local window size. Thus, over-fitting or under-fitting problems may arise if the local data density changes. We introduced a new family of kernels, called Local Density Adaptive Kernels, in Chapter 4. The window size of our kernels varies to adaptively fit the local data density, thus giving a better likelihood evaluation by implicitly changing the shape of the kernels. We show that our kernels are Mercer kernels, and hence can be directly used in any kernel method, such as Kernel Discriminant Analysis and Support Vector Machine. We then show that our kernels outperform fixed-shape kernels such as the RBF kernel and the Mahalanobis kernel.
Thus far we had only considered a single kernel function in kernel methods. In many applications, the use of multiple kernel functions would be more appropriate since it combines the characteristics of all kernels, leading to better learning. In the literature, many approaches have been proposed to construct a linear or nonlinear combination of base kernels. We took a different route in Chapter 5 by employing genetic algorithms. The main advantage of our method is that there is no need to specify an explicit combination of multiple kernels; the kernel matrix can evolve over the generations using the genetic operators until a good solution is found. We define a genetic representation for each kernel matrix and present advanced operators to facilitate the optimization process. We then show how to learn a mapping function represented by the learned kernel matrix to generalize to the test data. We applied this approach to a variety of classification and regression problems.
6.2 Future Work

In this dissertation, we have focused on one important problem in kernel methods, i.e., model selection. This problem directly determines the performance of kernel methods. Another important problem is the computational cost of these kernel methods, which involves both computational time and memory. For a data-set with n samples, kernel methods typically require at least several n × n matrices to be stored in memory, which needs a large amount of memory space when n is large. Since the size of real-world data is commonly huge, if we want to apply kernel methods to such data, we need to find some way to reduce the computational cost in order to make them work efficiently in practice.
One possible solution is to define sparse learning techniques. For example, the learning model could be represented by a smaller portion of the data, i.e., a representative subset of the data obtained during learning. This could be extremely useful when high redundancy exists in the data. We can also explore how our model selection approaches extend to other learning tasks. Thus far, we have only considered classification and regression. There are many other useful applications such as data clustering, manifold learning, ranking, etc. Since the goal of learning differs across these applications, different model selection methods are needed for each specific application domain.
BIBLIOGRAPHY
[1] S. Abe. Training of support vector machines with Mahalanobis kernels. In Proc. International Conference on Artificial Neural Networks, pages 571–576, 2005.
[3] E. E. Andersen and A. D. Andersen. The MOSEK interior point optimizer for linear programming: An implementation of the homogeneous algorithm. High Performance Optimization, pages 197–232, 2002.
[8] L. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Mathematics and Mathematical Physics, 7:200–217, 1967.
[11] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131–159, 2002.
[12] B. Chen, L. Yuan, H. Liu, and Z. Bao. Kernel subclass discriminant analysis.
Neurocomputing, 2007.
[19] F. De la Torre and O. Vinyals. Learning kernel expansions for image classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7, 2007.
[20] J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
[22] A. Desai, H. Singh, and V. Pudi. GEAR: Generic, efficient, accurate kNN-based regression. In Int'l Conf. on Knowledge Discovery and Information Retrieval, 2010.
[24] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
[26] G. Fan and J. Gray. Regression tree analysis using TARGET. Journal of Computational and Graphical Statistics, 14(1):1–13, 2005.
[34] T. Glasmachers and C. Igel. Maximum likelihood model selection for 1-norm soft margin SVMs with multiple parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1522–1528, 2010.
[35] C. Gold and P. Sollich. Model selection for support vector machine classification. Neurocomputing, 55:221–249, 2003.
[38] M. Gonen and E. Alpaydin. Localized multiple kernel learning. In Proc. Inter-
national Conference on Machine Learning, 2008.
[39] Y. Y. Haimes, L. S. Lasdon, and D. A. Wismer. On a bicriterion formulation of the problems of integrated system identification and system optimization. IEEE Transactions on Systems, Man, and Cybernetics, pages 296–297, 1971.
[40] O. C. Hamsici and A. M. Martinez. Bayes optimality in linear discriminant analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:647–657, 2008.
[41] O. C. Hamsici and A. M. Martinez. Rotation invariant kernels and their appli-
cation to shape analysis. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2009.
[42] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.
Springer-Verlag (2nd Edition), New York, NY, 2001.
[43] X. He and P. Niyogi. Locality preserving projections. In Proc. Advances in
Neural Information Processing Systems 16, 2004.
[44] L. Holmstrom and P. Koistinen. Using additive noise in back-propagation training. IEEE Transactions on Neural Networks, 3(1):24–38, 1992.
[45] E. Hu, S. Chen, D. Zhang, and X. Yin. Semisupervised kernel matrix learning by kernel propagation. IEEE Transactions on Neural Networks, 21(11):1831–1841, 2010.
[46] T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher kernel method to detect remote protein homologies. In Proc. International Conference on Intelligent Systems for Molecular Biology, pages 149–158, 1999.
[47] N. Karmarkar. A new polynomial time algorithm for linear programming. Combinatorica, 4(4):373–395, 1984.
[48] V. Katkovnik and I. Shmulevich. Kernel density estimation with varying window size. Pattern Recognition Letters, 23:1641–1648, 2002.
[49] S.J. Kim, A. Magnani, and S. Boyd. Optimal kernel selection in kernel Fisher discriminant analysis. In Int. Conf. Machine Learning, pages 465–472, 2006.
[50] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. lp-norm multiple kernel learning. Journal of Machine Learning Research, 12:953–997, 2011.
[51] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[52] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[53] B. Leibe and B. Schiele. Analyzing appearance and contour based methods
for object categorization. In Proc. IEEE Conference on Computer Vision and
Pattern Recognition, 2003.
[54] J. Liu, J. Chen, S. Chen, and J. Ye. Learning the optimal neighborhood kernel
for classification. In International Joint Conference on Artificial Intelligence,
Pasadena, California, 2009.
[62] A. M. Martinez and O. C. Hamsici. Who is LB1? Discriminant analysis for the classification of specimens. Pattern Rec., 41:3436–3441, 2008.
[63] A. M. Martinez and M. Zhu. Where are linear feature extraction methods applicable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1934–1944, 2005.
[64] B. McClintock. The origin and behavior of mutable loci in maize. In Proceedings of the National Academy of Sciences of the USA, volume 36, pages 344–355, 1950.
[65] G. McLachlan and K. Basford. Mixture Models: Inference and applications to
clustering. Marcel Dekker, 1988.
[66] K. Miettinen. Nonlinear Multiobjective Optimization, volume 12 of Interna-
tional Series in Operations Research and Management Science. Kluwer Aca-
demic Publishers, Dordrecht, 1999.
[67] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. Muller. Fisher discriminant analysis with kernels. In Proc. IEEE Neural Networks for Signal Processing Workshop, pages 41–48, 1999.
[68] M. Mitchell. An Introduction to Genetic Algorithms. MIT Press, 1996.
[69] P. J. Moreno, P. P. Ho, and N. Vasconcelos. A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In Advances in Neural Information Processing Systems, 2003.
[70] E. A. Nadaraya. On estimating regression. Theory of Probability and its Applications, 9:141–142, 1964.
[71] M. H. Nguyen and F. De la Torre. Robust kernel principal component analysis.
In Advances in Neural Information Processing Systems, 2008.
[72] F. Odone, A. Barla, and A. Verri. Building kernels from binary strings for image matching. IEEE Transactions on Image Processing, 14(2):169–180, 2005.
[73] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1065–1076, 1962.
[74] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428:419–422, 2004.
[75] O. Pujol and D. Masip. Geometry-based ensembles: Towards a structural characterization of the classification boundary. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(6):1140–1146, 2009.
[76] S. Qiu and T. Lane. A framework for multiple kernel support vector regression and its applications to siRNA efficacy prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(2):190–199, 2009.
[77] Y. Radhika and M. Shashi. Atmospheric temperature prediction using support vector machines. International Journal of Computer Theory and Engineering, 1(1):55–58, 2009.
[78] C. R. Rao. The utilization of multiple measurements in problems of biological classification. J. Royal Statistical Soc., B, 10:159–203, 1948.
[79] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[82] B. Scholkopf. The kernel trick for distances. In Advances in Neural Information Processing Systems, pages 301–307, 2000.
[84] Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support
Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[85] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cam-
bridge University Press, 2004.
[86] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression
(PIE) database. In Proceedings of the 5th IEEE International Conference on
Face and Gesture Recognition, 2002.
[89] G. Terrell and D. Scott. Variable kernel density estimation. The Annals of Statistics, 20(3):1236–1265, 1992.
[90] C. M. Theobald. An inequality for the trace of the product of two symmetric matrices. Proceedings of the Cambridge Philosophical Society, 77:256–267, 1975.
[91] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, (1):211–244, 2001.
[92] V. Vapnik. The Nature of Statistical Learning Theory. New York: Springer,
1995.
[93] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.
[94] V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, and T. Petsche (Eds.), Advances in Neural Information Processing Systems 9, The MIT Press, Cambridge, MA, 1996.
[95] M. Varma and B. R. Babu. More generality in efficient multiple kernel learning. In Proc. International Conference on Machine Learning, pages 465–472, 2009.
[96] Grace Wahba. Spline Models for Observational Data. Society for Industrial and
Applied Mathematics, 1990.
[98] L. Wang, K.L. Chan, P. Xue, and L.P. Zhou. A kernel-induced space selection approach to model selection in KLDA. IEEE Trans. Neural Networks, 19:2116–2131, 2008.
[99] S. Wang, W. Zhu, and Z. Liang. Shape deformation: SVM regression and application to medical image segmentation. In Proceedings of International Conference on Computer Vision, 2001.
[100] Yong Wang. A New Approach to Fitting Linear Models in High Dimensional
Spaces. PhD dissertation, University of Waikato, 2000.
[105] L. Wolf and A. Shashua. Learning over sets using kernel principal angles. Journal of Machine Learning Research, 4:913–931, 2003.
[106] G. Wu and E. Chang. Adaptive feature-space conformal transformation for imbalanced-data learning. In Proc. International Conference on Machine Learning, pages 816–823, 2003.
[108] H. Xiong, M.N.S. Swamy, and M.O. Ahmad. Optimizing the kernel in the empirical feature space. IEEE Transactions on Neural Networks, 16(2):460–474, 2005.
[109] Jian Yang, Alejandro F. Frangi, Jing-yu Yang, David Zhang, and Zhong Jin. KPCA plus LDA: A complete kernel Fisher discriminant framework for feature extraction and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(2):230–244, 2005.
[110] Ming-Hsuan Yang. Kernel eigenfaces vs. kernel fisherfaces: Face recognition
using kernel methods. In Proc. IEEE International Conference on Automatic
Face and Gesture Recognition, 2002.
[111] J. Ye, S. Ji, and J. Chen. Multi-class discriminant kernel learning via convex programming. J. Machine Lear. Res., 9:719–758, 2008.
[112] D. Yeung, H. Chang, and G. Dai. Learning the kernel matrix by maximizing a KFD-based class separability criterion. Pattern Recognition, 40:2021–2028, 2007.