
Model Selection in Kernel Methods

Dissertation

Presented in Partial Fulfillment of the Requirements for


the Degree Doctor of Philosophy in the
Graduate School of The Ohio State University

By

Di You, M.S.

Graduate Program in Electrical and Computer Engineering

The Ohio State University

2011

Dissertation Committee:
Aleix M. Martinez, Adviser
Yuan F. Zheng
Yoonkyung Lee
© Copyright by

Di You

2011
ABSTRACT

Kernel methods have been extensively studied in pattern recognition and machine

learning over the last decade, and they have been successfully used in a variety of

applications. A main advantage of kernel methods is that nonlinear problems such as

classification and regression can be efficiently solved using classical linear approaches.

The performance of kernel methods greatly depends on the selected kernel model. The

model is defined by the kernel mapping and its parameters. Different models result in

different generalization performance. Hence, model selection in kernel methods is an

important problem and remains a challenge in the literature. In this dissertation, we

propose several approaches to address this problem. Our approaches can determine

good learning models by optimizing both the kernels and all other parameters in the

kernel-based algorithms.

In classification, we develop an algorithm yielding class distributions that are

linearly separable in the kernel space. The idea is to enforce the homoscedasticity

and separability of the pairwise class distributions simultaneously in the kernel space.

We show how this approach can be employed to optimize kernels in discriminant

analysis. We then derive a criterion to search for a good kernel representation by

directly minimizing the Bayes classification error over different kernel mappings.

In regression, we derive a model selection approach to directly balance the model

fit and model complexity using the framework of multiobjective optimization. We

develop an algorithm to obtain the Pareto-optimal solutions which balance the trade-

off between the model fit and model complexity. We show how the proposed method

is related to minimizing the predicted generalization error of the learning function.

In our final algorithm, the kernel matrix is recursively learned with genetic algo-

rithms until the classification/prediction error falls below a threshold. We derive a

family of adaptive kernels to better fit the data with various densities and show their

superiority over the commonly used fixed-shape kernels.

Extensive experimental results demonstrate that the proposed approaches are su-

perior to the state of the art.

To my parents and my wife

ACKNOWLEDGMENTS

First of all, I greatly thank my advisor, Dr. Aleix M. Martinez, for his guidance, support, and patience throughout my PhD work. I have learned a lot from him, including a rigorous scientific attitude, methods for doing good research, and the spirit of a researcher. He has guided me towards the completion of this dissertation and my PhD study.

I would also like to thank all my friends and labmates: Onur Hamsici, Hongjun Jia, Liya Ding, Paulo Gotardo, Samuel Riveras, Fabian Benitez-Quiroz, Shichuan Du, Yong Tao, and Felipe Giraldo. I have benefited a lot from the many discussions with them and have really had a good time in the lab.

Finally, but most importantly, I want to thank my parents and my wife. It is my parents who have given me endless love, care, and support so that I could finish this long and difficult process. I am also grateful to my wife for her love and encouragement throughout this process.

This research was partially supported by the US National Institutes of Health

under grant R21 DC 011081 and R01 EY 020834.

VITA

July 27, 1984 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Born - Jiamusi, Heilongjiang, China

2007 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.S. Electrical Engineering,
Harbin Institute of Technology, China

2010 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.S. Department of Electrical and Computer Engineering,
The Ohio State University, USA

2007-2011 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Graduate Research Associate,
Department of Electrical and Computer Engineering,
The Ohio State University, USA

PUBLICATIONS

Research Publications

D. You, O. C. Hamsici and A. M. Martinez. Kernel Optimization in Discriminant


Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33,
no. 3, pp. 631-638, 2011.

D. You and A. M. Martinez. Bayes Optimal Kernel Discriminant Analysis. In


Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 3533-3538,
2010.

D. You and A. M. Martinez. Multiobjective Optimization for Model Selection in


Kernel Methods in Regression. Submitted to IEEE Transactions on Pattern Analysis
and Machine Intelligence.

D. You and A. M. Martinez. Kernel Matrix Learning with Genetic Algorithm.
Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence.

D. You and A. M. Martinez. Local Density Adaptive Kernels. Submitted to IEEE


Transactions on Neural Networks.

FIELDS OF STUDY

Major Field: Electrical and Computer Engineering

Studies in Pattern Recognition and Computer Vision: Prof. Aleix M. Martinez

TABLE OF CONTENTS

Page

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

Chapters:

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Kernel methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5


1.2 Literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Kernel parameter selection . . . . . . . . . . . . . . . . . . 10
1.2.2 Kernel matrix learning . . . . . . . . . . . . . . . . . . . . . 14
1.2.3 New kernel development . . . . . . . . . . . . . . . . . . . . 16
1.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . 17

2. Kernel Learning in Discriminant Analysis . . . . . . . . . . . . . . . . . 21

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 The metrics of discriminant analysis . . . . . . . . . . . . . . . . . 24
2.3 Homoscedastic criterion . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.1 Maximizing homoscedasticity . . . . . . . . . . . . . . . . . 30
2.3.2 Derivation of the Gradient . . . . . . . . . . . . . . . . . . . 38
2.3.3 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . 40

2.4 Kernel Bayes accuracy criterion . . . . . . . . . . . . . . . . . . . . 42
2.4.1 Bayes accuracy in the kernel space . . . . . . . . . . . . . . 43
2.4.2 Kernel parameters with gradient ascent . . . . . . . . . . . 45
2.4.3 Subclass extension . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.4 Optimal subclass discovery . . . . . . . . . . . . . . . . . . 47
2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.5.1 Homoscedastic criterion . . . . . . . . . . . . . . . . . . . . 50
2.5.2 KBA criterion . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3. Model Selection in Kernel Methods in Regression . . . . . . . . . . . . . 62

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2.1 Generalization error . . . . . . . . . . . . . . . . . . . . . . 66
3.2.2 Model fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2.3 Roughness penalty in RBF . . . . . . . . . . . . . . . . . . 70
3.2.4 Polynomial kernel . . . . . . . . . . . . . . . . . . . . . . . 72
3.2.5 Comparison with other complexity measure . . . . . . . . . 73
3.3 Multiobjective Optimization . . . . . . . . . . . . . . . . . . . . . . 76
3.3.1 Pareto-Optimality . . . . . . . . . . . . . . . . . . . . . . . 76
3.3.2 The ε-constraint approach . . . . . . . . . . . . . . . . . . . 77
3.3.3 The modified ε-constraint . . . . . . . . . . . . . . . . . . . 79
3.3.4 Alternative Optimization Approaches . . . . . . . . . . . . . 83
3.4 Applications to Regression . . . . . . . . . . . . . . . . . . . . . . . 84
3.4.1 Kernel Ridge Regression . . . . . . . . . . . . . . . . . . . . 84
3.4.2 Kernel Principal Component Regression . . . . . . . . . . . 86
3.5 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.5.1 Standard data-sets . . . . . . . . . . . . . . . . . . . . . . . 90
3.5.2 Comparison with the state of the art . . . . . . . . . . . . . 92
3.5.3 Alternative Optimizations . . . . . . . . . . . . . . . . . . . 95
3.5.4 Comparison with the L2 norm . . . . . . . . . . . . . . . . . 95
3.5.5 Age estimation . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.5.6 Weather prediction . . . . . . . . . . . . . . . . . . . . . . . 97
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4. Local Density Adaptive Kernels . . . . . . . . . . . . . . . . . . . . . . . 101

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101


4.2 Local Density Adaptive Kernels . . . . . . . . . . . . . . . . . . . . 104
4.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.2.2 Defining Mercer kernels . . . . . . . . . . . . . . . . . . . . 105

4.2.3 Window size . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.2.4 A case study . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.3 Kernel Parameter Selection . . . . . . . . . . . . . . . . . . . . . . 112
4.3.1 k-fold cross-validation . . . . . . . . . . . . . . . . . . . . . 112
4.3.2 Kernel Bayes accuracy criterion . . . . . . . . . . . . . . . . 112
4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.4.1 UCI benchmark data-sets . . . . . . . . . . . . . . . . . . . 113
4.4.2 Image databases . . . . . . . . . . . . . . . . . . . . . . . . 114
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

5. Kernel Matrix Learning with Genetic Algorithms . . . . . . . . . . . . . 119

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119


5.2 Learning with Genetic Algorithms . . . . . . . . . . . . . . . . . . 123
5.2.1 Feature representation . . . . . . . . . . . . . . . . . . . . . 123
5.2.2 Basic operators . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.2.3 Transposition . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.2.4 Deletion and insertion . . . . . . . . . . . . . . . . . . . . . 128
5.2.5 Selection criterion . . . . . . . . . . . . . . . . . . . . . . . 130
5.3 Generalizing to Test Samples in Classification . . . . . . . . . . . . 132
5.4 Kernel Matrix Learning in Regression . . . . . . . . . . . . . . . . 134
5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.5.1 A toy example . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.5.2 Classification algorithms . . . . . . . . . . . . . . . . . . . . 139
5.5.3 UCI Repository . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.5.4 Large databases . . . . . . . . . . . . . . . . . . . . . . . . 146
5.5.5 Discussions of the genetic operators . . . . . . . . . . . . . . 148
5.5.6 Application to regression . . . . . . . . . . . . . . . . . . . 151
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

6. Conclusions and future work . . . . . . . . . . . . . . . . . . . . . . . . . 156

6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156


6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

LIST OF TABLES

Table Page

2.1 Recognition rates (in percentages) with nearest mean . . . . . . . . . 54

2.2 Recognition rates (%) with nearest neighbor . . . . . . . . . . . . . . 55

2.3 Recognition rates (%) with the smooth nearest-neighbor classifier . . 55

2.4 Recognition rates (%) with linear SVM . . . . . . . . . . . . . . . . . 56

2.5 Training time (in seconds) . . . . . . . . . . . . . . . . . . . . . . . . 56

2.6 Recognition rates (%) with nearest neighbor. Bold numbers specify the
top recognition obtained with the three criteria in KSDA and KDA.
An asterisk specifies a statistical significance on the highest recognition
rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

2.7 Recognition rates (%) with the classification method of [75]. . . . . . 58

2.8 Recognition rates (%) with linear SVM. . . . . . . . . . . . . . . . . . 59

3.1 Results for KRR. Mean RMSE and standard deviation (in parentheses). . 89

3.2 Results for KPCR. Mean RMSE and standard deviation (in parentheses). . 89

3.3 Mean and standard deviation of RMSE of different methods. . . . . . . . 93

3.4 Comparison of our results with the state of the art. . . . . . . . . . . . . 93

3.5 Regression performance with alternative optimization criteria. . . . . . . . 94

3.6 Comparison with L2 norm. . . . . . . . . . . . . . . . . . . . . . . . . . 95

3.7 MAE of the proposed approach and the state of the art in age estimation. 97

3.8 RMSE of several approaches applied to weather prediction. . . . . . . . . 98

4.1 Recognition rates (%) with CV in UCI data-sets. . . . . . . . . . . . 114

4.2 Recognition rates (%) with KBA criterion in UCI data-sets. . . . . . 115

4.3 Recognition rates (%) with CV in ETH-80. . . . . . . . . . . . . . . . 116

4.4 Recognition rates (%) with KBA criterion in ETH-80. . . . . . . . . . 116

4.5 Recognition rates (%) with CV in PIE database. . . . . . . . . . . . . 117

4.6 Recognition rates (%) with KBA criterion in PIE database. . . . . . . 118

5.1 The parameters used in the experiments . . . . . . . . . . . . . . . . 141

5.2 KDA Recognition rates (in percentages) in the UCI data-sets. . . . . 142

5.3 SVM Recognition rates (%) in the UCI data-sets. . . . . . . . . . . . 142

5.4 Average training time (in seconds) of each algorithm in the UCI data-sets.144

5.5 KDA Recognition rates (%) for large data-sets. . . . . . . . . . . . . 146

5.6 SVM Recognition rates (%) for large data-sets. . . . . . . . . . . . . . 147

5.7 Average training time (in seconds) of each algorithm in large data-sets. 147

5.8 Mean and standard deviation of the RMSE. . . . . . . . . . . . . . . 153

5.9 Average training time (in seconds) of each algorithm. . . . . . . . . . 153

5.10 MAE of the proposed approach and the state-of-the-art in age estimation. 154

LIST OF FIGURES

Figure Page

1.1 This figure illustrates the idea of kernel methods. The data in the original
space is nonlinearly separable. Using a mapping function φ(·), the data can
be mapped to a higher dimensional space where the data becomes linearly
separable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.2 Here we show an example of two non-linearly separable class distributions,


each consisting of 3 subclasses. (a) Classification boundary of LDA. (b)
SDA's solution. (c) KDA's solution. . . . . . . . . . . . . . . . . . . . . 8

1.3 Here we show an example of two kernel mappings. (a) The Gaussian RBF
kernel. σ is the kernel parameter. The kernel value measuring the sample
similarity on x is determined by the nearby samples of x. (b) The polynomial
kernel. d is the degree of the kernel. The kernel value measuring the sample
similarity on x is determined by all the samples. . . . . . . . . . . . . . 9

2.1 Three examples of the use of the homoscedastic criterion, Q1. The examples
are for two Normal distributions with equal covariance matrix up to scale
and rotation. (a) The value of Q1 decreases as the rotation angle increases. The
2D rotation angle between the two distributions is on the x axis. The value of Q1
is on the y axis. (b) When the angle is 0°, the two distributions are homoscedastic,
and Q1 takes its maximum value of .5. Note how for distributions that are
close to homoscedastic (i.e., angles near 0°), the value of the criterion remains high.
(c) When the angle is 45°, the value has decreased to about .4. (d) By 90°, Q1 is about .3. 33

2.2 Here we show a two-class classification problem with multi-modal class dis-
tributions. When the kernel parameter equals 1, both KDA (a) and KSDA (b) generate
solutions that have small training error. (c) However, when the model complexity is
small (kernel parameter equal to 3), KDA fails. (d) KSDA's solution resolves this problem with
piecewise smooth, nonlinear classifiers. . . . . . . . . . . . . . . . . . . . 41

2.3 The original data distributions are mapped to different kernel spaces via
different mapping functions φ(·). The second mapping is better than the first in terms of the
Bayes error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

2.4 Comparative results between the (a-d) KBA and (e-h) Fisher criteria. The
true underlying number of subclasses per class are (a,e) 2, (b,f) 3, (c,g) 4,
and (d,h) 5. The x-axis specifies the number of subclasses Hi . The y-axis
shows the value of the criterion given in (2.12) in (a-d) and of the Fisher
criterion in (e-h). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.5 (a) The classical XOR classification problem. (b) Plot of the KBA criterion
versus Hi . (c) Plot of the Fisher criterion. . . . . . . . . . . . . . . . . . 49

2.6 Shown here are (a) 8 categories in ETH-80 database and (b) 10 different
objects for the cow category. . . . . . . . . . . . . . . . . . . . . . . . 51

2.7 Plots of the value of the derived criterion as a function of the kernel param-
eter and the number of subclasses. From left to right and top to bottom:
AR, ETH-80, Monk 1, and Ionosphere databases. . . . . . . . . . . . . . 60

3.1 The two plots in this figure show the contradiction between the RSS and
the curvature measure with respect to: (a) the kernel parameter, and (b)
the regularization parameter in Kernel Ridge Regression. The Boston
Housing data-set [7] is used in this example. Note that in both cases, while
one criterion increases, the other decreases. Thus, a compromise between
the two criteria ought to be determined. . . . . . . . . . . . . . . . . . . 72

3.2 Here we show a case of two objective functions. u(S) represents the set
of all the objective vectors with the Pareto frontier colored in red. The
Pareto-optimal solution can be determined by minimizing u1 given that
u2 is upper-bounded by ε. . . . . . . . . . . . . . . . . . . . . . . . . . 78

3.3 Comparison between the proposed modified and the original ε-constraint
methods. We have used * to indicate the objective vector and o to spec-
ify the solution vector. Solutions given by (a) the ε-constraint method
and (b) the proposed modified ε-constraint approach on the first exam-
ple, and (c) the ε-constraint method and (d) the modified ε-constraint ap-
proach on the second example. Note that the proposed approach identifies
the Pareto-frontier, while the original algorithm identifies weakly Pareto-
solutions, since the solution vectors go beyond the Pareto-frontier. . . . . 82

3.4 Sample images showing the same person at different ages. . . . . . . . . . 97

3.5 This figure plots the estimated (lighter dashed curve) and actual (darker
dashed curve) maximum daily temperature for a period of more than 200
days. The estimated results are given by the algorithm proposed in this
chapter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

4.1 A two class example. Each class is represented by a mixture of two Gaussians
with different covariance matrices. The RBF and the proposed Local-density
Adaptive (LA) kernels are evaluated at the four marked points. (a)
Density estimation in the RBF kernel uses a fixed window, illustrated by
black circles. Note that this fixed window cannot capture different local
densities. (b) Density estimation with the proposed LA kernel. . . . . . . 102

4.2 This figure illustrates how the local variance measurement given by (4.7)
is used. The axis represents the magnitude of the variance around each
sample. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

4.3 (a) A case study with synthetic data simulating the classical XOR problem.
(b) classification accuracies of the proposed LA and RBF kernels under dif-
ferent covariance factors c. The proposed kernel obtains higher classification
accuracies than the RBF as c increases. . . . . . . . . . . . . . . . . . . 111

4.4 Shown here are sample images from PIE data-set. . . . . . . . . . . . . 116

5.1 (a) The classical feature representation. Each entry in the feature vector
codes for a relevant variable in the optimization problem. (b) The proposed
feature representation. Each individual in the population is represented as a
feature vector with coding and non-coding segments. The lower case letters
represent the coding (or gene) sequence used for the calculation of the fitness
function. Consecutive N labels indicate non-coding DNA. . . . . . . . . 124

5.2 This figure illustrates the copy-and-paste transposition. . . . . . . . . . 126

5.3 This figure illustrates the cut-and-paste transposition. . . . . . . . . . . 128

5.4 This figure illustrates gene deletion operation for two cases. (a) Only a non-
coding sequence is deleted. (b) A part of gene is deleted and a new gene is
formed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

5.5 (a) A XOR data classification problem. Samples shown as red triangles form one
class and samples shown as blue circles form the other. (b) This plot shows the
classification accuracy over the number of generations. . . . . . . . . . . 136

5.6 In this figure we show how the kernel matrix evolves. (a)-(f) illustrate the
kernel matrix in different generations. . . . . . . . . . . . . . . . . . . . 137

5.7 This plot shows the kernel alignment between the learned kernel matrix and
the ideal one over the generations. . . . . . . . . . . . . . . . . . . . . 139

5.8 Plots of the classification accuracy (y-axis) versus number of generations (x-
axis). The plots from (a) to (e) were obtained with different optimization
approaches applied to KDA using monk1 database, and the plots from (f)
to (j) were obtained with different optimization approaches applied to SVM
using breast cancer database. (a) and (f) show the proposed genetic-based
optimization approach. (b) and (g) show the traditional GA algorithm with
crossover and mutation only. (c) and (h) show GA algorithm with transition
operator only. (d) and (i) show GA algorithm with deletion operator only.
(e) and (j) show GA algorithm with insertion operator only. . . . . . . . 145

CHAPTER 1

INTRODUCTION

The goal of pattern recognition is to describe, recognize, classify, and group pat-

terns of interest. While it seems an easy task for humans, such as identifying a

person and recognizing different objects, it is very challenging to teach computers to

recognize patterns.

Over the decades, extensive research has been conducted in this field and a number

of approaches for pattern recognition have been developed. These pattern recogni-

tion techniques have been widely used in a variety of fields such as computer vi-

sion, artificial intelligence, bioinformatics, psychology and paleontology. Well-known

applications are in face recognition and verification, automated speech recognition,

fingerprint identification, DNA sequence analysis to name but a few.

Since many common pattern recognition algorithms are probabilistic in nature,

statistical pattern recognition approaches have been most intensively studied and em-

ployed in practice [32]. In such approaches, each pattern is represented in terms of

d features and is viewed as a point in a d-dimensional vector space. Patterns are

assumed to be generated by a probabilistic model, and statistical concepts and ap-

proaches are employed to build the decision boundary or model the distribution of the

data. Depending on whether the training samples are labeled or unlabeled, statistical

pattern recognition can be divided into two categories: supervised and unsupervised

methods.

In supervised learning, the goal is to predict a functional relationship between

the objects and their associated labels. If the labels are discrete, the correspond-

ing problem is a classification problem. Well-known approaches for classification are

Linear Discriminant Analysis (LDA) [32] and Support Vector Machines (SVMs) [93].

If the labels are continuous, we talk about regression. The least-squares solutions

and their variants [42] (e.g. ridge regression) are popular approaches for regression.

Unsupervised learning seeks to determine how the data are organized. Data represen-

tation (e.g., principal component analysis) and clustering (e.g., k-means) are typical

examples in this class of approaches.

In this dissertation, we focus on the supervised learning approaches.

Among the many approaches to supervised learning that have been developed thus

far, Discriminant Analysis (DA) is one of the earliest and most used techniques in

pattern recognition. It has been used for feature extraction and classification with

broad applications in, for instance, computer vision [32], gene expression analysis [23]

and paleontology [62]. In his ground-breaking work, Fisher [27, 28] derived a DA

approach for the two Normally distributed class problem, $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$,
under the assumption of equal covariance matrices, $\Sigma_1 = \Sigma_2$. Here, $\mu_i$ and $\Sigma_i$ are the
mean feature vector and the covariance matrix of the $i$th class, and $N(\cdot)$ represents the

Normal distribution. The assumption of identical covariances (i.e., homoscedasticity)

implies that the Bayes (optimal) classifier is linear, which is the reason why we refer

to this algorithm as Linear Discriminant Analysis (LDA). LDA thus provides the

one-dimensional subspace where the Bayes classification error is the smallest in the

2-class homoscedastic problem.
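
To make the link between homoscedasticity and linearity explicit, the short derivation below (a standard result, stated here only for completeness) shows that when the two classes share a covariance matrix $\Sigma$, the quadratic terms of the Bayes log-likelihood ratio cancel; $\pi_1$ and $\pi_2$ denote the class priors and $g(x)$ the resulting decision function.

```latex
\begin{aligned}
g(x) &= \log\frac{\pi_1 N(x;\mu_1,\Sigma)}{\pi_2 N(x;\mu_2,\Sigma)}\\
     &= -\tfrac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)
        +\tfrac{1}{2}(x-\mu_2)^T\Sigma^{-1}(x-\mu_2)
        +\log\frac{\pi_1}{\pi_2}\\
     &= (\mu_1-\mu_2)^T\Sigma^{-1}x
        -\tfrac{1}{2}\bigl(\mu_1^T\Sigma^{-1}\mu_1-\mu_2^T\Sigma^{-1}\mu_2\bigr)
        +\log\frac{\pi_1}{\pi_2}.
\end{aligned}
```

Since $g(x)$ has the form $w^T x + b$ with $w = \Sigma^{-1}(\mu_1 - \mu_2)$, the Bayes (optimal) decision boundary $g(x) = 0$ is a hyperplane, which is exactly the linear classifier referred to above.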

Fisher's work was later extended to solve the multi-class classification problem in

a least-squares framework [78]. In this solution, LDA employs two symmetric, posi-

tive semi-definite matrices, each defining a metric [63]. One of these metrics should

measure within-class differences and, as such, should be minimized. The other metric

should account for between-class dissimilarity and should thus be maximized. Classi-

cal choices for the first metric are the within-class scatter matrix $S_W$ and the sample
covariance matrix $\Sigma_X$, while the second metric is usually given by the between-class
scatter matrix $S_B$. The sample covariance matrix is defined as

$$\Sigma_X = n^{-1} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T, \qquad (1.1)$$

where $X = \{x_1, \dots, x_n\}$ are the $n$ training samples, $x_i \in \mathbb{R}^p$, and $\mu = n^{-1} \sum_{i=1}^{n} x_i$ is
the sample mean. The between-class scatter matrix is given by

$$S_B = \sum_{i=1}^{C} p_i (\mu_i - \mu)(\mu_i - \mu)^T, \qquad (1.2)$$

where $\mu_i = n_i^{-1} \sum_{j=1}^{n_i} x_{ij}$ is the sample mean of class $i$, $x_{ij}$ is the $j$th sample of class $i$,
$n_i$ is the number of samples in that class, $C$ is the number of classes, and $p_i = n_i/n$
is the prior of class $i$. LDA's solution is then given by the generalized eigenvalue
decomposition equation $\Sigma_X^{-1} S_B V = V \Lambda$, where the columns of $V$ are the eigenvectors
and $\Lambda$ is a diagonal matrix of corresponding eigenvalues. Thus, the solution of LDA
indicates a $(C-1)$-dimensional subspace where the between-class scatters are maximized
and the within-class scatters are minimized.
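
For concreteness, the following is a minimal numerical sketch of the LDA solution just described, assuming the training samples are stored in a NumPy array X (n rows, p columns) with integer class labels y. The function and variable names, and the small ridge term added to keep the sample covariance invertible, are illustrative choices rather than part of the original formulation.

```python
import numpy as np

def lda_subspace(X, y, reg=1e-6):
    """Solve the generalized eigenproblem Sigma_X^{-1} S_B V = V Lambda."""
    n, p = X.shape
    mu = X.mean(axis=0)                          # overall sample mean
    Sigma_X = (X - mu).T @ (X - mu) / n          # sample covariance, Eq. (1.1)

    S_B = np.zeros((p, p))                       # between-class scatter, Eq. (1.2)
    for c in np.unique(y):
        Xc = X[y == c]
        d = (Xc.mean(axis=0) - mu)[:, None]
        S_B += (Xc.shape[0] / n) * (d @ d.T)     # weighted by the class prior

    # Eigenvectors of Sigma_X^{-1} S_B; a small ridge keeps Sigma_X invertible.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sigma_X + reg * np.eye(p), S_B))
    order = np.argsort(-evals.real)
    C = len(np.unique(y))
    return evecs[:, order[:C - 1]].real          # the (C-1)-dimensional subspace

# Toy usage: three 2-D Gaussian classes projected onto the two LDA directions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, (50, 2)) for m in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 50)
Z = X @ lda_subspace(X, y)
```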

The idea of LDA is attractive, because we could obtain a linear classifier with the

smallest classification error (also known as the Bayes error) provided that the class distributions are single Gaussians and the covariance matrices are identical. However, in practice, the class distributions can be highly non-Gaussian and distinct from each other, which makes the assumptions of LDA very restrictive. In other words, if the real data distributions deviate from this underlying assumption, then LDA would not work well. This is the major drawback of LDA.

To relax this assumption, numerous approaches have been proposed in the litera-

ture. Loog and Duin [57] define a within-class similarity metric using the Chernoff

distance which incorporates the differences of both means and covariance matrices,

yielding an algorithm which can handle heteroscedastic (i.e., non-homoscedastic) dis-

tributions. Another way is to allow each class to be divided into several subclasses

by imposing a mixture of Gaussians for each class distribution. This is the underly-

ing idea of subclass DA (SDA) [116]. Since a mixture of Gaussians is more flexible

to model the underlying class distributions than a single Gaussian, this approach is

shown to perform well for a variety of applications. To loosen the parametric restric-

tion of the above assumption, Fukunaga and Mantock [31] redefine the between-class

scatter matrix in a non-parametric fashion, and the decision boundary is constructed

locally. Specifically, a local classifier for each sample is first built based upon the

sample and its local k nearest neighbors, and then the final decision boundary is

constructed by combining all the local classifiers.

The classifiers obtained from the above approaches are linear or piecewise linear.

However, such classifiers may not be adequate for a classification problem with a

highly nonlinear decision boundary. This is because the features in such ap-

proaches are extracted from a linear combination of the features in the original space.

To derive a nonlinear classifier, a nonlinear combination of the original features would

be more appropriate. Recently, kernel methods have been developed to tackle the

nonlinear problem.

1.1 Kernel methods

Kernel methods have attracted great interest over the past decade and have been

shown to perform well in nonlinear feature extraction and classification [84,

93]. The idea is to use a kernel function which maps the original nonlinearly separable

data to a very high or even infinite dimensional space where the data is linearly

separable, see Figure 1.1. Then, any efficient linear classification approach can be

employed in this so-called kernel space. Since the mapping is intrinsic, one does not

need to work with an explicit mapping function. Instead, one can employ the kernel

trick [84], allowing nonlinear formulations to be cast in terms of inner products. This

will result in a space of the same dimensionality as that of the input representation

while still eliminating the nonlinearity of the data.

Formally, suppose a training data-set $\{(x_i, y_i)\}_{i=1}^{n}$ is given, where $x_i \in \mathbb{R}^p$ is the
$i$th observation and $y_i$ is the corresponding label of $x_i$, with $y_i \in \mathbb{R}$ in regression and
$y_i \in \{-1, 1\}$ in classification. In general, a function $f(x)$ is built to model the
functional relationship between $x$ and $y$ (note that in classification, the class label is
obtained by $\mathrm{sgn}(f(x))$, where $\mathrm{sgn}(\cdot)$ is the sign function). $f(x)$ can be modeled as
a linear function of $x$, i.e.,

$$f(x) = w^T x + b, \qquad (1.3)$$

where $w \in \mathbb{R}^p$ is a weight vector and $b \in \mathbb{R}$ is an offset. However, this linear model
fails to capture the nonlinearity that usually exists in the data. In this case, kernel
methods can be used to model a nonlinear function.

Figure 1.1: This figure illustrates the idea of kernel methods. The data in the original
space is nonlinearly separable. Using a mapping function φ(·), the data can be mapped to
a higher dimensional space where the data becomes linearly separable.

Let $\phi(\cdot): \mathbb{R}^p \rightarrow F$ be a function defining a kernel mapping which maps the data
in the original space to the kernel space defined by $F$. Then (1.3) can be rewritten
as

$$f(x) = w^T \phi(x) + b, \qquad (1.4)$$

where $w$ is the weight vector in the kernel space. Unfortunately, the dimensionality
of $F$ may be too large, which makes it difficult to work with the explicit features
in the kernel space. To bypass this problem, the kernel trick [84] is generally used.
Specifically, from the Representer's Theorem [96], the weight vector $w$ can be defined
as a linear combination of the samples in the kernel space $\phi(X)$ with the coefficient
vector $\alpha$, i.e.,

$$w = \phi(X)\alpha, \qquad (1.5)$$

where $\phi(X) = (\phi(x_1), \dots, \phi(x_n))$ and $\alpha \in \mathbb{R}^n$. Substituting (1.5) into (1.4), we get

$$f(x) = \alpha^T \phi(X)^T \phi(x) + b = \alpha^T k(x) + b = \sum_{i=1}^{n} \alpha_i \langle x_i, x \rangle + b, \qquad (1.6)$$

where $\langle x_i, x \rangle$ is the inner product of $\phi(x_i)$ and $\phi(x)$, i.e., $\langle x_i, x \rangle = \phi(x_i)^T \phi(x)$. We
thus see that the model $f(x)$ just derived is linear in the kernel space but nonlinear
in the original one. Therefore, by specifying an appropriate kernel mapping function,
the nonlinearity of the original data is eliminated and a linear approach can be used
in the kernel space to efficiently solve the problem.
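
As an illustration of how (1.6) is used in practice, the sketch below fits a kernel ridge regressor in its dual form: the coefficient vector α is computed from the kernel (Gram) matrix alone, and predictions only require kernel evaluations between training and test samples. The choice of the RBF kernel, the regularization constant lam, and the omission of the offset b are assumptions made for this example, not prescriptions taken from the text.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_dual(X, y, sigma=1.0, lam=1e-2):
    """Dual (kernel) ridge regression: alpha = (K + lam I)^{-1} y."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_test, sigma=1.0):
    """f(x) = sum_i alpha_i k(x_i, x), the finite-sample form of Eq. (1.6)."""
    return rbf_kernel(X_test, X_train, sigma) @ alpha

# A nonlinear 1-D example: the model is linear in the kernel space only.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)
alpha = fit_dual(X, y, sigma=0.8)
y_hat = predict(X, alpha, X, sigma=0.8)
```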

There are a variety of algorithms in kernel methods. Kernel Discriminant Analysis

(KDA) [5, 67] is one of the most used kernel methods. KDA is a kernel extension

of LDA. It aims to simultaneously maximize the between-class scatter and minimize

the within-class scatter of the data in the kernel space. Ideally, if the kernel function

and associated parameters are set appropriately, the class distributions will become

homoscedastic in the kernel space and the smallest classification error (i.e., Bayes

error) can be obtained from the resultant linear Bayes classifier. The performance of

a KDA classifier is illustrated in Fig. 1.2.
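
A minimal two-class sketch of this idea is given below, using the common dual formulation of the kernel Fisher discriminant (the projection direction is expressed through a coefficient vector α over the training samples). The RBF kernel, the regularization constant, and the midpoint threshold on the projected class means are assumptions of this illustration, not details taken from the chapter.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kfd_fit(X, y, sigma=1.0, reg=1e-3):
    """Two-class kernel Fisher discriminant in dual form (labels 0/1)."""
    K, n = rbf(X, X, sigma), len(y)
    M, N = [], np.zeros((n, n))
    for c in (0, 1):
        idx = np.where(y == c)[0]
        Kc = K[:, idx]                                    # n x n_c block of K
        M.append(Kc.mean(axis=1))                         # kernelized class mean
        H = np.eye(len(idx)) - np.full((len(idx), len(idx)), 1.0 / len(idx))
        N += Kc @ H @ Kc.T                                # kernelized within-class scatter
    alpha = np.linalg.solve(N + reg * np.eye(n), M[0] - M[1])
    proj = K @ alpha
    t = 0.5 * (proj[y == 0].mean() + proj[y == 1].mean())  # midpoint threshold
    return alpha, t

def kfd_predict(X_train, alpha, t, X_test, sigma=1.0):
    # Class 0 projects above the threshold because alpha points along M[0] - M[1].
    return (rbf(X_test, X_train, sigma) @ alpha < t).astype(int)
```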

Kernel Support Vector Machine (KSVM) [92] is another kernel approach popularly

used in pattern recognition. Unlike the DA-based approach, KSVM does not make

any assumptions on the underlying class distributions. Instead, it is a discriminative

approach which directly maximizes the margin between the samples defining the two

classes. In general, the larger the margin, the better the generalization performance

is. This is supported by the principle of structure risk minimization [92].


Figure 1.2: Here we show an example of two non-linearly separable class distributions,
each consisting of 3 subclasses. (a) Classification boundary of LDA. (b) SDA's solution. (c)
KDA's solution.

The kernel mapping is a key process in kernel-based approaches. Different kernel

mappings characterize different representations of the data distributions in the kernel

space, thus requiring different learning models. A kernel mapping can be specified

by a parameterized kernel function, and different kernel functions specify distinct

mappings to the kernel space. For instance, a Gaussian RBF kernel characterizes a

local mapping, whereas a polynomial kernel characterizes a global mapping, Figure

1.3. An appropriately selected kernel function may greatly improve the algorithm

performance. However, one usually does not have any prior knowledge of which

kernel should be selected given a problem at hand.
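
The two kernels contrasted above are commonly written as follows; the exact parameterization (e.g., whether the RBF width appears as σ or as γ = 1/(2σ²), and whether the polynomial kernel includes an additive constant c) differs between references, so the forms below are one common convention rather than the definitions adopted later in this dissertation.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # Local mapping: the value decays with distance, so only nearby samples matter.
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, d=2, c=1.0):
    # Global mapping: every sample contributes through the inner product.
    return (x @ z + c) ** d

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(rbf_kernel(x, z, sigma=1.0), polynomial_kernel(x, z, d=2))
```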

Even when a kernel function is determined, the process of selecting the parameters

of the kernel which map the original nonlinear problem to a linear one still remains

a big challenge. Kernel parameters play a significant role in the kernel mapping

process. Each kernel parameter specifies a model for the problem to be solved. Thus,

kernel parameter selection is also equivalent to a model selection problem. It is always

desirable that a model could achieve a good bias and variance trade-off, according to


Figure 1.3: Here we show an example of two kernel mappings. (a) The Gaussian RBF
kernel. σ is the kernel parameter. The kernel value measuring the sample similarity on x
is determined by the nearby samples of x. (b) The polynomial kernel. d is the degree of
the kernel. The kernel value measuring the sample similarity on x is determined by all the
samples.

the bias and variance decomposition [42]. If the model is made too complex, an

over-fitting to the training data may occur, i.e., low bias and high variance. Whereas if

the model is made too simple, it may under-fit the data and will thus not effectively

capture the underlying structure of the data [42], i.e., high bias and low variance.

Unfortunately, without prior knowledge on the data, it is not easy to select good

kernel parameters. Therefore, model selection becomes a fundamental problem to be

solved in kernel methods.

In this dissertation, we give a comprehensive study of the model selection problem

in kernel methods and propose several novel approaches to address this problem. We

cast the problem into two typical scenarios: classification and regression. In the

section to follow, we give a literature review of the model selection approaches in

kernel methods.

1.2 Literature review

Model selection in kernel methods has been a very active and popular research

area. Kernel-based approaches are very powerful due to their high generalization per-

formance and efficiency using the kernel trick. Although promising, a main problem

cannot be circumvented, that is, how to learn a good kernel mapping to adapt to the

data at hand. In general, different kernel mappings lead to different generalization

performances.

In the literature, various approaches for kernel learning have been proposed. Gen-

erally, they can be divided into three classes. The first class of approaches is to learn

the kernel parameters given a parameterized kernel function. In the second class of ap-

proaches, a kernel matrix is directly learned without pre-specifying a kernel function,

and a positive semi-definiteness constraint has to be imposed. One typical approach

in this class is multiple kernel learning, where some basis kernels are first built and

then the final kernel is constructed as a linear or nonlinear combination of these basis

kernels. In the third class of approaches, instead of using some traditional kernel

function, some new kernel functions are proposed to specifically tackle the problem

at hand and are expected to perform better. In the following, we will give a review of

each class of approaches in detail.

1.2.1 Kernel parameter selection

One of the most commonly used kernel parameter selection methods is the cross-validation (CV) technique [88, 42]. In this approach, the training data is divided into k parts: (k − 1) of these are used for training the algorithm with distinct values of the parameters of the kernel, and the remaining one for validating which of these values results in higher classification or prediction accuracy. This method has four major drawbacks. First, it is computationally expensive. The training stage has to be repeated

k times, and the parameter selection is based on an exhaustive search. Second, only

part of the training data is used in each fold. When doing model selection, one wants

to employ the largest possible number of training samples, since this is known to

yield better generalizations [63]. Third, it only selects the parameters from a set of

discrete values and a careful range of the parameters should be pre-specified. Finally,

the selection of k can be an issue, since it affects the trade-off between bias and

variance of the corresponding estimator [42]. In particular, if k is small, the model

may not capture the underlying structure of the data; if k is large, the model would

have a good chance to overfit the training data and result in a poor generalization

performance.
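
The CV procedure just criticized takes only a few lines to run; the sketch below performs a grid search over a discrete set of RBF widths and regularization values for an SVM classifier with scikit-learn, which makes the exhaustive-search and discrete-grid drawbacks easy to see. The particular estimator, data-set, grid values, and k = 5 are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The candidate values must be fixed in advance (third drawback), and every
# candidate is retrained k times (first drawback).
param_grid = {"gamma": np.logspace(-4, 1, 6), "C": np.logspace(-1, 3, 5)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # k = 5 folds
search.fit(X, y)
print(search.best_params_, search.best_score_)
```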

An alternative to CV is Generalized CV (GCV) [37, 96], an approach originally

defined to select the ridge parameter in ridge regression. GCV can be directly ex-

tended to do model selection with kernel approaches, as long as the hat matrix [37],

which projects the original response vector to the estimated one, can be obtained.

However, since this criterion is an approximation of the leave-one-out CV (i.e., n-fold

CV, where n is the number of training samples), the estimated result generally has

a large variance, i.e., the learned function is highly variable and dependent on the

training data, since in each fold almost the same data is used to train the model.
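
For reference, for any linear smoother with hat matrix H (so that ŷ = Hy), the GCV score is n‖(I − H)y‖² / (tr(I − H))². The sketch below evaluates it for kernel ridge regression, where H = K(K + λI)⁻¹; the RBF kernel and the grid of candidate λ values are assumptions of this illustration.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def gcv_score(K, y, lam):
    """GCV = n ||(I - H) y||^2 / tr(I - H)^2 with H = K (K + lam I)^{-1}."""
    n = len(y)
    H = K @ np.linalg.inv(K + lam * np.eye(n))
    resid = y - H @ y
    return n * float(resid @ resid) / (n - np.trace(H)) ** 2

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
K = rbf_gram(X, sigma=0.8)
best_lam = min(np.logspace(-4, 1, 12), key=lambda lam: gcv_score(K, y, lam))
```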

For classification, a popular group of methods for kernel parameter selection is

based on the idea of the between-within class ratio as Fisher had originally proposed

for LDA [28]. Here, we will refer to this as the Fisher criterion. Wang et al. [98] and

Xiong et al.[108] define such a criterion, which maximizes the between-class scatter

and minimizes the within-class scatter in the kernel space, to optimize the kernel

parameter. This criterion maximizes the class separability in the kernel space, and

it is shown generally to obtain better classification performance than CV. Similarly,

Wang et al. [97] develop another version of the Fisher criterion, defined as the trace

of the ratio between the kernel versions of the between-class scatter matrix and the

within-class scatter matrix (a.k.a. discriminant power). Due to the difficulty of direct

calculation of the discriminant power, they employ an approximated measure based

on a decomposition of the discriminant power [63]. In [49], the Fisher criterion is

reformulated as a convex optimization problem and then used to find a solution over

a convex set of kernels. Alternatively, Cristianini et al. [17] define the concept of

kernel alignment to capture the agreement between a kernel and the target data. It

is shown how this measure can be used to optimize the kernel. However, Xiong et

al. [108] show that this kernel-target alignment criterion is equivalent to maximizing

the between-class scatter, provided that the kernel matrix has been centralized and

normalized by its Frobenius norm. The major drawback with these criteria is that

they are only based on the measures of class separability. Note that the measure

for class separability is not always related to the classification error. For example,

since the Fisher criterion is based on a least-squares formulation [40], this can easily

over-weight the influence of the classes that are farthest apart [58], i.e., the classifier

will be biased to those classes which are already well separated.
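
As a concrete example of these separability-based criteria, the kernel-target alignment of [17] scores a kernel matrix K against the ideal target yyᵀ (labels in {−1, +1}) as A(K, yyᵀ) = ⟨K, yyᵀ⟩_F / (n √⟨K, K⟩_F), and the kernel parameter is chosen to maximize this score. The sketch below selects an RBF width this way; the grid of candidate widths and the toy data are illustrative choices.

```python
import numpy as np

def rbf_gram(X, sigma):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def alignment(K, y):
    """Kernel-target alignment A(K, y y^T) for labels y in {-1, +1}."""
    return float(y @ K @ y) / (len(y) * np.linalg.norm(K, "fro"))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 0.7, (40, 2)), rng.normal(+1, 0.7, (40, 2))])
y = np.repeat([-1.0, 1.0], 40)
best_sigma = max(np.logspace(-1, 1, 9), key=lambda s: alignment(rbf_gram(X, s), y))
```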

Another solution is to come up with an approximation, usually an upper bound,

for the expected generalization error. Then optimization schemes are used to minimize

such approximations to select the kernel parameters. Cristianini et al. [16] optimize

the kernel parameters by minimizing an upper bound on the generalization error as

provided by the Vapnik-Chervonenkis (VC) theory. This upper bound depends on

the radius of the smallest ball containing the training set in the feature space and

the margin between the two classes. They propose a method to dynamically adjust

the kernel parameter during the SVM learning process to find the optimal kernel

parameter which provides the best possible upper bound on the generalization error.

Chapelle et al. [11] optimize the kernel parameters by minimizing different upper

bounds on the error in the leave-one-out procedure which is proved to provide an

almost unbiased estimate of the expected generalization error. The kernel parameters

are optimized by gradient descent methods. However, these approaches have some

limitations. Usually, it is not clear whether these upper bounds are tight enough to

give a good estimate. Moreover, the estimate of the leave-one-out error based on

which bounds are derived may have high variance [42], which may deteriorate the

selection of the kernel parameters.

In another group of methods, the kernel parameters are selected by maximizing

the marginal data likelihood after reformulating the learning problem as probabilis-

tic models. Well-known approaches in this group are the Relevance Vector Machine

(RVM) [91] and the Gaussian processes [79]. RVM uses Bayesian inference to ob-

tain parsimonious solutions for regression and classification. The learning is based

on a type of Expectation-Maximization (EM) method and only local minima could

be found. Gaussian processes provide probabilistic predictions to the test samples

using the Bayesian inference framework. The hyperparameters used in the mean and

covariance functions can be directly estimated by maximizing the marginal data like-

lihood. Gold and Sollich [35] give a probabilistic interpretation of SVM classification

by introducing the application of Bayesian methods to SVM. The SVM classifier is

then viewed as the maximum a posteriori (MAP) solution of the corresponding prob-

abilistic inference problem. Then, the kernel parameters in SVM are optimized by

maximizing the data likelihood. Glasmachers and Igel [34] propose a likelihood func-

tion of the kernel parameters to robustly estimate the class conditional probabilities

based on logistic regression, and kernel parameters are optimized by the maximiza-

tion of this likelihood function using gradient ascent. A major drawback of these

approaches is that since Bayesian learning generally leads to analytically intractable

posteriors, some approximation of the posteriors has to be made. This turns out

to be computationally very expensive. In addition, estimating the priors of the hyperparameters does not have a clear solution.

1.2.2 Kernel matrix learning

The approaches for kernel parameter learning need to specify a known parame-

terized kernel function. However, given the data at hand, one usually does not have

prior knowledge of which kernel function should be used. Different kernel functions

characterize different functional mappings, thus resulting in different performances.

Rather than learning the kernel parameters of a given kernel function, one could try

to directly learn the kernel matrix, which encodes the similarity of all the training

samples.

Liu et al. [54] propose to learn a (so-called) optimal neighborhood kernel matrix by

assuming that the pre-specified kernel matrix generated from the specific application

is a noisy observation of the ideal one. Kernel learning is then based on minimizing

the difference of the pre-specified kernel matrix and the learned one. Yeung et al.

[112] propose a method for learning the kernel matrix based on maximizing a class

separability measure. Since a single kernel is known to be insufficient to describe

the data, multiple kernel learning (MKL) has attracted much attention recently [51,

87]. In [51], the kernel matrix is obtained as a linear combination of pre-specified

base kernels and the optimal coefficients can be determined by using semidefinite

programming, a branch of convex optimization that deals with the optimization of

convex functions over the convex combination of positive semidefinite matrices. Wang

et al. [101] present an alternative approach to MKL. The input data is first mapped

into m different kernel spaces by m different kernel functions and each generated

kernel space is taken as one view of the input space. Then, by using Canonical

Correlation Analysis (CCA), a technique that maximally correlates the m views, a

regularization framework is proposed to guarantee the agreement of the multiview

outputs. Yet, the selection of the base kernel functions and associated parameters is

still an important issue and remains an open problem.
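
In its simplest linear form, the MKL construction described above builds K = Σ_m β_m K_m from a set of pre-specified base kernels with nonnegative weights. The sketch below only assembles such a combination; the particular base kernels, the hand-picked weights, and the positive-semidefiniteness check are illustrative, and the learning of the weights (e.g., by semidefinite or quadratic programming as in [51, 76]) is not shown.

```python
import numpy as np

def rbf_gram(X, sigma):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def poly_gram(X, d, c=1.0):
    return (X @ X.T + c) ** d

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))

# Pre-specified base kernels (an assumption of this example).
bases = [rbf_gram(X, 0.5), rbf_gram(X, 2.0), poly_gram(X, 2)]
beta = np.array([0.5, 0.3, 0.2])        # nonnegative weights, fixed by hand here

K = sum(b * Km for b, Km in zip(beta, bases))

# A nonnegative combination of positive semi-definite matrices remains PSD.
assert np.linalg.eigvalsh(K).min() > -1e-8
```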

MKL is also applied to regression problems. In [76], MKL is applied to Support

Vector Regression (SVR). The coefficients that determine the combination of kernels

are learned using a constrained quadratic programming problem. This method was

shown to outperform CV in some applications. In another approach, the kernel pa-

rameters are selected by maximizing the marginal data likelihood after reformulating

the regression problem as probabilistic models using Bayesian inference. This ap-

proach has been used to define the well-known Relevance Vector Machine (RVM) [91]

and Gaussian processes for regression [104].

One of the disadvantages of the aforementioned approaches is that algorithms for

learning a kernel matrix often scale poorly, with running times that are cubic in the

number of the training samples; thus the application of these algorithms to large-scale

data-sets is limited. Moreover, the multiple kernel learning approach suffers from two

fundamental limitations. First, an explicit formulation to combine different kernels

has to be pre-specified. As it is common, some methods work best in one application

while others outperform it in different settings. Second, the kernel matrix can only

be searched within the space defined by these pre-specified functions. If the kernels

and their parameters are not appropriately specified, the learned kernel matrix will

not perform well in classification and regression.

1.2.3 New kernel development

A main issue of kernel methods is the selection of the kernel functions. Each kernel

characterizes a particular mapping, thus can be used in particular applications. An

appropriately selected kernel function for a given problem could result in a substantial

improvement of the generalization performance.

Although the popularly used kernels, such as the Gaussian RBF kernel and the

polynomial kernel, have shown successful performance in some applications, they have

some known limitations. For instance, the input sample should be in a vector form.

However, in many applications, the input samples could be an ensemble of vectors

and each vector could have a different length. A good example of this type of data

is the protein sequence data. Jaakkola et al. [46] propose a Fisher-based kernel to

detect remote protein homologies. A probabilistic model for each protein sequence is

first built, then the Fisher score, which measures the gradient of the log-likelihood of

the model, is used to represent the sequence sample. Then the similarity between the

two sequences is measured by the inner product of the corresponding Fisher scores. A

good feature of the Fisher kernel is that it combines an underlying generative model

and discriminant classifiers (SVM) in the feature space. Similarly, Moreno et al. [69]

develop a Kullback-Leibler (KL) divergence-based kernel for the use of multimedia

applications. Each multimedia object (a sequence of vectors) is modeled as a Gaussian

distribution, and an intermediate space mapping the object to its probability density

function (pdf) is constructed. The new kernel is evaluated based on the KL divergence

of the two pdfs. Wolf and Shashua [105] derive a more generic kernel for the instances

defined over a space of sets of vectors. Each sample object (a set of vectors) is viewed

as a linear subspace and the kernel is evaluated by measuring the principal angles

between two linear subspaces. This kernel is successfully applied to face recognition

from video.

Some kernels have been developed to be used in some particular applications. For

instance, Odone et al. [72] propose two kernels which are used for images. The images

are first represented as binary strings and then a kernel, as a similarity measure, is

used to operate on them. They further show that the image similarity measures given

by a histogram intersection and the Hausdorff distance can be modified to serve as

kernels. For text classification, Lodhi et al. [55] propose a string kernel to encode the

similarity between the strings. The kernel is generated by using all the subsequences

of length k. Each subsequence forms a dimension in the feature space and weighted

by an exponentially decaying factor of their full length in the text, thus emphasizing

those occurrences that are close to contiguous.

1.3 Research Contributions

From the literature review of model selection in kernel methods, several important

questions are raised. First, in classification, the original goal of a kernel method is to

find a mapping such that the samples in the kernel space could be linearly classified.

To our surprise, no approach thus far has explicitly solved this problem. In other

words, the classifier in the kernel space is not ensured to be linear. Thus, our goal is

to define a first criterion for kernel optimization such that the linear classifier in the

kernel space can be obtained.

Second, in a kernel-based regression problem, model selection plays a significant

role in the regression performance. How to achieve a good balance between the

model fit and model complexity remains a big challenge. We propose an approach

for model selection by adopting multiobjective optimization. By doing so, the model

fit is reduced while the model complexity is kept in check. Finally, in the multiple

kernel learning approaches, an explicit combination of different kernels should be pre-

specified. Is there a way to learn a kernel matrix without specifying an explicit kernel

combination? We explore this idea by using Genetic Algorithms.

In this dissertation, we propose approaches for model selection in kernel methods

in supervised learning. Our approaches are theoretically justified and have been suc-

cessfully used in several applications. In particular, contributions of this dissertation

are as follows:

We develop two criteria to optimize the kernel parameters given a kernel func-

tion based on the idea of Bayes optimality. In the first criterion, kernel pa-

rameters are optimized such that the classification in the kernel space is Bayes

optimal. Thus, this solves the original goal of the kernel mapping: the class

distributions in the kernel space are linearly separable. We achieve this by max-

imizing the homoscedasticity and separability of the pairwise class distributions

simultaneously in the kernel space. We further relax the single Gaussian as-

sumption for class distributions by using a mixture of Gaussians, thus allowing

more flexibility in modeling the distributions. In the second criterion, instead

of searching for a linear classifier, we directly minimize the Bayes error over all

the kernel mappings. Specifically, we present an effective measure to approxi-

mate the Bayes accuracy (defined as one minus Bayes error) in the kernel space.

The optimal kernel is then learned by maximizing this Bayes accuracy over all

kernel representations. Both criteria are shown to outperform the state of the

art kernel optimization approaches.

We propose a model selection framework in kernel-based regression methods.

In this framework, model fit and model complexity in the kernel space are first

directly derived from a decomposition of the generalization error of the learned

function. Then multiobjective optimization is employed to learn a good balance

between model fit and model complexity. A modified ε-constraint approach is

designed such that the Pareto-optimal solution can be achieved. We further

show that our approach can learn not only the kernel parameters, but also those of

a kernel-based regression method.

Since a pre-specified kernel function is not appropriate for a general problem,

we propose to directly learn a kernel matrix using Genetic Algorithm (GA). By

doing so, we eliminate the need for defining a unique way of combining different

kernel matrices, thus allowing more flexibility in modeling a general problem. To

achieve our goal, we define a novel representation used in genetic algorithm.

The kernel matrices are then iteratively modified until the matrix providing the

smallest classification error is obtained. To map test feature vectors, we define

a regression-based approach to determine the underlying function represented

by the selected kernel matrix. We provide comparative results against the state

of the art methods including multiple kernel learning and transductive learning.

The results show the superiority of the proposed approach. We further extend

our method to work with regression and demonstrate its effectiveness.

We propose a family of kernels called Local-density Adaptive kernels. Such

kernels measure the sample similarities by taking into account local density

information. The shape of likelihood evaluation in the proposed kernels can

adaptively vary for different local regions based on a measure of the weighted

local variance. Also, the shape varies in an implicit way such that they are

ensured to be Mercer kernels (i.e., positive semi-definite kernels). The proposed

kernels are shown to perform better than the traditional fixed-shape kernels like

Gaussian RBF kernel and Mahalanobis kernel in several applications.

The rest of this dissertation is organized as follows. The first two criteria are

presented in Chapter 2. Chapter 3 derives a model selection framework based on

multiobjective optimization in regression. In Chapter 4, we derive the Local-density

Adaptive kernels. In Chapter 5, we propose a genetic-based approach to learn a kernel

matrix for both classification and regression. Conclusions and future work are given

in Chapter 6.

CHAPTER 2

KERNEL LEARNING IN DISCRIMINANT ANALYSIS

2.1 Introduction

Discriminant Analysis (DA) is one of the most popular approaches for feature ex-

traction with broad applications in, for example, computer vision and pattern recog-

nition [32], gene expression analysis [63] and paleontology [62]. The problem with

DA algorithms is that each of them makes assumptions on the underlying class dis-

tributions. That is, they assume the class distributions are homoscedastic, $\Sigma_i = \Sigma_j$, $\forall i, j$. This is rarely the case in practice. To resolve this problem, one can first map

the original data distributions (with unequal covariances) into a space where these

become homoscedastic. This mapping may however result in a space of very large

dimensionality. To prevent this, one usually employs the kernel trick [84, 96]. In the

kernel trick, the mapping is only intrinsic, yielding a space of the same dimensionality

as that of the original representation while still eliminating the nonlinearity of the

data by making the class distributions homoscedastic. This is the underlying idea in

Kernel DA (KDA) [67, 5] and variants [110, 109, 40].

The approach described in the preceding paragraph resolves the problem of nonlin-

early separable Normal distributions, but still assumes each class can be represented

by a single Normal distribution. In theory, this can also be learned by the kernel,

since multimodality introduces nonlinearities in the classifier. In practice, however, it

makes the problem of finding the appropriate kernel much more challenging. One way

to add flexibility to the kernel is to allow for each class to be subdivided into several

subclasses. This is the underlying idea behind Subclass DA (SDA) [116]. However,

while SDA resolves the problem of multimodally distributed classes, it assumes that

these subclass divisions are linearly separable. Note that SDA can actually resolve

the problem of nonlinearly separable classes as long as there is a subclass division

that results in linearly separable subclasses yielding a non-linear classifier. The ap-

proach will fail when there is no such division. To resolve this problem, we need to

derive a subclass-based approach that can deal with nonlinearly separable subclasses

[12]. This can be done with the help of a kernel map. In this approach, we need

to find a kernel which maps the subclass division into a linearly separable set. We

refer to this approach as Kernel SDA (KSDA). Note that KSDA has two unknowns:

the number of subclasses and the parameter(s) of the kernel. Hence, finding the

appropriate kernel parameters will generally be easier, a point we will formally show

in the present chapter.

The kernel parameters are the ones that allow us to map a nonlinearly separable

problem into a linear one [84]. Surprisingly, to the best of our knowledge, there is

not a single method in kernel DA designed to find the kernel parameters which map

the problem to a space where the class distributions are linearly separable. To date,

the most employed technique is k-fold cross-validation (CV). In CV, one uses a large

percentage of the data to train the kernel algorithm. Then, we use the remaining

(smaller) percentage of the training samples to test how the classification varies when

we use different values in the parameters of the kernel. The parameters yielding the

highest recognition rates are kept. More recently, [98, 49] showed how one can employ

the Fisher criterion (i.e., the maximization of the ratio between the kernel between-

class scatter matrix and the kernel within-class scatter matrix) to select the kernel

parameters. These approaches aim to maximize classification accuracy within the

training set. However, neither of them aims to solve the original goal of the kernel

map to find a space where the class distributions (or the samples of different classes)

can be separated linearly. Moreover, the Fisher criterion is based on the measures

of class separability. Note that the measure for the class separability is not always

related to the classification error.

In this chapter, we propose two approaches to learn the kernel parameters given

a kernel function. First, we derive an approach whose goal is to specifically map

the original class (or subclass) distributions into a kernel space where these are best

separated by a hyperplane (w.r.t. Bayes). The proposed approach also aims to

maximize the distance between the distributions of different classes, thus maximizing

generalization. We apply the derived approach to three kernel versions of DA, namely

LDA, Nonparametric DA (NDA) and SDA. We show that the proposed techniques

generally achieve higher classification accuracies than the CV and Fisher criteria

defined in the preceding paragraph. In the second approach, we derive a criterion for

selecting the parameters by minimizing the Bayes classification error. To achieve this,

we define a function measuring the Bayes accuracy (i.e., one minus the Bayes error)

in the kernel space. We then show how this function can be efficiently maximized

using gradient ascent. It should be emphasized that this objective function directly

minimizes the classification error, which makes the proposed criterion very powerful.

We will also illustrate how we can employ the same criterion for the selection of other

parameters in discriminant analysis. In particular, we demonstrate the uses of the

derived criterion in the selection of the kernel parameters and the number of subclasses

in KSDA. Before we present the derivations of our approaches, we introduce a general

formulation of DA common to most variants. We also derive kernel versions for NDA

and SDA.

2.2 The metrics of discriminant analysis

DA is a supervised technique for feature extraction and classification. Theoreti-

cally, its advantage over unsupervised techniques is that it provides the repre-

sentation where the underlying class distributions are best separated. Unfortunately,

due to the number of possible solutions, this goal is not always fulfilled in practice

[63]. With infinite time or computational power, one could always find the optimal

representation. With finite time and resources, it is generally impossible to account

for all the possible linear combinations of features, let alone a set of nonlinear com-

binations. This means that one needs to define criteria that can find an appropriate

solution under some general, realistic assumptions.

The least-squares extension of Fisher's criterion [28, 32] is arguably the most

known. In this solution, LDA employs two symmetric, positive semi-definite matri-

ces, each defining a metric [63]. One of these metrics should measure within-class

differences and, as such, should be minimized. The other metric should account for

between-class dissimilarity and should thus be maximized. Classical choices for the

first metric are the within-class scatter matrix $S_W$ and the sample covariance matrix $\Sigma_X$, while the second metric is usually given by the between-class scatter matrix $S_B$.

The sample covariance matrix is defined as $\Sigma_X = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^T$, where $X = \{x_1, \dots, x_n\}$ are the $n$ training samples, $x_i \in \mathbb{R}^p$, and $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean. The between-class scatter matrix is given by $S_B = \sum_{i=1}^{C} p_i\,(\mu_i - \mu)(\mu_i - \mu)^T$, where $\mu_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$ is the sample mean of class $i$, $x_{ij}$ is the $j$th sample of class $i$, $n_i$ is the number of samples in that class, $C$ is the number of classes, and $p_i = n_i/n$ is the prior of class $i$. LDA's solution is then given by the generalized eigenvalue decomposition equation $\Sigma_X^{-1} S_B \mathbf{V} = \mathbf{V}\Lambda$, where the columns of $\mathbf{V}$ are the eigenvectors, and $\Lambda$ is a diagonal matrix of corresponding eigenvalues.
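As a concrete illustration of this decomposition, the following minimal sketch (in Python with NumPy/SciPy; the function and variable names are ours and are not taken from any cited implementation) builds $S_B$ and $\Sigma_X$ from labeled data and solves the generalized eigenvalue problem above.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, d):
    """Compute an LDA projection matrix V from samples X (n x p) and labels y.

    Solves Sigma_X^{-1} S_B V = V Lambda via the symmetric generalized
    eigenproblem S_B v = lambda Sigma_X v (scipy.linalg.eigh).
    """
    n, p = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma_X = Xc.T @ Xc / n                      # sample covariance matrix
    S_B = np.zeros((p, p))                       # between-class scatter
    for c in np.unique(y):
        Xi = X[y == c]
        p_i = Xi.shape[0] / n
        diff = (Xi.mean(axis=0) - mu).reshape(-1, 1)
        S_B += p_i * (diff @ diff.T)
    # small ridge added only for numerical stability of the generalized problem
    evals, evecs = eigh(S_B, Sigma_X + 1e-8 * np.eye(p))
    order = np.argsort(evals)[::-1]              # largest eigenvalues first
    return evecs[:, order[:d]]

# Example: project 3-class toy data onto 2 discriminant directions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(30, 4)) for m in (0, 3, 6)])
y = np.repeat([0, 1, 2], 30)
V = lda_projection(X, y, d=2)
Z = X @ V
```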

To loosen the parametric restriction on the above defined metrics, Fukunaga and

Mantock defined NDA [31], where the between-class scatter matrix is changed to a
non-parametric version, $S_b = \sum_{i=1}^{C}\sum_{\substack{j=1 \\ j\neq i}}^{C}\sum_{l=1}^{n_i} \alpha_{ijl}\,(x_{il} - \mu_{jil})(x_{il} - \mu_{jil})^T$, where $\mu_{jil}$ is the sample mean of the $k$-nearest samples to the sample $x_{il}$ that do not belong to class $i$, and $\alpha_{ijl}$ is a scale factor that deemphasizes large values (i.e., outliers). Alternatively,

Friedman [30] proposed to add a regularizing parameter to the within-class measure,

allowing for the minimization of the generalization error. This regularizing parame-

ter can be learned using CV, yielding the method Regularized DA (RDA). Another

variant of LDA is given by Loog et al. [58], who introduced a weighted version of

the metrics in an attempt to downplay the roles of the class distributions that are

farthest apart. More formally, they noted that the above introduced Fisher criterion
for LDA can be written as $\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\, \omega_{ij}\, \mathrm{tr}\!\left(\left(V^T S_W V\right)^{-1} V^T S_{ij} V\right)$, where $S_{ij} = (\mu_i - \mu_j)(\mu_i - \mu_j)^T$, and $\omega_{ij}$ are the weights. In Fisher's LDA, all $\omega_{ij} = 1$. Loog et al. suggest making these weights inversely proportional to their pairwise accuracy (defined as one minus the Bayes error). Similarly, we can define a weighted version of the within-class scatter matrix $S_W = \sum_{c=1}^{C}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c} \omega_{ckl}\,(x_{ck} - x_{cl})(x_{ck} - x_{cl})^T$.

In LDA, $\omega_{ckl}$ are all equal to one. In its weighted version, $\omega_{ckl}$ are defined according to the importance of each sample in classification. Using the same notation, we can also define a nonparametric between-class scatter matrix as $S_B = \sum_{i=1}^{C-1}\sum_{j=1}^{n_i}\sum_{k=i+1}^{C}\sum_{l=1}^{n_k} \omega_{ijkl}\,(x_{ij} - x_{kl})(x_{ij} - x_{kl})^T$, where $\omega_{ijkl}$ are the weights. Note that in these two definitions, the priors have been combined with the weights to provide a more compact formulation.

All the methods introduced in the preceding paragraphs assume the class distribu-

tions are unimodal Gaussians. To address this limitation, Subclass DA (SDA) [116]

defines a multimodal between-subclass scatter matrix,

$$\Sigma_B = \sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\, (\mu_{ij} - \mu_{kl})(\mu_{ij} - \mu_{kl})^T, \qquad (2.1)$$

where $p_{ij} = n_{ij}/n$ is the prior of the $j$th subclass of class $i$, $n_{ij}$ is the number of samples in the $j$th subclass of class $i$, $H_i$ is the number of subclasses in class $i$, $\mu_{ij} = \frac{1}{n_{ij}}\sum_{k=1}^{n_{ij}} x_{ijk}$ is the sample mean of the $j$th subclass in class $i$, and $x_{ijk}$ denotes the $k$th sample in the $j$th subclass in class $i$.

The algorithms summarized thus far assume the class (or subclass) distributions

are homoscedastic. To deal with heteroscedastic (i.e., non-homoscedastic) distribu-

tions, [57] defines a within-class similarity metric using the Chernoff distance, yielding

an algorithm we will refer to as Heteroscedastic LDA (HLDA). Alternatively, one can

use an embedding approach such as Locality Preserving Projection (LPP) [43]. LPP

finds that subspace where the structure of the data is locally preserved, allowing for

nonlinear classifications. An alternative to these algorithms is to employ a kernel

function which intrinsically maps the original data distributions to a space where

these adapt to the assumptions of the approach in use. KDA [67, 5] redefines the

within- and between-class scatter matrices in the kernel space to derive feature ex-

traction algorithms that are nonlinear in the original space but linear in the kernel

one. This is achieved by means of a mapping function $\phi(.): \mathbb{R}^p \rightarrow \mathcal{F}$. The sample covariance and between-class scatter matrices in the kernel space are given by $\Sigma_X^{\phi} = \frac{1}{n}\sum_{i=1}^{n}(\phi(x_i) - \mu^{\phi})(\phi(x_i) - \mu^{\phi})^T$ and $S_B^{\phi} = \sum_{i=1}^{C} p_i\,(\mu_i^{\phi} - \mu^{\phi})(\mu_i^{\phi} - \mu^{\phi})^T$, where $\mu^{\phi} = \frac{1}{n}\sum_{i=1}^{n}\phi(x_i)$ is the kernel sample mean, and $\mu_i^{\phi} = \frac{1}{n_i}\sum_{j=1}^{n_i}\phi(x_{ij})$ is the kernel sample mean of class $i$.

Unfortunately, the dimensionality of $\mathcal{F}$ may be too large. To bypass this problem, one generally uses the kernel trick, which works as follows. Let $A^{\phi}$ and $B^{\phi}$ be two metrics in the kernel space and $V$ the projection matrix obtained by $A^{\phi} V = B^{\phi} V \Lambda$. We know from the Representer's Theorem [96] that the resulting projection matrix can be defined as a linear combination of the samples in the kernel space $\phi(X)$ with the coefficient matrix $\Gamma$, i.e., $V = \phi(X)\Gamma$. Hence, to calculate the projection matrix, we need to obtain the coefficient matrix $\Gamma$ by solving $A\,\Gamma = B\,\Gamma\,\Lambda$, where $A = \phi(X)^T A^{\phi}\,\phi(X)$ and $B = \phi(X)^T B^{\phi}\,\phi(X)$ are the two metrics that need to be maximized and minimized. Using this trick, the metric for $\Sigma_X^{\phi}$ is given by $B_{\Sigma_X} = \phi(X)^T \Sigma_X^{\phi}\,\phi(X) = \frac{1}{n}\sum_{i=1}^{n}\phi(X)^T(\phi(x_i) - \mu^{\phi})(\phi(x_i) - \mu^{\phi})^T\phi(X) = \frac{1}{n} K (I - P_n) K$, where $K = \phi(X)^T\phi(X)$ is the kernel (Gram) matrix and $P_n$ is the $n \times n$ matrix with each of its elements equal to $1/n$.

Similarly, $B_{S_W} = \frac{1}{C}\sum_{i=1}^{C}\phi(X)^T \Sigma_i^{\phi}\,\phi(X) = \frac{1}{C}\sum_{i=1}^{C}\frac{1}{n_i} K_i (I - P_{n_i}) K_i^T$, where $\Sigma_i^{\phi} = \frac{1}{n_i}\sum_{j=1}^{n_i}(\phi(x_{ij}) - \mu_i^{\phi})(\phi(x_{ij}) - \mu_i^{\phi})^T$ is the kernel within-class covariance matrix of class $i$, and $K_i = \phi(X)^T\phi(X_i)$ is the subset of the kernel matrix for the samples in class $i$. The metric for $S_B^{\phi}$ can be obtained as $A_{S_B} = \sum_{i=1}^{C} p_i\,(K_i\mathbf{1}_{n_i} - K\mathbf{1}_n)(K_i\mathbf{1}_{n_i} - K\mathbf{1}_n)^T$, where $\mathbf{1}_{n_i}$ is a vector with all elements equal to $1/n_i$. The coefficient matrix for KDA is given by $B_{KDA}^{-1} A_{KDA}\,\Gamma_{KDA} = \Gamma_{KDA}\Lambda_{KDA}$, where $B_{KDA}$ can be either $B_{\Sigma_X}$ or $B_{S_W}$, and $A_{KDA} = A_{S_B}$.
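To make these kernel trick computations concrete, the following is a small illustrative sketch (our own code and naming, not the dissertation's implementation) that forms the RBF Gram matrix and the metrics $B_{\Sigma_X}$ and $A_{S_B}$ defined above.

```python
import numpy as np

def rbf_gram(X, sigma):
    """Gram matrix K with k(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / sigma)

def kda_metrics(K, y):
    """Return (B, A): B = (1/n) K (I - P_n) K and A = A_{S_B} from the text."""
    n = K.shape[0]
    P_n = np.full((n, n), 1.0 / n)
    B = K @ (np.eye(n) - P_n) @ K / n
    one_n = np.full(n, 1.0 / n)
    A = np.zeros((n, n))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        p_i = len(idx) / n
        one_ni = np.full(len(idx), 1.0 / len(idx))
        diff = K[:, idx] @ one_ni - K @ one_n     # K_i 1_{n_i} - K 1_n
        A += p_i * np.outer(diff, diff)
    return B, A

# The KDA coefficients Gamma then solve A Gamma = B Gamma Lambda,
# e.g., with scipy.linalg.eig(A, B); a small regularizer on B helps in practice.
```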

We can similarly derive kernel approaches for the other methods introduced above.

For example, in Kernel NDA (KNDA), the metric $A_{KNDA}$ is obtained by defining its corresponding scatter matrix in the kernel space as

$$A_{KNDA} = \phi(X)^T S_b^{\phi}\,\phi(X) = \sum_{i=1}^{C}\sum_{\substack{j=1 \\ j\neq i}}^{C}\sum_{l=1}^{n_i} \alpha_{ijl}\,(k_{il} - M_{jil}\mathbf{1}_k)(k_{il} - M_{jil}\mathbf{1}_k)^T,$$

where $k_{il} = \phi(X)^T\phi(x_{il})$ is the kernel space representation of the sample $x_{il}$, $M_{jil} = \phi(X)^T\phi(X_{jil})$ is the kernel matrix of the $k$-nearest neighbors of $x_{il}$, $X_{jil}$ is a matrix whose columns are the $k$-nearest neighbors of $x_{il}$, and $\alpha_{ijl}$ is the normalizing factor computed in the kernel space.

Kernel SDA (KSDA) maximizes the kernel between-subclass scatter matrix $\Sigma_B^{\phi}$ [12]. This matrix is given by replacing the subclass means of (2.1) with the kernel subclass means $\mu_{ij}^{\phi} = \frac{1}{n_{ij}}\sum_{k=1}^{n_{ij}}\phi(x_{ijk})$. Now, we can use the kernel trick to obtain the matrix to be maximized,

$$A_{KSDA} = \sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\, (K_{ij}\mathbf{1}_{ij} - K_{kl}\mathbf{1}_{kl})(K_{ij}\mathbf{1}_{ij} - K_{kl}\mathbf{1}_{kl})^T,$$

where $K_{ij} = \phi(X)^T\phi(X_{ij})$ is the kernel matrix of the samples in the $j$th subclass of class $i$, and $\mathbf{1}_{ij}$ is a $n_{ij} \times 1$ vector with all elements equal to $1/n_{ij}$.

If we are to successfully employ the above derived approaches in practical settings,

it is imperative that we define criteria to optimize these parameters. The classical

approach to determine the parameters of the kernel is CV, where we divide the train-

ing data into $k$ parts: $(k-1)$ of them for training the algorithm with distinct values
for the parameters of the kernel, and the remaining one for validating which of these

values results in higher (average) classification rates. This solution has three major

drawbacks. First, the kernel parameters are only optimized for the training data, not

the distributions [117]. Second, CV is computationally expensive and may become

very demanding for large data-sets. Third, not all the training data can be used to op-

timize the parameters of the kernel. To avoid these problems, [98] defines a criterion

to maximize the kernel between-class difference and minimize the kernel within-class

scatter as Fisher had originally proposed but now applied to the selection of the

kernel parameters. This method was shown to yield higher classification accuracies

than CV in a variety of problems. A related approach [49] is to redefine the kernelized

Fisher criterion as a convex optimization problem. Alternatively, Ye et al. [111] have

proposed a kernel version of RDA where the kernel is learned as a linear combination

of a set of pre-specified kernels. However, these approaches do not guarantee that

the kernel or kernel parameters we choose will result in homoscedastic distributions

in the kernel space. This would be ideal, because it would guarantee that the Bayes

classifier (which is the one with the smallest error in that space) is linear.
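For reference, a minimal sketch of the cross-validation baseline discussed above is given below (our own illustration; `train_and_score` is an assumed user-supplied routine that trains a kernel DA model with a given sigma and returns the validation accuracy).

```python
import numpy as np

def cv_select_sigma(X, y, sigmas, train_and_score, k=5, seed=0):
    """k-fold CV: return the sigma with the highest average validation accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    best_sigma, best_acc = None, -np.inf
    for sigma in sigmas:
        accs = []
        for f in range(k):
            val = folds[f]
            trn = np.concatenate([folds[g] for g in range(k) if g != f])
            accs.append(train_and_score(X[trn], y[trn], X[val], y[val], sigma))
        mean_acc = float(np.mean(accs))
        if mean_acc > best_acc:
            best_sigma, best_acc = sigma, mean_acc
    return best_sigma, best_acc
```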

In the sections to follow, we will present our approaches in kernel optimization.

2.3 Homoscedastic criterion

The goal of the first criterion is to find a kernel which maps the original class

distributions to homoscedastic ones while keeping them as far apart from each other

as possible. This criterion is related to the approach presented in [41] where the goal

was to optimize a distinct version of homoscedasticity defined in the complex sphere.

The criterion we derive here could be extended to work in the complex sphere and is

thus a more general approach.

2.3.1 Maximizing homoscedasticity

To derive our homoscedastic criterion, we need to answer the following question.

What is a good measure of homoscedasticity? That is, we need to define a criterion

which is maximized when all class covariances are identical. The value of the criterion

should also decrease as the distributions become more different. We now present a

key result applicable to this end.

Theorem 1. Let $\Sigma_i^{\phi}$ and $\Sigma_j^{\phi}$ be the kernel covariance matrices of two Normal distributions in the kernel space defined by the function $\phi(.)$. Then, $Q_1 = \frac{\mathrm{tr}(\Sigma_i^{\phi}\Sigma_j^{\phi})}{\mathrm{tr}(\Sigma_i^{\phi 2}) + \mathrm{tr}(\Sigma_j^{\phi 2})}$ takes the maximum value of $.5$ when $\Sigma_i^{\phi} = \Sigma_j^{\phi}$, i.e., when the two Normal distributions are homoscedastic in the kernel space.

Proof. $\Sigma_i^{\phi}$ and $\Sigma_j^{\phi}$ are two $p \times p$ positive semi-definite matrices with spectral decompositions $\Sigma_i^{\phi} = V_i \Lambda_i V_i^T$, where $V_i = \left(v_{i1}, \dots, v_{ip}\right)$ and $\Lambda_i = \mathrm{diag}\left(\lambda_{i1}, \dots, \lambda_{ip}\right)$ are the eigenvector and eigenvalue matrices.

The denominator of $Q_1$, $\mathrm{tr}(\Sigma_i^{\phi 2}) + \mathrm{tr}(\Sigma_j^{\phi 2})$, only depends on the selection of the kernel. For a fixed kernel (and fixed kernel parameters), its value is constant regardless of any divergence between $\Sigma_i^{\phi}$ and $\Sigma_j^{\phi}$. Hence, $\mathrm{tr}(\Sigma_i^{\phi 2}) + \mathrm{tr}(\Sigma_j^{\phi 2}) = \mathrm{tr}(\Lambda_i^2) + \mathrm{tr}(\Lambda_j^2)$. We also know that $\mathrm{tr}(\Sigma_i^{\phi}\Sigma_j^{\phi}) \leq \mathrm{tr}(\Lambda_i\Lambda_j)$, with the equality holding when $V_i^T V_j = I$ [90], i.e., the eigenvectors of $\Sigma_i^{\phi}$ and $\Sigma_j^{\phi}$ are not only the same but are in the same order, $v_{ik} = v_{jk}$. Using these two results, we can write

$$Q_1 \leq \frac{\sum_{m=1}^{p} \lambda_{im}\lambda_{jm}}{\sum_{m=1}^{p} \lambda_{im}^2 + \sum_{m=1}^{p} \lambda_{jm}^2}.$$

Now, let us define every eigenvalue of $\Sigma_i^{\phi}$ as a multiple of those of $\Sigma_j^{\phi}$, i.e., $\lambda_{im} = k_m \lambda_{jm}$, $k_m \geq 0$, $m = 1, \dots, p$. This allows us to rewrite our criterion as

$$Q_1 \leq \frac{\sum_{m=1}^{p} k_m \lambda_{jm}^2}{\sum_{m=1}^{p} \lambda_{jm}^2 (k_m^2 + 1)}.$$

From the above equation, we see that $Q_1 \geq 0$, since all its variables are positive. The maximum value of $Q_1$ will be attained when all $k_m = 1$, which yields $Q_1 = .5$. We now note that having all $k_m = 1$ implies that the eigenvalues of the two covariance matrices are the same. We also know that the maximum of $Q_1$ can only be reached when the eigenvectors are the same and in the same order, as stated above. This means that the two Normal distributions are homoscedastic in the kernel space defined by $\phi(.)$ when $Q_1 = .5$.
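A short numerical sketch of $Q_1$ (our own code) makes the result easy to verify: the criterion reaches $.5$ only when the two covariance matrices coincide.

```python
import numpy as np

def Q1(S1, S2):
    """Homoscedasticity measure tr(S1 S2) / (tr(S1^2) + tr(S2^2)); max value .5."""
    return np.trace(S1 @ S2) / (np.trace(S1 @ S1) + np.trace(S2 @ S2))

A = np.diag([3.0, 1.0, 0.5])
print(Q1(A, A))            # 0.5, the two distributions are homoscedastic
print(Q1(A, 2.0 * A))      # 0.4 < 0.5, eigenvalues differ by a factor of two
```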

From the above result, we see that we can already detect when two distributions

are homoscedastic in a kernel space. This means that for a given kernel function,

we can find those kernel parameters which give us Q1 = .5. Note that the closer we

get to this maximum value, the more similar the two distributions ought to be, since

their eigenvalues will become closer to each other. To show this, we would now like

to prove that when the value of Q1 increases, then the divergence between the two

distributions decreases.

Divergence is a classical mechanism used to measure the similarity between two

distributions. A general type of divergence employed to calculate the similarity be-

tween samples from convex sets is the Bregman divergence [8]. Formally, for a given

continuously-differentiable strictly convex function $G: \mathbb{R}^{p\times p} \rightarrow \mathbb{R}$, the Bregman divergence over real symmetric matrices is defined as

$$B_G(\mathbf{X}, \mathbf{Y}) = G(\mathbf{X}) - G(\mathbf{Y}) - \mathrm{tr}\!\left(\nabla G(\mathbf{Y})^T (\mathbf{X} - \mathbf{Y})\right), \qquad (2.2)$$

where $\mathbf{X}, \mathbf{Y} \in \{\mathbf{Z} \mid \mathbf{Z} \in \mathbb{R}^{p\times p},\ \mathbf{Z} = \mathbf{Z}^T\}$, and $\nabla$ is the gradient.

Note that the definition given above for the Bregman divergence is very general. In

fact, many other divergence measures (such as the Kullback-Leibler) as well as several

commonly employed distances (e.g. Mahalanobis and Frobenius) are a particular case

of Bregman's. Consider the case where $G(\mathbf{X}) = \mathrm{tr}(\mathbf{X}^T\mathbf{X})$, which computes the squared Frobenius norm. In this case, the Bregman divergence is $B_G(\Sigma_1, \Sigma_2) = \mathrm{tr}(\Sigma_1^2) + \mathrm{tr}(\Sigma_2^2) - 2\,\mathrm{tr}(\Sigma_1\Sigma_2)$, where, as above, $\Sigma_i$ are the covariance matrices of the two distributions that we wish to compare. We can also rewrite this result using the covariances in the kernel space as,

$$B_G(\Sigma_1^{\phi}, \Sigma_2^{\phi}) = \mathrm{tr}(\Sigma_1^{\phi 2}) + \mathrm{tr}(\Sigma_2^{\phi 2}) - 2\,\mathrm{tr}(\Sigma_1^{\phi}\Sigma_2^{\phi}),$$

where now $G(\mathbf{X}) = \mathrm{tr}(\phi(\mathbf{X})^T\phi(\mathbf{X}))$.
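A minimal numerical check of this particular Bregman divergence (our own sketch, using $G(\mathbf{X}) = \mathrm{tr}(\mathbf{X}^T\mathbf{X})$) is given below; for symmetric matrices it coincides with the squared Frobenius norm of the difference.

```python
import numpy as np

def bregman_frobenius(S1, S2):
    """B_G(S1, S2) with G(X) = tr(X^T X): tr(S1^2) + tr(S2^2) - 2 tr(S1 S2).

    For symmetric S1, S2 this equals the squared Frobenius norm ||S1 - S2||_F^2.
    """
    return np.trace(S1 @ S1) + np.trace(S2 @ S2) - 2.0 * np.trace(S1 @ S2)

S1 = np.diag([2.0, 1.0])
S2 = np.diag([1.0, 1.0])
print(bregman_frobenius(S1, S2))               # 1.0
print(np.linalg.norm(S1 - S2, 'fro') ** 2)     # 1.0, same value
```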

Note that to decrease the divergence (i.e., the value of $B_G$), we need to minimize $\mathrm{tr}(\Sigma_1^{\phi 2}) + \mathrm{tr}(\Sigma_2^{\phi 2})$ and/or maximize $\mathrm{tr}(\Sigma_1^{\phi}\Sigma_2^{\phi})$. The more we lower the former and increase the latter, the smaller the Bregman divergence will be. Similarly, when we decrease the value of $\mathrm{tr}(\Sigma_1^{\phi 2}) + \mathrm{tr}(\Sigma_2^{\phi 2})$ and/or increase that of $\mathrm{tr}(\Sigma_1^{\phi}\Sigma_2^{\phi})$, we make the value of $Q_1$ larger. Hence, as the value of our criterion $Q_1$ increases, the Bregman divergence between the two distributions decreases, i.e., the two distributions become more alike. This result is illustrated in Fig. 2.1. We can formally summarize this result as follows.

Theorem 2. Maximizing $Q_1$ is equivalent to minimizing the Bregman divergence $B_G(\Sigma_1^{\phi}, \Sigma_2^{\phi})$ between the two kernel covariance matrices $\Sigma_1^{\phi}$ and $\Sigma_2^{\phi}$, where $G(\mathbf{X}) = \mathrm{tr}(\phi(\mathbf{X})^T\phi(\mathbf{X}))$.

Figure 2.1: Three examples of the use of the homoscedastic criterion, $Q_1$. The examples are for two Normal distributions with equal covariance matrix up to scale and rotation. (a) The value of $Q_1$ decreases as the angle $\theta$ increases. The 2D rotation $\theta$ between the two distributions is in the x axis. The value of $Q_1$ is in the y axis. (b) When $\theta = 0^{\circ}$, the two distributions are homoscedastic, and $Q_1$ takes its maximum value of $.5$. Note how for distributions that are close to homoscedastic (i.e., $\theta \approx 0^{\circ}$), the value of the criterion remains high. (c) When $\theta = 45^{\circ}$, the value has decreased to about $.4$. (d) By $\theta = 90^{\circ}$, $Q_1 \approx .3$.

We have now shown that the criterion Q1 increases as any two distributions be-

come more similar to one another. We can readily extend this result to the multiple

distribution case,

$$Q_1(\phi) = \frac{2}{C(C-1)}\sum_{i=1}^{C-1}\sum_{k=i+1}^{C} \frac{\mathrm{tr}(\Sigma_i^{\phi}\Sigma_k^{\phi})}{\mathrm{tr}(\Sigma_i^{\phi 2}) + \mathrm{tr}(\Sigma_k^{\phi 2})}, \qquad (2.3)$$

where $\Sigma_i^{\phi}$ is the sample covariance matrix of the $i$th class. This criterion measures the average homoscedasticity of all pairwise class distributions.

This criterion can be directly used in KDA, KNDA and others. Moreover, the same criterion can be readily extended to work in KSDA,

$$Q_1(\phi, H_1, \dots, H_C) = \frac{1}{h}\sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} \frac{\mathrm{tr}(\Sigma_{ij}^{\phi}\Sigma_{kl}^{\phi})}{\mathrm{tr}(\Sigma_{ij}^{\phi 2}) + \mathrm{tr}(\Sigma_{kl}^{\phi 2})},$$

where $\Sigma_{ij}^{\phi}$ is the sample covariance matrix of the $j$th subclass of class $i$, and $h$ is the number of summing terms.

The reason we needed to derive the above criterion is because, in the multi-class

case, the addition of the Bregman divergences would cancel each other out. Moreover,

the derived criterion is scale invariant, while Bregman is not.

It may now seem that the criterion Q1 is ideal for all kernel versions of DA. To

study this further, let us define a particular kernel function. An appropriate kernel

is the RBF (Radial Basis Function), because it is specifically tailored for Normal

distributions. We will now show that, although homoscedasticity guarantees that

the Bayes classifier is linear in this RBF kernel space, it does not guarantee that

the class distributions will be separable. In fact, it can be shown that Q1 may

favor a kernel map where all (sub)class distributions become the same, i.e., identical

covariance matrix and mean. This is indeed a particular but useless case of homoscedasticity

in classification problems.
Theorem 3. The RBF kernel is $k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{\sigma}\right)$, with scale parameter $\sigma$. In the two class problem, $C = 2$, let the pairwise between class distances be $\{D_{11}, D_{12}, \dots, D_{n_1 n_2}\}$, where $D_{ij} = \|x_i - x_j\|_2^2$ is the (squared) Euclidean distance calculated between two sample vectors, $x_i$ and $x_j$, of different classes, and $n_1$ and $n_2$ are the number of elements in each class. Similarly, let the pairwise within class distances be $\{d_{11}^1, d_{12}^1, \dots, d_{n_1 n_1}^1, d_{11}^2, d_{12}^2, \dots, d_{n_2 n_2}^2\}$, where $d_{kl}^c = \|x_{ck} - x_{cl}\|_2^2$ is the (squared) Euclidean distance between sample vectors of the same class $c$. And, use $S_W$ with the normalized weights

$$\bar{\omega}_{ckl} = \frac{\exp\!\left(-\frac{2 d_{kl}^c}{\sigma}\right)}{\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\exp\!\left(-\frac{2 d_{kl}^c}{\sigma}\right)}$$

and $S_B$ with the normalized weights

$$\bar{\omega}_{1i2j} = \frac{\exp\!\left(-\frac{2 D_{ij}}{\sigma}\right)}{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\exp\!\left(-\frac{2 D_{ij}}{\sigma}\right)}.$$

Then, if $\mathrm{tr}(S_B) > \mathrm{tr}(S_W)$, $Q_1(.)$ monotonically increases with $\sigma$, i.e., $\frac{\partial Q_1}{\partial \sigma} \geq 0$.

Proof. Note that both the numerator and the denominator of $Q_1$ can be written in the form of $\sum_i\sum_j \exp\left(-2\|x_i - x_j\|_2^2/\sigma\right)$. Its partial derivative with respect to $\sigma$ is $\sum_i\sum_j \frac{2\|x_i - x_j\|_2^2}{\sigma^2}\exp\left(-2\|x_i - x_j\|_2^2/\sigma\right)$. Substituting for $D_{ij}$ and $d_{kl}^c$, we have $\frac{\partial Q_1}{\partial \sigma}$ equal to

$$\frac{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\frac{2D_{ij}}{\sigma^2}\exp\left(-\frac{2D_{ij}}{\sigma}\right)\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\exp\left(-\frac{2d_{kl}^c}{\sigma}\right)}{\left[\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\exp\left(-\frac{2d_{kl}^c}{\sigma}\right)\right]^2} - \frac{\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\frac{2d_{kl}^c}{\sigma^2}\exp\left(-\frac{2d_{kl}^c}{\sigma}\right)\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\exp\left(-\frac{2D_{ij}}{\sigma}\right)}{\left[\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\exp\left(-\frac{2d_{kl}^c}{\sigma}\right)\right]^2}.$$

We want to know when $\partial Q_1/\partial \sigma \geq 0$, which is the same as

$$\frac{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\exp\left(-\frac{2D_{ij}}{\sigma}\right) D_{ij}}{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\exp\left(-\frac{2D_{ij}}{\sigma}\right)} > \frac{\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\exp\left(-\frac{2d_{kl}^c}{\sigma}\right) d_{kl}^c}{\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\exp\left(-\frac{2d_{kl}^c}{\sigma}\right)}.$$

The left hand side of this inequality is the estimate of the between class variance, while the right hand side is the within class variance estimate, since $D_{ij}$ and $d_{ij}^c$ can be rewritten as the trace of the outer product $\mathrm{tr}\left((x_i - x_j)(x_i - x_j)^T\right)$. Substituting for the above defined $\bar{\omega}_{ckl}$ and $\bar{\omega}_{1i2j}$, we have $\partial Q_1/\partial \sigma \geq 0$ when $\mathrm{tr}(S_B) > \mathrm{tr}(S_W)$.

This latest theorem shows that when $\sigma$ approaches infinity, $\frac{\partial Q_1}{\partial \sigma}$ approaches zero and, hence, $Q_1$ tends to its maximum value of $.5$. Increasing $\sigma$ to infinity in the RBF

kernel will result in a space where the two class distributions become identical. This

will happen whenever tr(SB ) > tr(SW ). This is a fundamental theorem of DA because

it shows the relation between KDA, the weighted LDA version of [58] and the NDA

method of [31]. Theorem 3 shows that these variants of DA are related to the idea

of maximizing homoscedasticity as defined in this chapter. It also demonstrates the

importance of the metrics in weighted LDA and NDA. In particular, the above result

proves that if, after proper normalization, the between class differences are larger

than the within class differences, then classification in the kernel space optimized

with Q1 will be as bad as random selection. One indeed wants the class distributions

to become homoscedastic in the kernel space, but not at the cost of classification

accuracy, which is the underlying goal.

To address the problem outlined in Theorem 3, we need to consider a second

criterion which is directly related to class separability. Such a criterion is simply

given by the trace of the between-class (or -subclass) scatter matrix, since this is

proportional to class separability,



$$Q_2(\phi) = \mathrm{tr}\left(S_B^{\phi}\right) = \mathrm{tr}\!\left(\sum_{i=1}^{C-1}\sum_{k=i+1}^{C} p_i\, p_k\,(\mu_i^{\phi} - \mu_k^{\phi})(\mu_i^{\phi} - \mu_k^{\phi})^T\right) = \sum_{i=1}^{C-1}\sum_{k=i+1}^{C} p_i\, p_k\,\|\mu_i^{\phi} - \mu_k^{\phi}\|^2. \qquad (2.4)$$

Again, we can readily extend this result to work with subclasses,

$$Q_2(\phi, H_1, \dots, H_C) = \mathrm{tr}\left(\Sigma_B^{\phi}\right) = \sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\,\|\mu_{ij}^{\phi} - \mu_{kl}^{\phi}\|^2.$$

Since we want to maximize homoscedasticity and class separability, we need to combine the two criteria of (2.3) and (2.4),

$$Q(.) = Q_1(.)\, Q_2(.). \qquad (2.5)$$

The product given above is an appropriate way to combine independent measures of different magnitude, as is the case with $Q_1$ and $Q_2$.

Using the criterion given in (2.5), the optimal kernel function, $\phi^*$, is

$$\phi^* = \arg\max_{\phi} Q(\phi).$$

In KSDA, we optimize the number of subclasses and the kernel as

$$\phi^*, H_1^*, \dots, H_C^* = \arg\max_{\phi, H_1, \dots, H_C} Q(\phi, H_1, \dots, H_C).$$

Also, recall that in KSDA (as in SDA), we need to divide the data into subclasses. As

stated above we assume that the underlying class distribution can be approximated

by a mixture of Gaussians. This assumption suggests the following ordering of the samples: $\hat{X}_c = \{\hat{x}_1, \dots, \hat{x}_n\}$, where $\hat{x}_1$ and $\hat{x}_n$ are the two most dissimilar feature vectors and $\hat{x}_k$ is the $(k-1)$th feature vector closest to $\hat{x}_1$. This ordering allows us to divide the set of samples into $H$ subgroups, by simply dividing $\hat{X}_c$ into $H$ parts. This approach has been shown to be appropriate for finding subclass divisions [116].
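A small sketch of this ordering-and-splitting procedure (our own implementation of the description above; the names are illustrative) is:

```python
import numpy as np

def subclass_partition(Xc, H):
    """Split the samples of one class into H subclasses.

    The samples are first sorted so that the first and last elements are the two
    most dissimilar vectors and each subsequent sample is the next closest one to
    the first; the sorted list is then cut into H contiguous parts.
    """
    d2 = np.sum((Xc[:, None, :] - Xc[None, :, :]) ** 2, axis=-1)
    i, _ = np.unravel_index(np.argmax(d2), d2.shape)   # one end of the most distant pair
    order = np.argsort(d2[i])                          # ascending distance to x_hat_1
    return np.array_split(order, H)                    # H index groups (subclasses)

# Example: two well-separated blobs are recovered as two subclasses
rng = np.random.default_rng(1)
Xc = np.vstack([rng.normal(0, .5, (20, 2)), rng.normal(5, .5, (20, 2))])
groups = subclass_partition(Xc, H=2)
```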

As a final note, it is worth emphasizing that, as opposed to CV, the derived crite-

rion will use the whole data in the training set for estimating the data distributions

because there is no need for a verification set. With a limited number of training sam-

ples, this will generally yield better estimates of the unknown underlying distribution.

The other advantage of the derived approach is that it can be optimized using gra-

dient descent, by taking $\partial Q(k(x_i, x_j))/\partial \sigma$. In particular, we employ a quasi-Newton

approach with a Broyden-Fletcher-Goldfarb-Shanno Hessian update [21]. The main

advantage of this method is that it has fast convergence and does not require the

calculation of the Hessian matrix. Instead, the Hessian is updated by analyzing the

gradient vectors. The derivation of the gradient of our criterion is shown in the sec-

tion to follow. The initial value for the kernel parameter is set to be the mean of the

distances between all pairwise training samples.
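A minimal sketch of this optimization step is shown below (our own code; it relies on SciPy's quasi-Newton BFGS routine with numerically estimated gradients rather than the analytic gradient derived in the next section, and `Q` stands for any callable evaluating the criterion for a given sigma).

```python
import numpy as np
from scipy.optimize import minimize

def optimize_sigma(X, y, Q):
    """Maximize the criterion Q(sigma, X, y) with a quasi-Newton (BFGS) search.

    The starting value is the mean pairwise (Euclidean) distance between the
    training samples, as described in the text.
    """
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    sigma0 = float(np.sqrt(d2).mean())                  # initial kernel parameter
    res = minimize(lambda s: -Q(float(s[0]), X, y),     # maximize Q = minimize -Q
                   x0=np.array([sigma0]), method='BFGS')
    return float(res.x[0])
```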

2.3.2 Derivation of the Gradient
 
We take $\phi(.)$ to be the RBF function, $k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{\sigma}\right)$, with $\sigma$ the parameter to be optimized. And, we consider the case where each class distribution is modeled by a single Gaussian distribution. The derivations for the subclass case follow immediately from the ones given below.

The gradient of our criterion $Q(.)$, when considering the RBF kernel, is given by

$$\frac{\partial Q(\sigma)}{\partial \sigma} = \frac{\partial (Q_1(\sigma) Q_2(\sigma))}{\partial \sigma} = \frac{\partial Q_1(\sigma)}{\partial \sigma} Q_2(\sigma) + Q_1(\sigma)\frac{\partial Q_2(\sigma)}{\partial \sigma}.$$

The partial derivative of $Q_1(\sigma)$ with respect to the RBF parameter $\sigma$ is

$$\frac{\partial Q_1(\sigma)}{\partial \sigma} = \frac{2}{C(C-1)}\sum_{i=1}^{C-1}\sum_{k=i+1}^{C}\frac{\left(\mathrm{tr}(\Sigma_i^{\phi 2}) + \mathrm{tr}(\Sigma_k^{\phi 2})\right)\frac{\partial\,\mathrm{tr}(\Sigma_i^{\phi}\Sigma_k^{\phi})}{\partial \sigma} - \mathrm{tr}(\Sigma_i^{\phi}\Sigma_k^{\phi})\left(\frac{\partial\,\mathrm{tr}(\Sigma_i^{\phi 2})}{\partial \sigma} + \frac{\partial\,\mathrm{tr}(\Sigma_k^{\phi 2})}{\partial \sigma}\right)}{\left(\mathrm{tr}(\Sigma_i^{\phi 2}) + \mathrm{tr}(\Sigma_k^{\phi 2})\right)^2}.$$

Note that $\Sigma_i^{\phi} = \phi(X_i)(I - \mathbf{1}_{n_i})\phi(X_i)^T$, where $\phi(X_i) = (\phi(x_{i1}), \dots, \phi(x_{in_i}))$ and $\mathbf{1}_{n_i}$ is a $n_i \times n_i$ matrix with all elements equal to $1/n_i$. Then, $\mathrm{tr}(\Sigma_i^{\phi}\Sigma_k^{\phi}) = \mathrm{tr}(\phi(X_i)(I - \mathbf{1}_{n_i})\phi(X_i)^T\phi(X_k)(I - \mathbf{1}_{n_k})\phi(X_k)^T) = \mathrm{tr}(K_{ki}(I - \mathbf{1}_{n_i}) K_{ik}(I - \mathbf{1}_{n_k}))$, where $K_{ik} = \phi(X_i)^T\phi(X_k)$. Let $\tilde{K}_{ki} = K_{ki}(I - \mathbf{1}_{n_i})$ and $\tilde{K}_{ik} = K_{ik}(I - \mathbf{1}_{n_k})$. We can rewrite this result as,

$$\mathrm{tr}(\tilde{K}_{ki}\tilde{K}_{ik}) = \sum_p \sum_q \tilde{K}_{ki}^{pq}\,\tilde{K}_{ik}^{qp},$$

where $\tilde{K}_{ki}^{pq}$ is the $(p,q)$th entry of $\tilde{K}_{ki}$. Denote the partial derivative of an $m \times n$ matrix $K$ with respect to $\sigma$ as $\frac{\partial K}{\partial \sigma} = \left[\frac{\partial K^{pq}}{\partial \sigma}\right]_{p=1,\dots,m,\ q=1,\dots,n}$, with $\frac{\partial K^{pq}}{\partial \sigma} = \frac{\partial k(x_p, x_q)}{\partial \sigma} = \frac{\|x_p - x_q\|_2^2}{\sigma^2}\exp\!\left(-\frac{\|x_p - x_q\|_2^2}{\sigma}\right)$ when using the RBF function. Then,

$$\frac{\partial\,\mathrm{tr}(\Sigma_i^{\phi}\Sigma_k^{\phi})}{\partial \sigma} = \frac{\partial\,\mathrm{tr}(\tilde{K}_{ki}\tilde{K}_{ik})}{\partial \sigma} = \sum_p\sum_q\left(\frac{\partial \tilde{K}_{ki}^{pq}}{\partial \sigma}\,\tilde{K}_{ik}^{qp} + \tilde{K}_{ki}^{pq}\,\frac{\partial \tilde{K}_{ik}^{qp}}{\partial \sigma}\right) = \sum_p\sum_q\left[\left(\frac{\partial K_{ki}}{\partial \sigma}(I - \mathbf{1}_{n_i})\right)^{pq}\tilde{K}_{ik}^{qp} + \tilde{K}_{ki}^{pq}\left(\frac{\partial K_{ik}}{\partial \sigma}(I - \mathbf{1}_{n_k})\right)^{qp}\right].$$

Next, note that $Q_2(\sigma)$ can be written as

$$Q_2(\sigma) = \sum_{i=1}^{C-1}\sum_{k=i+1}^{C} p_i\, p_k\, d_{ik},$$

where

$$d_{ik} = (\mu_i^{\phi} - \mu_k^{\phi})^T(\mu_i^{\phi} - \mu_k^{\phi}) = (\phi(X_i)\mathbf{1}_i - \phi(X_k)\mathbf{1}_k)^T(\phi(X_i)\mathbf{1}_i - \phi(X_k)\mathbf{1}_k) = \mathbf{1}_i^T K_{ii}\mathbf{1}_i - 2\,\mathbf{1}_i^T K_{ik}\mathbf{1}_k + \mathbf{1}_k^T K_{kk}\mathbf{1}_k.$$

Using this notation, the gradient of $Q_2(\sigma)$ with respect to $\sigma$ is

$$\frac{\partial Q_2(\sigma)}{\partial \sigma} = \sum_{i=1}^{C-1}\sum_{k=i+1}^{C} p_i\, p_k\,\frac{\partial d_{ik}}{\partial \sigma} = \sum_{i=1}^{C-1}\sum_{k=i+1}^{C} p_i\, p_k\left(\mathbf{1}_i^T\frac{\partial K_{ii}}{\partial \sigma}\mathbf{1}_i - 2\,\mathbf{1}_i^T\frac{\partial K_{ik}}{\partial \sigma}\mathbf{1}_k + \mathbf{1}_k^T\frac{\partial K_{kk}}{\partial \sigma}\mathbf{1}_k\right).$$
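The common building block in these expressions is the element-wise derivative of the RBF kernel matrix with respect to $\sigma$; a short sketch (our own) is:

```python
import numpy as np

def rbf_kernel_and_grad(X, sigma):
    """Return K with K[p, q] = exp(-||x_p - x_q||^2 / sigma) and dK/dsigma.

    For this kernel, dK[p, q]/dsigma = (||x_p - x_q||^2 / sigma**2) * K[p, q].
    """
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    d2 = np.maximum(d2, 0.0)                 # guard against tiny negative values
    K = np.exp(-d2 / sigma)
    dK = (d2 / sigma**2) * K
    return K, dK
```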

This result allows us to iteratively determine an appropriate solution. To see

that the solution found with such a gradient descent technique is an appropriate

one, recall that Theorem 3 showed Q1 monotonically increases if tr(SB ) > tr(SW ).

In most practical problems this condition is satisfied, since otherwise the classes

mostly overlap and the classification problem is not solvable (i.e., there is a very large

classification error in the original feature space). This means there is an identifiable

global maximum. We now note that the same applies to Q2 . That is, as long as the

class distributions do not overlap significantly, Q2 has a unique maximum for a sigma

value in between the averaged within class sample distances and the averaged between

class sample distances. To see this, note that for every Q2 calculated for a pair of

classes (i.e., classes 1 and 2), there are three main components: the sum of the kernel

matrix elements in class 1, in class 2, and between classes 1 and 2. Each of these

components monotonically increases with respect to sigma (starting with 1/n1 , 1/n2 ,

0, and converging to 1). The fastest increases occur for sigma around the averaged

distance in that component; e.g., for within class 1, this will be around the averaged

distance of the samples in that class. This means that the within class components

will converge earlier than the between class distances. Hence, the sum of the within

class subtracted with two times the between class elements (in the kernel matrix)

will result in a maximum in between the averaged within class sample distances and

between class sample distances.

In some applications where our conditions may not hold, it would be appropriate

to test a few starting values to determine the best solution. We did not require this

procedure in our experiments.

2.3.3 Generalization

A major goal in pattern recognition is to find classification criteria that have a

small generalization error, i.e., small expected error on the unobserved data. This

mainly depends on the number of samples in our training set, training error and the

model (criterion) complexity [42]. Since the training set is usually fixed, we are left

to select a proper model. Smooth (close to linear) classifiers have a small model com-

plexity but large training error. On the other hand, wiggly classifiers may have a small

training error but large model complexity. To have a small generalization error, we

need to select a model that has moderate training error and model complexity. Thus,

in general, the simpler the classifier, the smaller the generalization error. However, if

the classifier is too simple, the training error may be very large.

KDA is limited in terms of model complexity. This is mainly because KDA as-

sumes each class is represented with unimodal distributions. If there is a multimodal

structure in each class, KDA would select wiggly functions in order to minimize the

Figure 2.2: Here we show a two class classification problem with multi-modal class distributions. When $\sigma = 1$ both KDA (a) and KSDA (b) generate solutions that have small training error. (c) However, when the model complexity is small, $\sigma = 3$, KDA fails. (d) KSDA's solution resolves this problem with piecewise smooth, nonlinear classifiers.

classification error. To avoid this, the model complexity may be limited to smooth

solutions, which would generally result in large training errors and, hence, large gen-

eralization errors.

This problem can be solved by using an algorithm that considers multimodal

class representations, e.g., KSDA. While KDA can find wiggly functions to separate

multimodal data, KSDA can find several functions which are smoother and carry

smaller training errors. We can illustrate this theoretical advantage of KSDA with a

simple 2-class classification example, Fig. 2.2. In this figure, each class consists of 2

nonlinearly separable subclasses. Fig. 2.2(a) shows the solution of KDA obtained with

the RBF kernel with $\sigma = 1$. Fig. 2.2(b) shows the KSDA solution. KSDA can obtain a

classification function that has the same training error with smaller model complexity,

i.e., smoother classification boundaries. When we reduce the model complexity by

increasing $\sigma$ to 3, KDA leads to a large training error, Fig. 2.2(c). This does not

occur in KSDA, Fig. 2.2(d). A similar argument can be used to explain the problems

faced with Maximum Likelihood (ML) classification when modeling the original data

as a Mixture of Gaussians (MoG) in the original space. Unless one has access to a

Figure 2.3: The original data distributions are mapped to different kernel spaces via different mapping functions $\phi(.)$. $\phi_2(.)$ is better than $\phi_1(.)$ in terms of the Bayes error.

sufficiently large set (i.e., proportional to the number of dimensions of this original

feature space), the results will not generalize well.

2.4 Kernel Bayes accuracy criterion

The second criterion we will define in this chapter is directly related to the concept

of Bayes classification error. The idea is to learn the kernel parameters by finding a

kernel representation where the Bayes classification error is minimized across all the

mappings. This is illustrated in Figure 2.3. We start with an analysis of LDA. One

of the drawbacks of LDA is that its solution is biased toward those classes that are

furthest apart. To see this, note that LDA is based on least-squares (i.e., an eigenvalue

decomposition defined to solve a system of homogeneous equations [40]). Thus, the

LDA solution tends to over-weight the classes that were already well-separated in

the original space. In order to downplay the roles of the class distributions that are

farthest apart, [58] introduces a weighted version of SB , defined as


$$\hat{S}_B = \sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\Delta_{ij})(\mu_i - \mu_j)(\mu_i - \mu_j)^T, \qquad (2.6)$$

where $\Delta_{ij}^2 = (\mu_i - \mu_j)^T\Sigma_X^{-1}(\mu_i - \mu_j)$ is the Mahalanobis distance between classes $i$ and $j$, $\omega: \mathbb{R}_0^+ \rightarrow \mathbb{R}_0^+$ is a weighting function, $\omega(\Delta_{ij}) = \frac{1}{2\Delta_{ij}^2}\,\mathrm{erf}\!\left(\frac{\Delta_{ij}}{2\sqrt{2}}\right)$, and $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$ is the error function.

One advantage of (2.6) is that it is related to the mean pairwise Bayes accuracy

[58] (i.e., one minus the Bayes error), since


$$J(L) = \sum_{m=1}^{d}\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\Delta_{ij})\,\mathrm{tr}(e_m^T S_{ij} e_m), \qquad (2.7)$$

where $L = (e_1, \dots, e_d)$ is the eigenvector matrix of $\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\Delta_{ij}) S_{ij}$, $S_{ij} = (\mu_i - \mu_j)(\mu_i - \mu_j)^T$ are the pairwise class distances, and, for simplicity, we have assumed $\Sigma_X = I_p$, $I_p$ an identity matrix with dimension $p \times p$.
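A short sketch of this weighting function (our own code, using SciPy's error function) is:

```python
import numpy as np
from scipy.special import erf

def apac_weight(delta):
    """w(delta) = 1/(2 delta^2) * erf(delta / (2 sqrt(2))), as in (2.6)."""
    delta = np.asarray(delta, dtype=float)
    return erf(delta / (2.0 * np.sqrt(2.0))) / (2.0 * delta**2)

# Far-apart class pairs receive a much smaller weight than nearby ones
print(apac_weight([0.5, 2.0, 8.0]))
```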

2.4.1 Bayes accuracy in the kernel space

As mentioned above, (2.7) is proportional to the Bayes accuracy and as such it

can be employed to improve LDA [58]. We want to derive a similar function for its

use in the kernel space.

Let $\phi(.): \mathbb{R}^p \rightarrow \mathcal{F}$ be a function defining the kernel map. We also assume the data

has already been whitened in the kernel space. Denote the data matrix in the kernel

space $\phi(X)$, where $\phi(X) = (\phi(x_{11}), \dots, \phi(x_{in_i}), \dots, \phi(x_{Cn_C}))$. The kernel matrix is given by $K = \phi(X)^T\phi(X)$.

Using this notation, the covariance matrix in the kernel space can be written as $\Sigma_X^{\phi} = \frac{1}{n}\phi(X)(I_n - P_n)\phi(X)^T$, where $I_n$ is the $n \times n$ identity matrix, and $P_n$ is a $n \times n$ matrix with all elements equal to $1/n$. The whitened data matrix $\tilde{\phi}(X)$ is now given by $\tilde{\phi}(X) = \Lambda^{-\frac{1}{2}} V^T\phi(X)$, where $\Lambda$ and $V$ are the eigenvalue and eigenvector matrices given by $\Sigma_X^{\phi} V = V\Lambda$. We know from the Representer's Theorem [96] that a projection vector lies in the span of the samples in the kernel space $\phi(X)$, i.e., $V = \phi(X)\Gamma$, where $\Gamma$ is a corresponding coefficient matrix. Thus, we have

$$\tilde{\phi}(X) = \Lambda^{-\frac{1}{2}} V^T\phi(X) = \Lambda^{-\frac{1}{2}}\Gamma^T\phi(X)^T\phi(X) = \Lambda^{-\frac{1}{2}}\Gamma^T K,$$

where $\Gamma$ and $\Lambda$ can be calculated from the generalized eigenvalue decomposition problem $N\Gamma = K\Gamma\Lambda$, with $N = \frac{1}{n} K(I_n - P_n)K$. With this trick, we transform the kernel covariance matrix $\Sigma_X^{\phi}$ into the identity matrix.

Next, define the mean of class i in the kernel space as

$$\tilde{\mu}_i = \tilde{\phi}(X_i)\mathbf{1}_i, \qquad (2.8)$$

where $\tilde{\phi}(X_i) = (\tilde{\phi}(x_{i1}), \dots, \tilde{\phi}(x_{in_i}))$, and $\mathbf{1}_i$ is a $n_i \times 1$ vector with all elements equal to $1/n_i$. Let $\tilde{K}_i = \tilde{\phi}(X)^T\tilde{\phi}(X_i)$ denote the subset of the whitened kernel matrix for the samples in class $i$.

Combining the above results, we can define the Bayes accuracy in the kernel space as

$$Q(\phi) = \sum_{m=1}^{d}\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\tilde{\Delta}_{ij})\, e_m^T S_{ij}^{\phi}\, e_m, \qquad (2.9)$$

where $e_1, \dots, e_d$ are the eigenvectors of the weighted kernel between-class scatter matrix

$$\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\tilde{\Delta}_{ij})\, S_{ij}^{\phi},$$

$S_{ij}^{\phi} = (\tilde{\mu}_i - \tilde{\mu}_j)(\tilde{\mu}_i - \tilde{\mu}_j)^T$, the Mahalanobis distance $\tilde{\Delta}_{ij}$ in the whitened kernel space becomes the Euclidean distance,

$$\tilde{\Delta}_{ij}^2 = (\tilde{\mu}_i - \tilde{\mu}_j)^T(\tilde{\mu}_i - \tilde{\mu}_j) = (\tilde{\phi}(X_i)\mathbf{1}_i - \tilde{\phi}(X_j)\mathbf{1}_j)^T(\tilde{\phi}(X_i)\mathbf{1}_i - \tilde{\phi}(X_j)\mathbf{1}_j) = \mathbf{1}_i^T\tilde{K}_{ii}\mathbf{1}_i - 2\,\mathbf{1}_i^T\tilde{K}_{ij}\mathbf{1}_j + \mathbf{1}_j^T\tilde{K}_{jj}\mathbf{1}_j, \qquad (2.10)$$

and $\tilde{K}_{ij} = \tilde{\phi}(X_i)^T\tilde{\phi}(X_j)$ is the subset of the kernel matrix for the samples in classes $i$ and $j$.

From the Representer's Theorem [96], we know that $e_i = \tilde{\phi}(X) u_i$, where $u_i$ is a coefficient vector. Then, using (2.8) we have $e_m^T S_{ij}^{\phi} e_m = u_m^T S_{ij} u_m$, where $S_{ij} = (\tilde{K}_i\mathbf{1}_i - \tilde{K}_j\mathbf{1}_j)(\tilde{K}_i\mathbf{1}_i - \tilde{K}_j\mathbf{1}_j)^T$, and $u_1, \dots, u_d$ are the eigenvectors of $\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\tilde{\Delta}_{ij}) S_{ij}$. Therefore, criterion (2.9) can be rewritten as

$$Q(\phi) = \sum_{m=1}^{d}\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\tilde{\Delta}_{ij})\, u_m^T S_{ij}\, u_m. \qquad (2.11)$$

By maximizing $Q(\phi)$, we favor a kernel representation where the sum of pairwise Bayes accuracies is maximized. The optimal kernel function, $\phi^*$, is given by

$$\phi^* = \arg\max_{\phi} Q(\phi).$$

We will refer to the derived criterion given in (2.11) as the Kernel Bayes Accuracy (KBA) criterion.
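As an illustration of how (2.10) and (2.11) can be evaluated, the following sketch (our own code; it assumes `K` already holds the whitened kernel matrix and uses the fact that the sum over the leading eigenvectors of the weighted scatter equals the sum of its largest eigenvalues) computes the KBA value.

```python
import numpy as np
from scipy.special import erf

def kba_criterion(K, y, d):
    """Evaluate the KBA criterion (2.11) from a whitened kernel matrix K (n x n)."""
    n = K.shape[0]
    classes = np.unique(y)
    M = np.zeros((n, n))                 # weighted kernel between-class scatter
    for a, ci in enumerate(classes):
        for cj in classes[a + 1:]:
            ii, jj = np.where(y == ci)[0], np.where(y == cj)[0]
            one_i = np.full(len(ii), 1.0 / len(ii))
            one_j = np.full(len(jj), 1.0 / len(jj))
            d2 = (one_i @ K[np.ix_(ii, ii)] @ one_i
                  - 2 * one_i @ K[np.ix_(ii, jj)] @ one_j
                  + one_j @ K[np.ix_(jj, jj)] @ one_j)        # eq. (2.10)
            delta = np.sqrt(max(d2, 1e-12))
            w = erf(delta / (2 * np.sqrt(2))) / (2 * delta**2)
            diff = K[:, ii] @ one_i - K[:, jj] @ one_j        # K_i 1_i - K_j 1_j
            M += (len(ii) / n) * (len(jj) / n) * w * np.outer(diff, diff)
    evals = np.linalg.eigvalsh(M)
    return float(np.sum(evals[-d:]))                          # top-d eigenvalues
```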

2.4.2 Kernel parameters with gradient ascent

The first application of the above derived criterion is in determining the value

of the parameters of a kernel function. For example, if we are given the Radial Basis Function (RBF) kernel, $k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, our goal is to determine an appropriate value of the variance $\sigma$.

To determine our solution, we employ a quasi-Newton method with a Broyden-

Fletcher-Goldfarb-Shanno Hessian update [21], yielding a fast convergence.

To compute the derivative of our criterion, note that (2.11) can be rewritten as

$$Q(\sigma) = \mathrm{tr}\!\left(\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\tilde{\Delta}_{ij})\, S_{ij}\right) = \sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\tilde{\Delta}_{ij})\,\mathrm{tr}(S_{ij}).$$

Taking the partial derivative with respect to $\sigma$ in the RBF kernel, we have

$$\frac{\partial Q(\sigma)}{\partial \sigma} = \sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\left(\frac{\partial\omega(\tilde{\Delta}_{ij})}{\partial \sigma}\,\mathrm{tr}(S_{ij}) + \omega(\tilde{\Delta}_{ij})\,\frac{\partial\,\mathrm{tr}(S_{ij})}{\partial \sigma}\right).$$

Denote the partial derivative of an $m \times n$ matrix $K$ with respect to $\sigma$ as $\frac{\partial K}{\partial \sigma} = \left[\frac{\partial K^{ij}}{\partial \sigma}\right]_{i=1,\dots,m,\ j=1,\dots,n}$, with $\frac{\partial K^{ij}}{\partial \sigma} = \frac{\partial k(x_i, x_j)}{\partial \sigma} = \frac{\|x_i - x_j\|^2}{\sigma^3}\exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$. Then

$$\frac{\partial\omega(\tilde{\Delta}_{ij})}{\partial \sigma} = \left(-\frac{\mathrm{erf}(\tilde{\Delta}_{ij}/2\sqrt{2})}{\tilde{\Delta}_{ij}^3} + \frac{\exp(-\tilde{\Delta}_{ij}^2/8)}{2\sqrt{2\pi}\,\tilde{\Delta}_{ij}^2}\right)\frac{\partial\tilde{\Delta}_{ij}}{\partial \sigma},$$

where $\frac{\partial\tilde{\Delta}_{ij}}{\partial \sigma} = \frac{1}{2\tilde{\Delta}_{ij}}\left(\mathbf{1}_i^T\frac{\partial\tilde{K}_{ii}}{\partial \sigma}\mathbf{1}_i - 2\,\mathbf{1}_i^T\frac{\partial\tilde{K}_{ij}}{\partial \sigma}\mathbf{1}_j + \mathbf{1}_j^T\frac{\partial\tilde{K}_{jj}}{\partial \sigma}\mathbf{1}_j\right)$. Finally,

$$\frac{\partial\,\mathrm{tr}(S_{ij})}{\partial \sigma} = \frac{\partial(\tilde{K}_i\mathbf{1}_i - \tilde{K}_j\mathbf{1}_j)^T(\tilde{K}_i\mathbf{1}_i - \tilde{K}_j\mathbf{1}_j)}{\partial \sigma} = \mathbf{1}_i^T\frac{\partial\tilde{K}_i^T}{\partial \sigma}\tilde{K}_i\mathbf{1}_i + \mathbf{1}_i^T\tilde{K}_i^T\frac{\partial\tilde{K}_i}{\partial \sigma}\mathbf{1}_i - 2\,\mathbf{1}_j^T\frac{\partial\tilde{K}_j^T}{\partial \sigma}\tilde{K}_i\mathbf{1}_i - 2\,\mathbf{1}_j^T\tilde{K}_j^T\frac{\partial\tilde{K}_i}{\partial \sigma}\mathbf{1}_i + \mathbf{1}_j^T\frac{\partial\tilde{K}_j^T}{\partial \sigma}\tilde{K}_j\mathbf{1}_j + \mathbf{1}_j^T\tilde{K}_j^T\frac{\partial\tilde{K}_j}{\partial \sigma}\mathbf{1}_j.$$

2.4.3 Subclass extension

Another application of the derived KBA criterion is in determining the number

of subclasses in Subclass Discriminant Analysis (SDA) [116] and its kernel extension.

KDA assumes that each class has a single Gaussian distribution in the kernel space.

However, this may be too restrictive since it is usually difficult to find a kernel rep-

resentation where the class distributions are single Gaussians. In order to relax this

assumption, we can describe each class using a mixture of Gaussians. Using this idea,

we can reformulate (2.11) as


$$Q^{sub}(\phi, H_1, \dots, H_C) = \sum_{m=1}^{d}\sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\,\omega(\tilde{\Delta}_{ij,kl})\, u_m^T S_{ij,kl}\, u_m, \qquad (2.12)$$

where $H_i$ is the number of subclasses in class $i$, $u_1, \dots, u_d$ are $d$ eigenvectors of the kernel version of the weighted between-subclass scatter matrix

$$\sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\,\omega(\tilde{\Delta}_{ij,kl})\, S_{ij,kl},$$

$S_{ij,kl} = (M_{ij}\mathbf{1}_{ij} - M_{kl}\mathbf{1}_{kl})(M_{ij}\mathbf{1}_{ij} - M_{kl}\mathbf{1}_{kl})^T$, $M_{ij} = \tilde{\phi}(X)^T\tilde{\phi}(X_{ij})$, $\tilde{\phi}(X_{ij}) = (\tilde{\phi}(x_{ij1}), \dots, \tilde{\phi}(x_{ijn_{ij}}))$, $x_{ijk}$ is the $k$th sample of subclass $j$ in class $i$, $\mathbf{1}_{ij}$ is a $n_{ij} \times 1$ vector with all elements equal to $1/n_{ij}$, and $n_{ij}$ the number of samples in the $j$th subclass of class $i$. Note that in the above equation, the whitened Mahalanobis distance is given by

$$\tilde{\Delta}_{ij,kl}^2 = (\tilde{\mu}_{ij} - \tilde{\mu}_{kl})^T(\tilde{\mu}_{ij} - \tilde{\mu}_{kl}) = \mathbf{1}_{ij}^T\tilde{K}_{ij,ij}\mathbf{1}_{ij} - 2\,\mathbf{1}_{ij}^T\tilde{K}_{ij,kl}\mathbf{1}_{kl} + \mathbf{1}_{kl}^T\tilde{K}_{kl,kl}\mathbf{1}_{kl},$$

where $\tilde{K}_{ij,kl} = \tilde{\phi}(X_{ij})^T\tilde{\phi}(X_{kl})$. The optimal kernel function and subclass divisions are given by

$$\phi^*, H_1^*, \dots, H_C^* = \arg\max_{\phi, H_1, \dots, H_C} Q^{sub}(\phi, H_1, \dots, H_C).$$

2.4.4 Optimal subclass discovery

In KSDA we are simultaneously optimizing the kernel parameter and the number

of subclasses. It is in fact advantageous to do so, because it will allow us to find

the Bayes optimal solution when the classes need to be described with a mixture

of Gaussians in the kernel space. Furthermore, we can automatically determine the

underlying structure of the data. This last point is important in many applications.

We illustrate this with a set of examples.

In our case study, we generated a set of 120 samples for each of the two classes.

Each class was represented by a mixture of two Gaussians, with mean and diagonal

covariance randomly initialized. Then, (2.12) was employed to determine the ap-

propriate number of subclasses and parameter of the RBF kernel. This process was

repeated 100 times, each with a different random initialization of the means and co-

variances. The average of the maxima of (2.12) for each value of Hi (with H1 = H2 )

are shown in Fig. 2.4(a). We see that the derived criterion is on average higher for

the correct number of subclasses. We then repeated the process described in this

paragraph for the cases of 3, 4 and 5 subclasses per class. The results are in Fig.

2.4(b-d). Again, the maximum of (2.12) corresponds to the correct number of sub-

classes. Therefore, the proposed criterion can generally be efficiently employed to

discover the underlying structure of the data. For comparison, in Fig. 2.4(e-h) we

show the plots of the Fisher criterion described earlier. We see that this criterion does

not recover the correct number of subclasses and is generally monotonically increas-

ing, thus, tending to select large values for Hi . This is because the Fisher criterion

maximizes the between-subclass scatter and, generally, the larger Hi , the larger the

scatter.

As a more challenging case, we also consider the well-known XOR data classifi-

cation problem, Fig. 2.5(a). The values of (2.12) for different Hi are plotted in Fig.

2.5(b) and those of the Fisher criterion in (c). Once more, we see that the KBA

criterion is capable of accurately recovering the number of subclasses, whereas the

Fisher criterion is not.

Figure 2.4: Comparative results between the (a-d) KBA and (e-h) Fisher criteria. The true underlying number of subclasses per class are (a,e) 2, (b,f) 3, (c,g) 4, and (d,h) 5. The x-axis specifies the number of subclasses $H_i$. The y-axis shows the value of the criterion given in (2.12) in (a-d) and of the Fisher criterion in (e-h).

Figure 2.5: (a) The classical XOR classification problem. (b) Plot of the KBA criterion versus $H_i$. (c) Plot of the Fisher criterion.

2.5 Experimental Results

2.5.1 Homoscedastic criterion

In this section, we will use our homoscedastic criterion to optimize the kernel

parameter of KDA, KNDA and KSDA. We will give comparative results with CV,

the Fisher criterion of [98], the use of the Bregman divergence, and other nonlinear

methods (Kernel PCA (KPCA), HLDA and LPP) and related linear approaches (LDA, NDA, RDA, SDA, and aPAC). The dimensionality of the reduced space is

taken to be the rank of the matrices used by the DA approach and to keep 90% of the

variance in PCA and KPCA. We also provide comparisons with Kernel Support Vector

Machines (KSVM) [93] and the use of ML in MoG [65], two classical alternatives for

nonlinear classification.

Databases and notation

The first five data-sets are from the UCI repository [7]. The Monk problem is given

by a 6-dimensional feature space defining six joints of a robot and two classes. Three

different case scenarios are considered, denoted Monk 1, 2 and 3. The Ionosphere

set corresponds to satellite imaging for the detection of two classes (structure or not)

in the ground. And, in the NIH Pima set, the goal is to detect diabetes from eight

measurements.

We also use the ETH-80 [53] database. It includes a total of 3, 280 images of the

following 8 categories: apples, pears, cars, cows, horses, dogs, tomatoes and cups.

Each category includes 10 objects (e.g., ten apples), Figure 2.6. Each of the (80)

objects has been photographed from 41 orientations. We resized all the images to

$25 \times 30$ pixels. The pixel values in their vector form ($x \in \mathbb{R}^{750}$) are used in the


Figure 2.6: Shown here are (a) 8 categories in ETH-80 database and (b) 10 different objects
for the cow category.

appearance-based recognition approach. As it is typical in this database, we will use

the leave-one-object-out test. That is, the images of 79 objects are used for training,

those of the remaining object for testing. We test all options and calculate the average

recognition rate.

We also use 100 randomly selected subjects from the AR face database [61]. All

images are first aligned with respect to their eyes, mouth and jaw line before cropping

and resizing them to a standard size of $29 \times 21$ pixels. This database contains images

of two different sessions, each taken two weeks apart. The images in the first and

second session contain the same facial expressions and occlusions and were taken

under the same illumination conditions. We use the images in the first session for

training and those in the second session for testing.

We also use the Sitting Posture Distribution Maps data-set (SPDM) of [117].

Here, samples were collected using a chair equipped with a pressure sensor sheet

located on the sit-pan and back-rest. The pressure maps provide a total of 1, 280

pressure values. The database includes samples of 50 individuals. Each participant

provided five samples of each of the ten different postures. Our goal is to classify

each of the samples into one of the ten sitting postures. This task is made difficult by

the nonparametric nature of the samples in each class [117]. We randomly selected 3

samples from each individual and posture for training, and used the rest for testing.

The Modified National Institute of Standards and Technology (MNIST) database

of [52] is a large collection of various sets of handwritten digit (0-9). The training

set consists of 60,000 samples. The test set has 10,000 samples. All the digits have

been size-normalized to 28 28. We randomly select 30,000 samples for training,

with 3,000 samples in each class. This is done to reduce the size of the Gram matrix,

allowing us to run the algorithm on a desktop.

As defined above, we employ the RBF kernel. The kernel parameter in KPCA

is optimized with CV. CV is also used in KDA, KNDA and KSDA, denoted: KDACV ,

KNDACV and KSDACV . The kernel parameter is searched in the range [m - 2st, m +

2st], where m and st are the mean and standard deviation of the distances between all

pairwise training samples. We use 10-fold cross validation in the UCI data-sets and

5-fold cross validation in the others. In KNDA and KSDA, the number of nearest

neighbors and subclasses are also optimized. In KSDA, we test partitions from 1

to 10 subclasses. We also provide comparative results when optimizing with the

approach of [98], denoted: KDAF , KNDAF and KSDAF . The two parameters of

LPP (i.e., the number of nearest neighbors, and the heat kernel) are optimized with

CV. The DA algorithms with our Homoscedastic-based optimization will be denoted:

KDAH , KNDAH and KSDAH . The same algorithms optimized using Bregman are

denoted: KDAB , KNDAB and KSDAB .

Results

The algorithms summarized above are first employed to find the subspace where

the feature vectors of different classes are most separated according to the algorithm's

criterion. In the reduced space we employ a variety of classification methods.

In our first experiment, we use the nearest mean (NM) classifier. The NM is

an ideal classifier because it provides the Bayes optimal solution whenever the class

distributions are homoscedastic Gaussians [32]. Thus, the results obtained with the

NM will illustrate whether the derived criterion has achieved the desirable goal. The

results are shown in Table 2.1. We see that the kernel algorithms optimized with the

proposed Homoscedastic-based criterion generally obtain higher classification rates.

To further illustrate this point, the table includes a rank of the algorithms following

the approach of [20]. As predicted by our theory, the additional flexibility of KSDA

allows it to achieve the best results.

Our second choice of classifier is the classical nearest neighbor (NN) algorithm.

Its classification error is known to be less than twice the Bayes error. This makes it

appropriate for the cases where the class distributions are not homoscedastic. These

results are in Table 2.2. A recently proposed classification algorithm [75] emphasizes

smoother classification boundaries in the NN framework. This algorithm is based

on the approximation of the nonlinear decision boundary using the sample points

closest to the classification boundary. The classification boundary is smoothed using

Tikhonov regularization. Since our criterion is used to make the classifier in the kernel

space as linear as possible, smooth (close to linear) classifiers are consistent with this

goal and should generally lead to better results. We present the results obtained with

this alternative approach in Table 2.3.

Table 2.1: Recognition rates (in percentages) with nearest mean
Data set ksdaH ksdaF ksdaB ksdaCV kdaH kdaF kdaB kdaCV kndaH kndaF kndaB kndaCV
ETH-80 82.6* 73.5 61.7 77.4 82.6* 81.6 61.7 71.6 76.2 74.6 65.6 73.6
AR database 88.1* 78.2 65.5 84.2 87.5* 86.7 69.5 84.2 71.3 61.4 72.5 74.3
SPDM 84.6* 80.1 67.9 83.9* 84.6* 83.2 67.9 83.3 82.4 82.9 53.4 75.6
Monk1 88.2* 85.0 71.1 88.0* 84.0 89.6* 65.3 83.1 70.1 65.7 50.0 63.4
Monk2 76.6 82.2* 56.7 74.5 80.1 75.2 55.6 70.1 73.5 64.8 61.8 71.8
Monk3 96.3* 88.7 85.4 94.0 93.1 89.7 85.7 82.4 67.6 63.7 77.8 66.4
Ionosphere 93.4 84.8 88.1 96.0* 93.4 86.1 67.6 80.8 74.8 62.3 65.6 78.2
Pima 80.4* 77.4 70.2 80.4* 78.6 75.0 75.0 72.6 65.5 67.3 70.8 66.7
Mnist 98.0* 96.9 92.0 97.4 98.1* 96.6 92.0 97.2 94.6 94.3 93.1 96.4
Rank 1.9* 7.0 13.3 3.6 2.8 5.4 14.2 9.2 12.2 14.7 15.8 13.3
Data set mog ksvm kpca pca lda nda apac hlda lpp rda sda
ETH-80 69.2 81.8 56.9 56.5 63.3 64.9 64.0 58.2 65.9 71.6 70.9
AR database 75.5 86.7 42.2 24.0 79.3 69.7 24.2 67.4 46.2 78.6 79.3
SPDM 73.4 84.7* 62.6 66.4 44.5 52.5 65.3 68.0 54.7 59.5 69.3
Monk1 80.3 83.6 67.4 66.0 64.6 64.8 66.0 66.2 44.4 72.0 66.7
Monk2 75.9 82.6 53.7 53.5 55.1 60.0 53.5 53.5 48.6 60.0 55.1
Monk3 89.4 93.5 78.9 80.6 63.9 81.3 80.6 81.3 75.5 86.3 80.8
Ionosphere 82.1 96.0 89.4 62.3 57.0 92.1 62.3 90.1 55.0 82.8 90.1
Pima 75.0 79.2 50.0 56.0 61.3 74.4 56.0 77.4 67.9 66.7 61.3
Mnist 88.6 97.6* 80.6 82.2 86.7 85.9 82.2 85.5 80.1 87.0 88.2
Rank 9.8 2.7 18.0 19.1 18.3 14.4 18.4 14.1 19.9 12.4 12.8

Note that the results obtained with the Homoscedastic criterion are generally better than
those given by the Fisher, Bregman and CV criteria. The best of the three results in each
of the discriminant methods is bolded. The symbol * is used to indicate the top result
among all algorithms. Rank goes from smallest (best) to largest.

Finally, recall that the goal of the Homoscedastic criterion is to make the Bayes

classifier in the kernel space linear. If this goal were achieved, one would expect a

linear classifier such as linear Support Vector Machines (SVM) to yield good classifi-

cation results in the corresponding subspace. We verified this hypothesis in our final

experiment, Table 2.4.

As mentioned earlier, the advantage of the proposed criterion is not only that it

achieves higher classification rates, but that it does so at a lower computational cost,

Table 2.5. Note that the proposed approach generally reduces the running time by

one order of magnitude.

Table 2.2: Recognition rates (%) with nearest neighbor
Data set ksdaH ksdaF ksdaB ksdaCV kdaH kdaF kdaB kdaCV kndaH kndaF kndaB kndaCV
ETH-80 82.8* 73.6 62.3 76.8 82.8* 81.0 62.3 71.6 76.2 74.6 68.0 70.6
AR database 96.7* 78.3 66.9 84.2 88.3 87.5 71.3 84.2 69.2 64.2 70.6 70.2
SPDM 84.9* 80.1 68.2 83.7 84.9* 84.2 68.2 83.3 73.9 75.6 33.5 70.3
Monk1 89.1* 84.5 78.2 87.5 84.3 89.6* 72.5 83.1 78.2 77.1 74.5 72.2
Monk2 77.8 83.1 86.1 75.7 80.1 75.2 77.6 70.1 85.0* 81.0 79.9 78.5
Monk3 94.4* 87.7 81.5 89.8 93.5 88.0 89.4 82.4 82.1 81.3 77.6 80.3
Ionosphere 94.4 84.8 91.4 94.0 94.4 86.5 70.9 80.8 87.4 86.1 90.1 86.1
Pima 75.0 73.8 66.7 76.8 70.2 69.8 64.9 72.6 67.3 67.3 66.1 69.1
Mnist 97.8* 96.9 91.8 97.2 97.2 97.1 91.8 96.7 95.6 95.4 92.1 95.5
Rank 2.9* 8.0 13.6 5.3 3.7 7.7 15.4 10.8 11.3 12.7 15.7 14.1
Data set mog ksvm kpca pca lda nda apac hlda lpp rda sda
ETH-80 69.2 81.8 62.2 64.3 64.3 59.8 73.6 56.5 63.6 71.6 70.6
AR database 75.5 86.7 42.5 58.6 77.7 77.0 59.1 67.5 41.8 78.6 77.7
SPDM 73.4 84.7 75.0 81.5 66.5 48.8 81.1 65.3 54.1 59.5 66.1
Monk1 80.3 83.6 90.3* 81.3 69.0 68.3 81.0 84.2 61.6 72.0 75.7
Monk2 75.9 82.6 68.3 66.7 67.4 82.6 79.6 83.6 82.4 60.0 67.4
Monk3 89.4 93.5 87.8 87.3 70.6 83.6 88.4 84.5 80.6 86.3 85.9
Ionosphere 82.1 96.0* 89.4 92.1 74.8 88.8 92.1 88.7 68.2 82.8 93.4
Pima 75.0 79.2 56.0 64.3 57.7 69.1 62.5 68.5 66.8 66.7 57.7
Mnist 88.6 97.6 94.1 90.1 89.7 85.6 89.3 80.6 96.0 87.0 93.7
Rank 12.6 3.2 14.2 14.3 18.4 15.4 12.7 14.3 17.4 16.2 14.1

Table 2.3: Recognition rates (%) with the smooth nearest-neighbor classifier
Data set ksdaH ksdaF ksdaB ksdaCV kdaH kdaF kdaB kdaCV kndaH kndaF kndaB kndaCV
ETH-80 83.5* 73.9 62.3 76.4 83.5* 82.8 62.3 72.9 76.2 74.2 68.2 71.2
AR database 96.6* 78.5 66.9 85.1 90.6 86.7 71.3 85.1 70.9 63.2 70.6 72.6
SPDM 84.3* 75.3 68.2 83.9* 84.3* 83.4 68.2 82.6 75.6 77.9 35.6 71.5
Monk1 90.2* 76.6 71.5 82.9 89.6 87.7 72.2 88.7 65.2 62.0 61.4 62.3
Monk2 83.3* 77.5 60.6 75.7 80.6 82.9 73.8 78.5 74.1 64.8 62.3 56.9
Monk3 94.6* 83.3 86.1 86.3 93.5 92.4 89.4 91.2 68.5 64.8 85.4 66.2
Ionosphere 94.3 84.8 84.8 86.1 94.3 86.8 80.1 86.8 80.8 82.8 77.5 78.1
Pima 80.4* 76.8 79.2 76.2 78.6 73.0 64.9 69.0 72.0 67.9 69.0 67.9
Mnist 97.8* 96.9 91.8 97.3 97.2 97.2 91.8 96.7 95.6 95.4 92.1 95.6
Rank 1.2* 9.4 14.4 6.7 2.7 4.6 15 6.9 14.2 14.7 17.6 16.1
Data set mog ksvm kpca pca lda nda apac hlda lpp rda sda
ETH-80 69.2 81.8 60.3 67.1 64.3 63.5 71.2 59.1 64.3 71.6 72.3
AR database 75.5 86.7 49.5 44.5 70.9 77.3 60.2 67.5 35.5 78.6 70.9
SPDM 73.4 84.7 75.1 77.0 56.2 50.2 81.2 53.4 50.2 59.5 69.5
Monk1 80.3 83.6 77.3 78.2 67.4 77.8 69.4 71.5 59.0 72.0 79.2
Monk2 75.9 82.6 58.6 56.7 70.6 70.6 70.4 58.3 72.0 60.0 70.6
Monk3 89.4 93.5 91.2 89.7 70.8 91.9 89.6 93.8 87.0 86.3 90.5
Ionosphere 82.1 96.0* 82.1 82.1 74.8 83.4 91.1 94.0 62.9 82.8 89.4
Pima 75.0 79.2 60.7 70.2 57.7 70.2 63.8 72.6 66.1 66.7 57.7
Mnist 88.6 97.6 94.1 90.1 89.8 86.0 89.4 82.6 96.1 87.0 93.5
Rank 11.6 2.7 15.6 14.6 17.8 13.4 13.9 14.9 18.1 14.9 11.9

55
Table 2.4: Recognition rates (%) with linear SVM
Data set ksdaH ksdaF ksdaB ksdaCV kdaH kdaF kdaB kdaCV kndaH kndaF kndaB kndaCV
ETH-80 83.0* 73.6 61.9 77.4 83.0* 82.2 61.9 71.3 75.6 75.2 65.6 74.6
AR database 88.1* 79.6 65.5 83.1 87.5* 86.7 69.5 83.1 79.4 75.7 72.5 78.6
SPDM 82.1 84.6* 67.5 82.3 82.1 83.6 67.5 82.6 82.2 82.9 52.7 84.0
Monk1 89.1* 88.2 50.0 86.1 84.7 89.7* 52.1 86.1 69.9 62.5 50.0 63.4
Monk2 77.1 81.5 67.1 73.8 80.1 75.2 67.1 75.1 67.1 83.1* 67.1 67.1
Monk3 95.6* 91.9 47.2 94.4 92.8 89.1 47.2 81.5 81.7 81.7 47.2 81.0
Ionosphere 93.4 86.1 82.1 96.7* 93.4 86.1 82.1 82.1 82.1 82.1 82.1 82.1
Pima 79.8* 78.6 64.9 79.8* 78.0* 75.0 64.3 72.8 64.3 64.3 64.3 64.3
Mnist 97.9 96.9 92.0 97.3 98.1* 96.7 92.0 97.2 94.7 94.3 93.3 96.2
Rank 2.8* 5.6 17.8 4.3 4.1 5.8 17.7 9.5 11.9 11.6 17.3 13.0
Data set mog ksvm kpca pca lda nda apac hlda lpp rda sda
ETH-80 69.2 81.8 65.3 60.1 65.3 61.8 68.4 68.4 62.1 71.6 67.8
AR database 75.5 86.7 42.1 66.7 79.3 69.7 67.2 70.1 44.2 78.6 79.3
SPDM 73.4 84.7* 66.7 76.5 50.3 49.0 82.1 69.3 45.5 59.5 69.0
Monk1 80.3 83.6 88.4* 67.8 65.6 66.4 67.8 68.5 44.9 72.0 66.7
Monk2 75.9 82.6 50.0 67.1 67.1 67.5 65.6 67.1 67.1 60.0 67.1
Monk3 89.4 93.5 94.4 81.3 63.9 83.3 80.6 81.9 78.5 86.3 84.7
Ionosphere 82.1 96.0 82.1 84.8 84.8 88.1 93.4 93.4 82.1 82.8 90.1
Pima 75.0 79.2 64.3 68.6 64.9 76.8 77.4 76.2 76.2 66.7 64.9
Mnist 88.6 97.6 81.0 82.2 86.9 85.9 83.1 85.4 80.1 87.0 88.2
Rank 11.5 3.3 16.1 16.1 15.8 14.6 13.6 12.5 19.0 13.8 12.9

Table 2.5: Training time (in seconds)


Data set ksdaH ksdaCV kdaH kdaCV kndaH kndaCV ksvm
ETH-80 7.3x10^4 3.6x10^5 1.8x10^3 9.0x10^4 7.9x10^4 8.5x10^5 1.8x10^4
AR database 4.2x10^4 3.5x10^5 3.1x10^3 9.0x10^4 1.5x10^4 1.7x10^5 1.2x10^4
SPDM 1.8x10^4 6.5x10^4 1.8x10^2 4.6x10^4 2.1x10^4 1.6x10^5 9.6x10^3
Monk1 4.4 51.3 0.7 6.8 26.4 504.8 3.7
Monk2 4.6 88.1 1.2 11.5 41.3 978.1 17.8
Monk3 3.2 50.7 0.7 6.4 23.1 516.0 2.2
Ionosphere 6.6 134.8 1.3 15.7 76.6 1479.5 10.1
Pima 80.2 2521.7 12.1 380.1 374.4 10889.7 150.6
MNIST 3.6x10^5 2.0x10^6 1.9x10^5 1.1x10^6 3.2x10^5 4.6x10^6 4.5x10^5

2.5.2 KBA criterion

We now present results using the KBA criterion. We use this criterion in KDA and

KSDA. We use the notation KDAK and KSDAK to indicate that the KBA criterion

was used to optimize the parameters.

Table 2.6: Recognition rates (%) with nearest neighbor. Bold numbers specify the top
recognition rate obtained with the three criteria in KSDA and KDA. An asterisk indicates
statistical significance of the highest recognition rate.

Data set ksdaK ksdaF ksdaCV kdaK kdaF kdaCV kpca


ETH-80 84.6* 73.6 76.8 84.6* 81.0 71.6 62.2
AR database 88.2* 78.3 84.2 86.1 87.5 84.2 42.5
SPDM 84.3* 80.1 83.7 84.3* 84.2 83.3 75.0
Monk1 88.0 84.5 87.5 87.3 89.6* 83.1 90.3*
Monk2 82.9* 83.1* 75.7 82.9* 75.2 70.1 68.3
Monk3 94.2* 87.7 89.8 92.6 88.0 82.4 87.8
Ionosphere 93.0 84.8 94.0* 89.1 86.5 80.8 89.4
Pima 73.2 73.8 76.8* 76.2* 69.8 72.6 56.0
Data set pca lda nda apac hlda rda sda
ETH-80 64.3 64.3 59.8 73.6 56.5 71.6 70.6
AR database 58.6 77.7 77.0 59.1 67.5 78.6 77.7
SPDM 81.5 66.5 48.8 81.1 65.3 59.5 66.1
Monk1 81.3 69.0 68.3 81.0 84.2 72.0 75.7
Monk2 66.7 67.4 82.6 79.6 83.6 60.0 67.4
Monk3 87.3 70.6 83.6 88.4 84.5 86.3 85.9
Ionosphere 92.1 74.8 88.8 92.1 88.7 82.8 93.4*
Pima 64.3 57.7 69.1 62.5 68.5 66.7 57.7

The linear and nonlinear feature extraction methods described earlier are used

to find an appropriate low-dimensional representation of the data. Here, we use

the classical RBF kernel defined earlier. In this low-dimensional space, we provide

classification results obtained with three methods: the classical nearest neighbor

Table 2.7: Recognition rates (%) with the classification method of [75].
Data set ksdaK ksdaF ksdaCV kdaK kdaF kdaCV kpca
ETH-80 84.6* 73.9 76.4 84.6* 82.8 72.9 60.3
AR database 89.6* 78.5 85.1 87.5 86.7 85.1 49.5
SPDM 84.9* 75.3 83.9 84.9* 83.4 82.6 75.0
Monk1 88.0* 76.6 82.9 87.3 87.7 88.7* 77.3
Monk2 82.9* 77.5 75.7 82.9* 82.9* 78.5 58.6
Monk3 90.5 83.3 86.3 92.6 92.4 91.2 91.2
Ionosphere 92.8* 84.8 86.1 89.1 86.8 86.8 82.1
Pima 78.6* 76.8 76.2 76.2 73.0 69.0 60.7
Data set pca lda nda apac hlda rda sda
ETH-80 67.1 64.3 63.5 71.2 59.1 71.6 72.3
AR database 44.5 70.9 77.3 60.2 67.5 78.6 70.9
SPDM 77.0 56.2 50.2 81.2 53.4 59.5 69.5
Monk1 78.2 67.4 77.8 69.4 71.5 72.0 79.2
Monk2 56.7 70.6 70.6 70.4 58.3 60.0 70.6
Monk3 89.7 70.8 91.9 89.6 93.8* 86.3 90.5
Ionosphere 82.1 74.8 83.4 91.1 94.0* 82.8 89.4
Pima 70.2 57.7 70.2 63.8 72.6 66.7 57.7

(NN) classifier, the extension of K-NN defined in [75], and a linear Support Vector

Machine (SVM). The recognition results are shown in Tables 2.6-2.8.

From these results, it is clear that, on average, the derived KBA criterion achieves

higher classification rates than the Fisher criterion and CV. As expected, KSDA

generally yields superior results to KDA. This is due to the added flexibility in

modeling the underlying class distributions in the kernel space provided by KSDA. To

illustrate the effectiveness of the proposed criterion in KSDA, we show the smooth-

ness of the function optimized by the criterion in Fig. 2.7 for four of the data-sets.

Note how these functions can be readily optimized using gradient ascent. It is also

interesting to note that the optimal value of σ remains relatively constant for different

Table 2.8: Recognition rates (%) with linear SVM.
Data set ksdaK ksdaF ksdaCV kdaK kdaF kdaCV kpca
ETH-80 84.2* 73.6 77.4 84.2* 82.2 71.3 65.3
AR database 86.7* 79.6 83.1 85.3 86.7* 83.1 42.1
SPDM 84.3* 84.6* 82.3 84.3* 83.6 82.6 66.7
Monk1 87.3 88.2 86.1 87.3 89.7* 86.1 88.4*
Monk2 82.9* 81.5 73.8 82.9* 75.2 75.1 50.0
Monk3 93.5 91.9 94.4* 91.9 89.1 81.5 94.4
Ionosphere 92.6 86.1 96.7* 89.1 86.1 82.1 82.1
Pima 79.8* 78.6 79.8* 77.4 75.0 72.8 64.3
Data set pca lda nda apac hlda rda sda
ETH-80 60.1 65.3 61.8 68.4 68.4 71.6 67.8
AR database 66.7 79.3 69.7 67.2 70.1 78.6 79.3
SPDM 76.5 50.3 49.0 82.1 69.3 59.5 69.0
Monk1 67.8 65.6 66.4 67.8 68.5 72.0 66.7
Monk2 67.1 67.1 67.5 65.6 67.1 60.0 67.1
Monk3 81.3 63.9 83.3 80.6 81.9 86.3 84.7
Ionosphere 84.8 84.8 88.1 93.4 93.4 82.8 90.1
Pima 68.6 64.9 76.8 77.4 76.2 66.7 64.9

values of H_i. This smoothness in the change of the criterion is what allows us to find

the global optimum efficiently.

2.6 Conclusions

In this chapter, we have proposed two approaches to do kernel learning in dis-

criminant analysis. The first approach optimizes the parameters of a kernel whose

function is to map the original class distributions to a space where these are optimally

(w.r.t. Bayes) separated with a hyperplane. We have achieved this by selecting the

kernel parameters that make the class Normal distributions most homoscedastic while

maximizing class separability. Experimental results on a large variety of datasets have

demonstrated that this approach achieves higher recognition rates than most other

Figure 2.7: Plots of the value of the derived criterion as a function of the kernel parameter
σ and the number of subclasses H_i. From left to right and top to bottom: AR, ETH-80,
Monk 1, and Ionosphere databases.
methods defined to date. We have also shown that adding the subclass divisions to

the optimization process (KSDA) allows the DA algorithm to achieve better gener-

alizations. And, we have formally defined the relationship between KDA and other

variants of DA, such as weighted DA, NDA and SDA.

The second approach we have defined is directly related to the Bayes error. We first

derive a function which computes the Bayes accuracy, defined as one minus the Bayes

error, in the kernel space. Thus, the goal is to find that kernel representation where

the highest classification accuracy is achieved. Extensive experimental results on a

number of databases show that the derived approach yields superior classification

results to those given by existing algorithms. Moreover, we have demonstrated that,

when used in KSDA, the proposed criterion can accurately recover the underlying

structure of the class distributions.

CHAPTER 3

MODEL SELECTION IN KERNEL METHODS IN


REGRESSION

3.1 Introduction

Regression analysis has been a very active topic in machine learning and pattern

recognition, with applications in many problems in science and engineering. In a

standard regression problem, a linear or nonlinear model is estimated from the data

such that the functional relationship between the dependent variables and the inde-

pendent variables can be established. Of late, regression with kernel methods [96, 85]

has become popular. The success of the kernel methods in regression comes from the

fact that they facilitate the estimation of nonlinear functions using well-defined and

well-tested approaches in, for example, computer vision [99], signal processing [94], and

bioinformatics [76].

In kernel-based regression, the goal is to find a kernel mapping that converts the

original nonlinear problem (defined in the original space) into a linear one (in the

kernel space) [84]. In practice, this mapping is done using a pre-determined nonlinear

function. Given this function, the main challenge is to find those parameters of the

function that convert a nonlinear problem into a linear one. Thus, the selection of

these kernel parameters is a type of model selection. This is the problem we consider

in this chapter: defining a criterion for the selection of the appropriate parameters

of this kernel function.

The selection of the appropriate parameters of a kernel is a challenging problem [113].

If the parameters were chosen to minimize the model fit, we would generally have an

over-fitting to the training data. As a consequence, the regressed function would not

be able to estimate the testing data correctly. A classical solution is to find a good

fit, while keeping the complexity of the function low, e.g., using a polynomial of lower

order [42]. However, if the parameters are selected to keep the complexity too low,

then we will under-fit the data. In both these cases, the regressed function will have

a poor generalization, i.e., a high prediction error to the testing data. In general, the

kernel parameters should be selected to achieve an appropriate trade-off between the

model fit and model complexity.

As in KDA (Chapter 2), the most widely employed technique to do selection

of the kernel parameters is k-fold cross-validation (CV) [88]. In this approach, the

performance of the prediction models is evaluated by setting aside a validation set

within the training set. The model which produces the smallest validation error is

selected. Unfortunately, this method has three known major drawbacks. First, it is

computationally expensive. Second, only part of the training data is used to estimate

the model parameters. When doing model selection, one wants to employ the largest

possible number of training samples, since this is known to yield better generalizations

[63]. Third, the value of k as a parameter plays a major role in the process. Note that

the value of k affects the trade-off between the fitting error and the model complexity,

yet general methods for selecting an appropriate value do not exist.

An alternative to CV is Generalized CV (GCV) [37, 96], an efficient approximation

to the leave-one-out CV. GCV has been efficiently applied to some model selection

problems [42, 116]. However, since it approximates the leave-one-out CV, the es-

timated result generally has a large variance, i.e., the regressed function is highly

variable and dependent on the training data.

Because a single kernel may not be sufficient to describe the data, multiple kernel

learning (MKL) [51, 87] has recently attracted much attention as a potential alterna-

tive. In [76], MKL is applied to Support Vector Regression (SVR). The coefficients

that determine the combination of kernels are learned using a constrained quadratic

programming problem. This method was shown to outperform CV in some applica-

tions. Unfortunately, the selection of the kernel functions and associated parameters

remains an open problem. In another approach, the regression problem is first refor-

mulated as a probabilistic model using Bayesian inference; then, the kernel parameters

are selected by maximizing the marginal data likelihood. This approach has been used

to define the well-known Gaussian processes for regression [104]. It has been shown

[80] that the marginal likelihood has the nice property of automatically incorporat-

ing a trade-off between model fit and model complexity. However, since the Bayesian

learning generally leads to analytically intractable posteriors, approximations are nec-

essary and the resulting methods are generally computationally expensive. Furthermore, the

determination of the priors for the parameters is an intrinsic problem in Bayesian

learning with no clear solution.

In this chapter, we resolve the kernel optimization problem using a completely

novel approach. In our proposed approach, the two measures of model fit and

model complexity are simultaneously minimized using a multiobjective optimization

(MOP) framework through the study of Pareto-optimal solutions. MOP and Pareto-

optimality are specifically defined to find the global minima of several combined cri-

teria. To this end, we will first derive a new criterion for model complexity which can

be employed in kernel methods in regression. We then define a method using MOP

and derive a new approach called the modified ε-constraint method. We show that this newly

derived approach achieves the lowest mean square error. We provide extensive com-

parisons with the state of the art in kernel methods for regression and on approaches

for model selection. The results show that the proposed framework generally leads

to better generalizations for the (unseen) testing samples.

The remainder of this chapter is organized as follows. In Section 3.2 we derive

the two new measures of model fitness and model complexity. Then, in Section 3.3,

we derive a new MOP approach to do model selection. In Section 3.4, the proposed

framework is applied to two typical kernel methods in regression. Experimental results

are provided in Section 3.5. We conclude in Section 3.6.

3.2 Regression Models

We start with an analysis of the generalization error of a regression model. Given

a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}^q$, with the training samples

$(x_i, y_i)$, $i = 1, \ldots, n$, generated from a joint distribution $g(x, y)$, one wants to find

the regression model $f(x)$ that minimizes the generalization error

$$E = \int L(y, f(x))\, g(x, y)\, dx\, dy, \qquad (3.1)$$

where $f(x) = (f_1(x), \ldots, f_q(x))^T$ is the regression function, $f_i(\cdot) : \mathbb{R}^p \to \mathbb{R}$ is

the $i$th regression function, and $L(y, f(x))$ is a given loss function, for instance, the

quadratic loss $L(y, f(x)) = \frac{1}{2}\|y - f(x)\|_2^2 = \frac{1}{2}\sum_{i=1}^{q}(y_i - f_i(x))^2$.
3.2.1 Generalization error

Holmstrom and Koistinen [44] show that by adding noise to the training samples

(both x and y), the estimation of the generalization error is asymptotically consistent,

i.e., as the number of training examples approaches infinity, the estimated general-

ization error is equivalent to the true one. The addition of noise can be interpreted

as generating additional training samples.

For convenience, denote the training set of $n$ pairs of observation and prediction

vectors by $z_i = (x_i, y_i)$, $i = 1, \ldots, n$, $z_i \in \mathbb{R}^m$, $m = p + q$. Then, the generalization

error can be rewritten as

$$E = \int L(z)\, g(z)\, dz. \qquad (3.2)$$

Assume that the training samples $z_i$ are corrupted by the noise $\epsilon$, and suppose the

distribution of $\epsilon$ is $\pi(\epsilon)$. The noise distribution is generally chosen to have zero mean

and to be uncorrelated, i.e.,

$$\int \epsilon_i\, \pi(\epsilon)\, d\epsilon = 0, \qquad (3.3)$$

$$\int \epsilon_i \epsilon_j\, \pi(\epsilon)\, d\epsilon = \lambda\, \delta_{ij}, \qquad (3.4)$$

where $\lambda$ is the variance of the noise distribution, and $\delta_{ij}$ is the delta function, with

$\delta_{ij} = 1$ when $i = j$ and $\delta_{ij} = 0$ otherwise.

We consider the following steps for generating new training samples by introducing

additive noise:

1) Randomly select a sample $z_i$ from the training set.

2) Draw a sample noise vector $\epsilon_i$ from $\pi(\epsilon)$.

3) Set $z = z_i + \epsilon_i$.
Thus, the distribution of a particular sample $z$ generated from the training sample

$z_i$ is given by $\pi(\epsilon_i) = \pi(z - z_i)$. Then, the distribution of $z$ generated from the entire

training set is

$$g(z) = \frac{1}{n}\sum_{i=1}^{n} \pi(z - z_i). \qquad (3.5)$$

The above result can be viewed as a kernel density estimator of the true distri-

bution of the data $g(z)$ [44]. The distribution of the noise $\pi(\cdot)$ is the kernel function

used in the estimator.

Substituting (3.5) into (3.2), we have

$$E = \int L(z)\, g(z)\, dz = \frac{1}{n}\sum_{i=1}^{n} \int L(z)\, \pi(z - z_i)\, dz. \qquad (3.6)$$

Let $z - z_i = \epsilon_i$; then (3.6) is reformulated as

$$E = \frac{1}{n}\sum_{i=1}^{n} \int L(z_i + \epsilon_i)\, \pi(\epsilon_i)\, d\epsilon_i. \qquad (3.7)$$

We expand $L(z + \epsilon)$ as a Taylor series in powers of $\epsilon$, i.e.,

$$L(z + \epsilon) = L(z) + \sum_{i=1}^{m} \epsilon_i \frac{\partial L(z)}{\partial z_i} + \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \epsilon_i \epsilon_j \frac{\partial^2 L(z)}{\partial z_i \partial z_j} + O(\epsilon^3). \qquad (3.8)$$

Assuming that the noise amplitude is small, the higher order term $O(\epsilon^3)$ can be

neglected. Combining (3.8) with (3.7), (3.3) and (3.4), we obtain

$$E = \frac{1}{n}\sum_{i=1}^{n}\left[ L(z_i) + \frac{\lambda}{2}\sum_{j=1}^{m} \frac{\partial^2 L(z_i)}{\partial z_j^2} \right]
  = \frac{1}{n}\sum_{i=1}^{n} L(z_i) + \frac{\lambda}{2n}\sum_{i=1}^{n}\sum_{j=1}^{m} \frac{\partial^2 L(z_i)}{\partial z_j^2}. \qquad (3.9)$$
Let $L(z)$ be the quadratic loss, i.e., $L(z) = \frac{1}{2}\sum_{i=1}^{q}(y_i - f_i(x))^2$. Then,

$$\sum_{j=1}^{m} \frac{\partial^2 L(z_i)}{\partial z_j^2}
 = \frac{1}{2}\sum_{k=1}^{q}\left[ \sum_{j=1}^{p} \frac{\partial^2 (y_{ik} - f_k(x_i))^2}{\partial x_{ij}^2} + \sum_{j=1}^{q} \frac{\partial^2 (y_{ik} - f_k(x_i))^2}{\partial y_{ij}^2} \right]$$
$$ = \sum_{k=1}^{q}\sum_{j=1}^{p}\left[ \left(\frac{\partial f_k(x_i)}{\partial x_{ij}}\right)^2 + (f_k(x_i) - y_{ik})\,\frac{\partial^2 f_k(x_i)}{\partial x_{ij}^2} \right] + q, \qquad (3.10)$$

where $y_{ij}$ is the $j$th entry of vector $y_i$ and $x_{ij}$ is the $j$th entry of vector $x_i$. Substituting

(3.10) into (3.9), we have

$$E = E_f + \lambda\, E_c, \qquad (3.11)$$

with

$$E_f = \frac{1}{2n}\sum_{i=1}^{n} \|y_i - f(x_i)\|_2^2 \qquad (3.12)$$

and

$$E_c = \frac{1}{2n}\sum_{i=1}^{n}\sum_{k=1}^{q}\sum_{j=1}^{p}\left[ \left(\frac{\partial f_k(x_i)}{\partial x_{ij}}\right)^2 + (f_k(x_i) - y_{ik})\,\frac{\partial^2 f_k(x_i)}{\partial x_{ij}^2} \right] + \frac{q}{2}. \qquad (3.13)$$

Therefore, the generalization error consists of two terms. The first term Ef mea-

sures the discrepancy between the training data and the estimated model, i.e., the

model fit. The second term Ec measures the roughness of the estimated function pro-

vided by the first and second derivatives of the function, i.e., the model complexity. It

controls the smoothness of the function to prevent it from overfitting. The parameter

λ controls the trade-off between the model fit and model complexity.

In order to minimize the generalization error E, we need to minimize both Ef and

Ec . However, due to the bias and variance trade-off [42], a decrease in the model fit

may result in an increase in the model complexity and vice-versa. The regularization

parameter λ may achieve a balance between the model fit and complexity to some

extent; however, there are two limitations when selecting λ to do model selection. First,

a good λ should be chosen beforehand. A common way is to use cross-validation, but

this suffers from several drawbacks as we discussed earlier. Second, note that our goal

is to simultaneously minimize model fit and model complexity. An ideal solution is

that we cannot further decrease one without increasing the other. This means that

even when the appropriate λ is selected, minimizing E is not directly related to our

goal. To solve these problems, we derive a multiobjective optimization approach in

Section 3.3. We first derive the kernel models for model fit Ef and model complexity

Ec .

3.2.2 Model fit

We start by considering the standard linear regression model, $f(x) = W^T x$, where

$W = (w_1, \ldots, w_q)$ is a $p \times q$ weight matrix, with $w_i \in \mathbb{R}^p$. And, we assume all the

vectors are standardized.

We can rewrite the above model as $f_i(x) = w_i^T x$, $i = 1, \ldots, q$. In kernel methods

for regression, each sample $x$ is mapped to $\phi(x)$ in a reproducing kernel Hilbert

space as $\phi(\cdot) : \mathbb{R}^p \to \mathcal{F}$. With this, we can write $f_i(x) = w_i^T \phi(x)$, $i = 1, \ldots, q$.

The Representer Theorem [96] enables us to use $w_i = \Phi(X)\alpha_i$, where $\Phi(X) =$

$(\phi(x_1), \ldots, \phi(x_n))$ and $\alpha_i$ is an $n \times 1$ coefficient vector. Putting everything together,

we get

$$f_i(x) = \alpha_i^T \Phi(X)^T \phi(x) = \alpha_i^T k(x) = \sum_{j=1}^{n} \alpha_{ij} \langle x_j, x \rangle, \quad i = 1, \ldots, q, \qquad (3.14)$$

where $\alpha_{ij}$ is the $j$th element in $\alpha_i$, and $\langle x_j, x \rangle$ is a kernel function on $x_j$ and $x$.

Using the results derived thus far, we can write $E_f$ as

$$E_f = \sum_{i=1}^{q} (y_i - K\alpha_i)^T (y_i - K\alpha_i), \qquad (3.15)$$

where $K = \Phi(X)^T \Phi(X)$ is the $n \times n$ kernel matrix, $y_i = (y_{1i}, \ldots, y_{ni})^T$ is an $n \times 1$

vector, and $y_{ji}$ is the $i$th entry of $y_j$.
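For concreteness, (3.14) is nothing more than a kernel-weighted sum over the training samples. A minimal NumPy sketch (with an RBF kernel; the function and variable names are illustrative, not part of the dissertation's code) is:

import numpy as np

def predict(X_train, alpha, X_new, sigma):
    """Evaluate f_i(x) = sum_j alpha_{ij} k(x_j, x) of (3.14); alpha has shape (q, n)."""
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    K_new = np.exp(-d2 / (2 * sigma ** 2))     # n_new x n matrix of k(x_j, x)
    return K_new @ alpha.T                     # n_new x q predictions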

3.2.3 Roughness penalty in RBF

We now derive solutions of Ec for two of the most used kernel functions, the Radial

Basis Function (RBF) and the polynomial kernels.


 
The RBF kernel is given by $\langle x_i, x_j \rangle = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, where $\sigma$ is the kernel

parameter. Since $f_l(x) = \sum_{m=1}^{n} \alpha_{lm} \langle x_m, x \rangle$, the partial derivatives are given by

$$\frac{\partial f_l(x_i)}{\partial x_{ij}} = \frac{\partial \sum_{m=1}^{n} \alpha_{lm} \langle x_m, x_i \rangle}{\partial x_{ij}}
 = \frac{1}{\sigma^2}\sum_{m=1}^{n} \alpha_{lm} \exp\left(-\frac{\|x_m - x_i\|^2}{2\sigma^2}\right)(x_{mj} - x_{ij}).$$

Writing this result in matrix form,

$$\sum_{j=1}^{p} \left(\frac{\partial f_l(x_i)}{\partial x_{ij}}\right)^2 = \alpha_l^T R_i \alpha_l,$$

where $R_i = \frac{1}{\sigma^4} W_i W_i^T$, and $W_i$ is an $n \times p$ matrix whose $j$th column is equal to

$\left( \exp\left(-\frac{\|x_1 - x_i\|^2}{2\sigma^2}\right)(x_{1j} - x_{ij}), \ldots, \exp\left(-\frac{\|x_n - x_i\|^2}{2\sigma^2}\right)(x_{nj} - x_{ij}) \right)^T$.

And, the second partial derivatives are given by

$$\frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2}
 = \frac{1}{\sigma^2}\sum_{m=1}^{n} \alpha_{lm} \exp\left(-\frac{\|x_m - x_i\|^2}{2\sigma^2}\right)\left(\frac{(x_{mj} - x_{ij})^2}{\sigma^2} - 1\right) = \alpha_l^T p_{ij},$$

where $p_{ij}$ is an $n \times 1$ vector whose $m$th entry is $\frac{1}{\sigma^2}\exp\left(-\frac{\|x_m - x_i\|^2}{2\sigma^2}\right)\left(\frac{(x_{mj} - x_{ij})^2}{\sigma^2} - 1\right)$.

Then $\sum_{j=1}^{p} \frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2} = \alpha_l^T p_i$, where $p_i = \sum_{j=1}^{p} p_{ij}$.

Thus,

$$\sum_{j=1}^{p} (f_l(x_i) - y_{il})\,\frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2} = (\alpha_l^T k_i - y_{il})\,\alpha_l^T p_i = \alpha_l^T (k_i p_i^T)\alpha_l - y_{il}\, p_i^T \alpha_l,$$

where $k_i = (\langle x_1, x_i \rangle, \ldots, \langle x_n, x_i \rangle)^T$.

Using the above results, we can define the roughness penalty function in the RBF

kernel space as

$$E_c = \sum_{l=1}^{q} \left( \alpha_l^T M \alpha_l - q_l^T \alpha_l \right) + \frac{q}{2}, \qquad (3.16)$$

where $M = \frac{1}{2n}\sum_{i=1}^{n} (R_i + k_i p_i^T)$, and $q_l = \frac{1}{2n}\sum_{i=1}^{n} y_{il}\, p_i$.
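The quantities in (3.15) and (3.16) are straightforward to evaluate numerically. The following NumPy sketch computes E_f and E_c for a given coefficient matrix, following the reconstructed expressions above (the function names are ours, and the additive constant of (3.16) is omitted since it does not affect the optimization):

import numpy as np

def fit_and_complexity(X, Y, alpha, sigma):
    """E_f of (3.15) and E_c of (3.16) for an RBF expansion; X is n x p, Y is n x q, alpha is q x n."""
    n, p = X.shape
    q = Y.shape[1]
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D / (2 * sigma ** 2))
    Ef = sum((Y[:, l] - K @ alpha[l]) @ (Y[:, l] - K @ alpha[l]) for l in range(q))
    M = np.zeros((n, n))
    P = np.zeros((n, n))                                           # row i holds p_i
    for i in range(n):
        E = np.exp(-((X - X[i]) ** 2).sum(1) / (2 * sigma ** 2))   # k(x_m, x_i), m = 1..n
        W = E[:, None] * (X - X[i])                                # W_i, n x p
        R = W @ W.T / sigma ** 4                                   # R_i
        p_i = (E[:, None] * ((X - X[i]) ** 2 / sigma ** 2 - 1)).sum(1) / sigma ** 2
        P[i] = p_i
        M += R + np.outer(K[:, i], p_i)                            # R_i + k_i p_i^T
    M /= 2 * n
    Ec = 0.0
    for l in range(q):
        q_l = (Y[:, l][:, None] * P).sum(0) / (2 * n)
        Ec += alpha[l] @ M @ alpha[l] - q_l @ alpha[l]
    return Ef, Ec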

Figure 3.1: The two plots in this figure show the contradiction between the RSS and the
curvature measure with respect to: (a) the kernel parameter σ, and (b) the regularization
parameter τ in Kernel Ridge Regression. The Boston Housing data-set [7] is used in this
example. Note that in both cases, while one criterion increases, the other decreases. Thus,
a compromise between the two criteria ought to be determined.

3.2.4 Polynomial kernel

A polynomial kernel of degree $d$ is given by $\langle x_i, x_j \rangle = (x_i^T x_j + 1)^d$. Its partial

derivatives are

$$\frac{\partial f_l(x_i)}{\partial x_{ij}} = \frac{\partial \sum_{m=1}^{n} \alpha_{lm} (x_m^T x_i + 1)^d}{\partial x_{ij}}
 = \sum_{m=1, m \neq i}^{n} \alpha_{lm}\, d\, (x_m^T x_i + 1)^{d-1} x_{mj} + 2\alpha_{li}\, d\, (x_i^T x_i + 1)^{d-1} x_{ij}.$$

We can write the above result in matrix form as

$$\sum_{j=1}^{p} \left(\frac{\partial f_l(x_i)}{\partial x_{ij}}\right)^2 = \alpha_l^T B_i \alpha_l,$$

where $B_i = d^2 C_i C_i^T$, and $C_i$ is an $n \times p$ matrix whose $j$th column is equal to

$\left( (x_1^T x_i + 1)^{d-1} x_{1j}, \ldots, 2(x_i^T x_i + 1)^{d-1} x_{ij}, \ldots, (x_n^T x_i + 1)^{d-1} x_{nj} \right)^T$.

The second partial derivatives are

$$\frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2}
 = d\,\alpha_{li}\,(x_i^T x_i + 1)^{d-2}\left[ 3(d-1)x_{ij}^2 + 2(x_i^T x_i + 1) \right]
 + d(d-1)\sum_{m=1, m \neq i}^{n} \alpha_{lm}\,(x_m^T x_i + 1)^{d-2} x_{mj}^2 = \alpha_l^T g_{ij},$$

where $g_{ij}$ is an $n \times 1$ vector whose $m$th ($m \neq i$) entry is $d(d-1)(x_m^T x_i + 1)^{d-2} x_{mj}^2$

and whose $i$th entry is $d(x_i^T x_i + 1)^{d-2}\left[ 3(d-1)x_{ij}^2 + 2(x_i^T x_i + 1) \right]$. Then,

$\sum_{j=1}^{p} \frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2} = \alpha_l^T g_i$, where $g_i = \sum_{j=1}^{p} g_{ij}$.

Thus,

$$\sum_{j=1}^{p} (f_l(x_i) - y_{il})\,\frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2} = (\alpha_l^T k_i - y_{il})\,\alpha_l^T g_i = \alpha_l^T (k_i g_i^T)\alpha_l - y_{il}\, g_i^T \alpha_l.$$

Using the derivations above, the roughness function for the polynomial kernel can be

written as

$$E_c = \sum_{l=1}^{q} \left( \alpha_l^T N \alpha_l - u_l^T \alpha_l \right) + \frac{q}{2}, \qquad (3.17)$$

where $N = \frac{1}{2n}\sum_{i=1}^{n} (B_i + k_i g_i^T)$, and $u_l = \frac{1}{2n}\sum_{i=1}^{n} y_{il}\, g_i$.

3.2.5 Comparison with other complexity measure

Thus far, we have introduced a new model complexity measure $E_c$, which is related

to the derivatives of the regressed function $f(x)$. A commonly seen alternative in the

literature is to penalize the norm of the regression function instead, the $L_2$ norm in the

reproducing kernel Hilbert space being the most commonly used norm in this approach. This section

provides a theoretical comparison between the approach derived in this chapter and

this classical L2 norm alternative. In particular, we show that the L2 norm does

not penalize the high frequencies of the regression function, whereas the proposed

criterion emphasizes smoothness by penalizing the high frequency components of this

function.

To formally prove the above result, we write the generalized Fourier series of $f(x)$,

$$f(x) = \sum_{k=0}^{\infty} a_k \psi_k(x),$$

where $\{\psi_k(x)\}_{k=0}^{\infty}$ forms a complete orthonormal basis and the $a_k$ are the corresponding

coefficients. A commonly used basis is $\{\sin kx, \cos kx\}_{k=0}^{\infty}$ on

$[-\pi, \pi]$, with $k$ the index of the frequency component. Using this basis set, $f(x)$ can

be written as

$$f(x) = a_0 + \sum_{k=1}^{\infty} (a_k \sin kx + b_k \cos kx), \qquad (3.18)$$

where $a_k$ and $b_k$ are the coefficients of each frequency component.

Let $\|f\|_{\mathcal{H}}$ be the function norm defining the reproducing kernel Hilbert space; then

the $L_2$ norm of $f$ is

$$\|f\|_{\mathcal{H}}^2 = \int |f(x)|^2\, dx
 = \int \left( a_0 + \sum_{k=1}^{\infty} (a_k \sin kx + b_k \cos kx) \right)^2 dx
 = 2\pi a_0^2 + \pi \sum_{k=1}^{\infty} (a_k^2 + b_k^2). \qquad (3.19)$$

Note that in this case all the coefficients are weighted equally, regardless of the frequency

component.
The complexity measure derived in the present chapter and given in (3.13) can be

reformulated as

$$E_c = \int \left[ \left(\frac{\partial f(x)}{\partial x}\right)^2 + (f(x) - y)\,\frac{\partial^2 f(x)}{\partial x^2} \right] dx, \qquad (3.20)$$

where we have neglected the additive constant.

Moreover, recall from (3.11) that the regression function can be expressed

as $f(x) = y + O(\epsilon)$ [6]. Hence, substituting (3.18) into (3.20) yields

$$E_c = \int \left( \sum_{k=1}^{\infty} k\,(a_k \cos kx - b_k \sin kx) \right)^2 dx
 = \pi \sum_{k=1}^{\infty} k^2 (a_k^2 + b_k^2). \qquad (3.21)$$

Compared to the $L_2$ norm result shown in (3.19), the complexity measure (3.21)

of the proposed approach penalizes the higher frequency components of the regressed

function. This is due to the square of the index of the frequency component seen in

(3.21). By emphasizing lower frequencies, the proposed criterion will generally select

smoother functions than those selected by the $L_2$ norm method.

A numerical comparison is provided in Section 3.5. For this, we will need the

explicit equation of the $L_2$ norm of the regression function $f$ in the kernel space. This

is given by

$$\|f\|_{\mathcal{H}}^2 = \sum_{i=1}^{q} \|f_i\|_{\mathcal{H}}^2
 = \sum_{i=1}^{q}\sum_{j=1}^{n}\sum_{k=1}^{n} \alpha_{ij}\alpha_{ik}\, \langle x_j, x_k \rangle
 = \sum_{i=1}^{q} \alpha_i^T K \alpha_i. \qquad (3.22)$$
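The frequency weighting discussed above can also be checked numerically. The following short NumPy sketch compares the two penalties for pure sinusoids of increasing frequency (a toy check, not part of the experiments of Section 3.5):

import numpy as np

# frequency weighting of the two penalties for f(x) = sin(kx) on [-pi, pi]
x = np.linspace(-np.pi, np.pi, 20001)
dx = x[1] - x[0]
for k in (1, 2, 4, 8):
    f = np.sin(k * x)
    l2 = (f ** 2).sum() * dx                       # L2 penalty (3.19): ~pi, independent of k
    curv = (np.gradient(f, dx) ** 2).sum() * dx    # leading term of (3.21): ~pi * k^2
    print(k, round(l2, 2), round(curv, 2))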
3.3 Multiobjective Optimization

The parameters in kernel approaches in regression can now be optimized by simul-

taneously minimizing Ef and Ec of the corresponding fitting function described in

the preceding section. Of course, in general, the global minima of these two functions

are not the same. For instance, a decrease in the fitting error may lead to an increase

in the roughness of the function, and vice-versa. This trade-off is depicted in Figure

3.1. In the plots in this figure, we show the performance of the two criteria with

respect to their corresponding parameters, i.e., the kernel parameter σ and the

regularization parameter τ. As can be observed in the figure, the criteria do not share

a common global minimum. To resolve this problem, we now derive a multiobjective

optimization approach.

3.3.1 Pareto-Optimality

As its name implies, multiobjective optimization (MOP) is concerned with the

simultaneous optimization of more than one objective function. More formally, MOP

can be stated as follows,

$$\begin{aligned} &\text{minimize} && u_1(\theta),\, u_2(\theta),\, \ldots,\, u_k(\theta) \\ &\text{subject to} && \theta \in S, \end{aligned} \qquad (3.23)$$

where we have $k$ objective functions $u_i : \mathbb{R}^p \to \mathbb{R}$, and $S \subset \mathbb{R}^p$ is the set of possible vec-

tors. Denote the vector of objective functions by $z = u(\theta) = (u_1(\theta), u_2(\theta), \ldots, u_k(\theta))^T$,

and the decision vectors as $\theta = (\theta_1, \theta_2, \ldots, \theta_p)^T$.

The goal of MOP is to find the $\theta$ that simultaneously minimizes all $u_j(\cdot)$. If

all functions shared a common minimum, the problem would be trivial. In general,

however, the objective functions contradict one another. This means that minimizing

one function can increase the value of the others. Hence, a compromise solution

is needed to attain a maximal agreement of all the objective functions [66]. The

solutions of the MOP problem are called Pareto-optimal solutions. To provide a

formal definition, let us first state another important concept.

Definition 4. A decision vector $\theta^1$ is said to dominate $\theta^2$ if $u_i(\theta^1) \le u_i(\theta^2)$ for all

$i = 1, \ldots, k$ and $u_j(\theta^1) < u_j(\theta^2)$ for at least one index $j$.

This definition now allows us to give the following formal presentation of Pareto-

optimality.

Definition 5. A decision vector $\theta^* \in S$ is Pareto-optimal if there does not exist

another decision vector $\theta \in S$ for which $u_i(\theta) \le u_i(\theta^*)$ for all $i = 1, \ldots, k$ and

$u_j(\theta) < u_j(\theta^*)$ for at least one index $j$.

In other words, a Pareto-optimal solution is not dominated by any other decision

vector. Similarly, an objective vector $z^* \in Z\,(= u(S))$ is called Pareto-optimal if the

decision vector corresponding to it is Pareto-optimal. We can see that such a vector is

the one where none of the components can be improved without deteriorating one or

more of the others. In most problems, there will be many Pareto-optimal solutions.

This set of Pareto-optimal solutions is called the Pareto-optimal set or Pareto-frontier.
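Pareto-optimality is easy to test on a finite set of candidate objective vectors. A small NumPy sketch (with illustrative names and a toy one-dimensional decision variable) is:

import numpy as np

def pareto_front(U):
    """Boolean mask of Pareto-optimal rows of U (n_points x k objectives, minimization)."""
    n = U.shape[0]
    optimal = np.ones(n, dtype=bool)
    for i in range(n):
        # row i is dominated if some other row is <= in every objective and < in at least one
        dominated = np.all(U <= U[i], axis=1) & np.any(U < U[i], axis=1)
        if dominated.any():
            optimal[i] = False
    return optimal

# toy usage: two objectives evaluated on a grid of candidate parameters
theta = np.linspace(0, 6, 200)
U = np.column_stack([(theta - 1) ** 2, (theta - 5) ** 2])
print(theta[pareto_front(U)])   # roughly the interval [1, 5]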

3.3.2 The ε-constraint approach

One classical method to find the Pareto-optimal solutions is the ε-constraint ap-

proach [39]. In this case, one of the objective functions is optimized while the others

are considered as constraints. This is done by defining constraints as upper-bounds

of their objective functions. Therefore, the problem to be solved can be formulated

Figure 3.2: Here we show a case of two objective functions. u(S) represents the set of all
the objective vectors with the Pareto frontier colored in red. The Pareto-optimal solution
can be determined by minimizing u_1 given that u_2 is upper-bounded by ε_2.

as follows,

$$\begin{aligned} &\arg\min_{\theta}\; u_l(\theta) \\ &\text{subject to}\;\; u_j(\theta) \le \epsilon_j, \;\; \text{for all } j = 1, \ldots, k,\; j \neq l, \\ &\qquad\qquad\; \theta \in S, \end{aligned} \qquad (3.24)$$

where $l \in \{1, \ldots, k\}$.

Figure 3.2 demonstrates the idea behind this approach. In this figure, we show

a bi-objective example, $k = 2$. The Pareto-optimal solution is determined by

minimizing $u_1$ provided that $u_2$ is upper-bounded by $\epsilon_2$.
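As an illustration of (3.24), the following SciPy sketch traces solutions of the ε-constraint scalarization for a toy bi-objective problem; the toy objectives and the SLSQP solver are illustrative choices only, not the problems or solver used in this dissertation.

import numpy as np
from scipy.optimize import minimize

u1 = lambda t: (t[0] - 1.0) ** 2
u2 = lambda t: (t[0] - 5.0) ** 2

def eps_constraint(eps, t0=np.array([3.0])):
    cons = {"type": "ineq", "fun": lambda t: eps - u2(t)}   # u2(theta) <= eps
    return minimize(u1, t0, constraints=[cons], method="SLSQP").x

for eps in (16.0, 9.0, 4.0, 1.0):
    t = eps_constraint(eps)
    print(eps, t.round(3), u1(t).round(3), u2(t).round(3))
# as eps shrinks, the solutions trace different (weakly) Pareto-optimal trade-offs between u1 and u2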

Before exploring the Pareto-optimality of the ε-constraint method, let us look at

a weaker definition of the term.

Definition 6. A decision vector $\theta^* \in S$ is weakly Pareto-optimal if there does not

exist another decision vector $\theta \in S$ such that $u_i(\theta) < u_i(\theta^*)$ for all $i = 1, \ldots, k$.

From the above definition, we can see that the Pareto-optimal set is a subset of

the weakly Pareto-optimal set and that a weakly Pareto-optimal solution may be

dominated by a Pareto-optimal solution.

It has been shown [66] that the solution of the ε-constraint method defined in

(3.24) is weakly Pareto-optimal. This means that the solution to (3.24) cannot be

guaranteed to be Pareto-optimal. Although the solution is determined by the pre-

specified upper-bounds $\epsilon_j$, and some $\epsilon_j$ may lead to Pareto-optimal solutions, in

practice we do not know how to choose the $\epsilon_j$ to achieve the Pareto-optimal solutions.

In the following, we propose a modified version of this method and prove that the

solution to this modified approach is guaranteed to be Pareto-optimal.

3.3.3 The modified ε-constraint

The main idea of our approach is to reformulate the constraints in (3.24) as equal-

ities. This can be achieved if the upper bounds are multiplied on the right by a scalar smaller

than or equal to $s$. Formally, $u_j(\theta) = h_j \epsilon_j$, $h_j \in [0, s]$, for all $j = 1, \ldots, k$, $j \neq l$.

Let $h = (h_1, \ldots, h_{l-1}, h_{l+1}, \ldots, h_k)^T$. Then, the modified ε-constraint method is given

by

$$\begin{aligned} &\arg\min_{\theta, h}\;\; u_l(\theta) + s \sum_{j=1, j \neq l}^{k} h_j \\ &\text{subject to}\;\; u_j(\theta) = h_j \epsilon_j, \;\; \text{for all } j = 1, \ldots, k,\; j \neq l, \\ &\qquad\qquad\; 0 \le h_j \le s, \;\; \text{for all } j = 1, \ldots, k,\; j \neq l, \\ &\qquad\qquad\; \theta \in S, \end{aligned} \qquad (3.25)$$

where $s$ is a positive constant. We can now prove the Pareto-optimality of (3.25).

Theorem 7. Select a small scalar $s$ satisfying $s \sum_{j=1, j \neq l}^{k} h_j^* \le u_l(\theta) - u_l(\theta^*)$, where

$\theta^* \in S$ and $h^*$ are the solutions of (3.25). Then, $\theta^*$ is Pareto-optimal for any given

upper-bound vector $\epsilon = (\epsilon_1, \ldots, \epsilon_{l-1}, \epsilon_{l+1}, \ldots, \epsilon_k)^T$.

Proof. Let $\theta^* \in S$ and $h^*$ be a solution of (3.25). Since $s \sum_{j=1, j \neq l}^{k} h_j^* \le u_l(\theta) - u_l(\theta^*)$,

we have $u_l(\theta^*) \le u_l(\theta)$ for all $\theta \in S$ with $u_j(\theta) = h_j \epsilon_j$, for every $j = 1, \ldots, k$, $j \neq l$.

Let us assume that $\theta^*$ is not Pareto-optimal. In this case, there exists a vector $\theta^o \in S$

such that $u_i(\theta^o) \le u_i(\theta^*)$ for all $i = 1, \ldots, k$ and $u_j(\theta^o) < u_j(\theta^*)$ for at least one index $j$.

If $j = l$, this means that $u_l(\theta^o) < u_l(\theta^*)$. Here we have a contradiction with the

fact that $u_l(\theta^*) \le u_l(\theta)$ for all $\theta \in S$.

If $j \neq l$, then $u_l(\theta^o) \le u_l(\theta^*)$, $u_j(\theta^o) < u_j(\theta^*) = h_j^* \epsilon_j$ and $u_i(\theta^o) \le u_i(\theta^*) = h_i^* \epsilon_i$

for all $i \neq j$ and $i \neq l$. Denote $u_i(\theta^o) = h_i^o \epsilon_i$, for all $i \neq l$. Then, we have $k - 1$

inequalities $h_i^o \epsilon_i \le h_i^* \epsilon_i$ with at least one strict inequality $h_j^o \epsilon_j < h_j^* \epsilon_j$. Canceling out

$\epsilon_i$ in each of the inequalities and taking their sum yields $\sum_{j=1, j \neq l}^{k} h_j^o < \sum_{j=1, j \neq l}^{k} h_j^*$.

This contradicts the fact that the solution to (3.25) minimizes $\sum_{j=1, j \neq l}^{k} h_j$.

We can demonstrate the utility of this modified ε-constraint method with the fol-

lowing two examples. In our first example, the objective functions are given by

$$u_1(x) = \begin{cases} 1 & x \le 1 \\ x^2 & \text{otherwise} \end{cases}$$

and $u_2(x) = (x - 5)^2$. In our second example, the two functions are given by $u_1(x) = 1 - e^{-(x+1)^2}$ and

$$u_2(x) = \begin{cases} 1 - e^{-(x-2)^2} & x \le 0.5 \\ 1 - e^{-2.25} & \text{otherwise.} \end{cases}$$

In both these examples, we compare the performance of the proposed modified ε-

constraint approach and the ε-constraint method. This is illustrated in Figure 3.3. In
these figures, the blue stars denote the objective vectors and the red circles represent

the solution vectors given by each of the two methods. We see that in Figure 3.3a and

3.3c, the original ε-constraint method includes weakly Pareto-optimal solutions,

whereas in Figure 3.3b and 3.3d the proposed modified approach provides the Pareto-

optimal solutions.

Using the solution defined above, we can formulate the parameter optimization

problem as follows,

$$\begin{aligned} &\arg\min_{\theta, h}\;\; E_f(\theta) + s\, h \\ &\text{subject to}\;\; E_c(\theta) = h\, \epsilon, \\ &\qquad\qquad\; 0 \le h \le s. \end{aligned} \qquad (3.26)$$

Note that given different ε's, we may have different Pareto-optimal solutions.

In our parameter optimization problem, we only need one Pareto-optimal solution.

Hence, our next goal is to define a mechanism to determine an appropriate value for

$\epsilon$. To resolve this problem, we select $\epsilon$ such that the corresponding Pareto-optimal

objective vector is as close to the ideal point as possible. Specifically, let $\theta_{\epsilon}^*$ be the

Pareto-optimal solution given $\epsilon$; then the optimal $\epsilon^*$ is

$$\epsilon^* = \arg\min_{\epsilon}\; \left[ w_f \left( E_f(\theta_{\epsilon}^*) - z_f^* \right)^2 + w_c \left( E_c(\theta_{\epsilon}^*) - z_c^* \right)^2 \right], \qquad (3.27)$$

where $z_f^*$ and $z_c^*$ are the ideal values of $E_f(\theta)$ and $E_c(\theta)$, respectively, and $w_f$, $w_c$ are

the weights associated to each of the objective functions. The incorporation of these

weights can drive the optimization to favor one objective function over the other. If

$E_f(\theta_{\epsilon}^*)$ (or $E_c(\theta_{\epsilon}^*)$) is close to its ideal value $z_f^*$ ($z_c^*$), then $w_f$ ($w_c$) should be relatively

small. But if $E_f(\theta_{\epsilon}^*)$ ($E_c(\theta_{\epsilon}^*)$) is far apart from its ideal value $z_f^*$ ($z_c^*$), then $w_f$ ($w_c$)
Figure 3.3: Comparison between the proposed modified and the original ε-constraint meth-
ods. We use * to indicate the objective vectors and o to specify the solution vectors.
Solutions given by (a) the ε-constraint method and (b) the proposed modified ε-constraint
approach on the first example, and (c) the ε-constraint method and (d) the modified ε-
constraint approach on the second example. Note that the proposed approach identifies the
Pareto-frontier, while the original algorithm identifies weakly Pareto-optimal solutions, since its
solution vectors go beyond the Pareto-frontier.
Algorithm 3.1 Modified ε-constraint algorithm
Input: Training set {(x_1, y_1), ..., (x_n, y_n)}, θ_0, h_0, ε_0, s.
1. Calculate the ideal point (z_f^*, z_c^*).
2. Specify the weights w_f and w_c using (3.28).
3. Obtain ε^* using (3.27).
4. Obtain θ^* using (3.26).
Return: The optimal model parameter θ^*.

should be large. This can be formally stated as follows,

$$w_f = |E_f(\theta_0) - z_f^*|^2, \qquad w_c = |E_c(\theta_0) - z_c^*|^2, \qquad (3.28)$$

where $\theta_0$ is the initialization for $\theta$. The proposed modified ε-constraint approach is

summarized in Algorithm 3.1.
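A compact sketch of steps 1-3 of Algorithm 3.1 is given below. Here solve_pareto stands for whatever routine solves (3.26) for a given ε (an interior-point method is used in the experiments), and the function names, bounds handling, and the grid over ε are illustrative choices rather than the dissertation's implementation.

import numpy as np
from scipy.optimize import minimize

def select_epsilon(Ef, Ec, solve_pareto, theta0, bounds, eps_grid):
    """Steps 1-3 of Algorithm 3.1: ideal point, weights (3.28), and the choice of eps via (3.27)."""
    zf = minimize(Ef, theta0, bounds=bounds).fun          # ideal value of the fit term
    zc = minimize(Ec, theta0, bounds=bounds).fun          # ideal value of the complexity term
    wf = (Ef(theta0) - zf) ** 2                           # weights of eq. (3.28)
    wc = (Ec(theta0) - zc) ** 2
    def distance(eps):                                    # objective of eq. (3.27)
        theta = solve_pareto(eps)
        return wf * (Ef(theta) - zf) ** 2 + wc * (Ec(theta) - zc) ** 2
    best_eps = min(eps_grid, key=distance)
    return best_eps, solve_pareto(best_eps)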

3.3.4 Alternative Optimization Approaches

Thus far, we have derived a MOP approach for model selection based on Pareto-

optimality. The most pressing question is to show that this derived solution

yields lower prediction errors than simpler, more straightforward approaches. Two

such criteria are the sum and the product of the two terms to be minimized [113], given

by

$$Q_{sum}(\theta) = E_f(\theta) + \gamma\, E_c(\theta) \qquad (3.29)$$

and

$$Q_{pro}(\theta) = E_f(\theta)\, E_c(\theta)^{\eta}, \qquad (3.30)$$

where $\gamma$ and $\eta$ are regularization parameters that need to be selected. Note that mini-

mizing (3.30) is equivalent to minimizing

$$\lg Q_{pro}(\theta) = \lg E_f(\theta) + \eta \lg E_c(\theta), \qquad (3.31)$$

which is the logarithm of (3.30). We could use cross-validation to select the regular-

ization parameters $\gamma$ and $\eta$. Experimental results comparing these two alternative

optimization approaches with the proposed approach will be given in the experiments

section.

3.4 Applications to Regression

Let us derive two kernel-based regression approaches using the kernels and MOP

criteria derived above. In particular, we use our derived results in Kernel Ridge

Regression (KRR) and Kernel Principal Component Regression (KPCR).

3.4.1 Kernel Ridge Regression

Ridge regression (RR) is a penalized version of the ordinary least squares (OLS)

solution. More specifically, RR regularizes the OLS solution with a penalty on the

norm of the weight vector. This regularization is used to avoid overfitting. Formally,

RR is defined as

$$w_i = (XX^T + \tau I_p)^{-1} X y_i, \quad i = 1, \ldots, q, \qquad (3.32)$$

where $X = (x_1, \ldots, x_n)$, $I_p$ is the $p \times p$ identity matrix, $y_i = (y_{1i}, \ldots, y_{ni})^T$, and $\tau$ is

the regularization parameter.

We can now extend the above solution using the kernel trick. The resulting method

is known as Kernel Ridge Regression (KRR), and is given by

$$\alpha_i = (K + \tau I_n)^{-1} y_i, \quad i = 1, \ldots, q, \qquad (3.33)$$

where, as above, K is the kernel matrix.
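For reference, (3.33) can be implemented in a few lines of NumPy; the RBF kernel and the symbol τ follow the presentation above, while the function name and the call signature are illustrative.

import numpy as np

def krr_fit_predict(X, Y, X_test, sigma, tau):
    """Kernel Ridge Regression with an RBF kernel: alpha = (K + tau*I)^(-1) Y, eq. (3.33)."""
    sq = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq(X, X) / (2 * sigma ** 2))
    alpha = np.linalg.solve(K + tau * np.eye(len(X)), Y)     # one column per output dimension
    K_test = np.exp(-sq(X_test, X) / (2 * sigma ** 2))
    return K_test @ alpha                                    # predictions via eq. (3.14)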

In KRR, there are two parameters to optimize: the kernel parameter (e.g., σ in

the RBF kernel) and the regularization parameter τ. In the following, we derive a

gradient descent method to simultaneously optimize the two.

Since both the model fit term $E_f$ and the curvature term $E_c$ are

involved in our parameter optimization problem, we need to derive the gradients of

these terms with respect to their parameters.

We start with the derivations for the RBF kernel. In this case, we have

$$\frac{\partial E_f}{\partial \sigma} = \frac{\partial \sum_{i=1}^{q} (y_i - K\alpha_i)^T(y_i - K\alpha_i)}{\partial \sigma}
 = -2\sum_{i=1}^{q} (y_i - K\alpha_i)^T \left( \frac{\partial K}{\partial \sigma}\alpha_i + K\,\frac{\partial \alpha_i}{\partial \sigma} \right),$$

where $\frac{\partial K}{\partial \sigma} = \frac{1}{\sigma^3}\, K \circ D$, $\circ$ defines the Hadamard product of two matrices of the same

dimensions, i.e., $(A \circ B)_{ij} = A_{ij} B_{ij}$, with $A_{ij}$ denoting the $(i,j)$th entry of matrix $A$,

$D = [\|x_i - x_j\|^2]_{i,j=1,\ldots,n}$ is the matrix of pairwise sample distances, and

$\frac{\partial \alpha_i}{\partial \sigma} = \frac{\partial (K + \tau I_n)^{-1}}{\partial \sigma} y_i = -(K + \tau I_n)^{-1}\frac{\partial K}{\partial \sigma}(K + \tau I_n)^{-1} y_i = -(K + \tau I_n)^{-1}\frac{\partial K}{\partial \sigma}\alpha_i$. And,

$$\frac{\partial E_c}{\partial \sigma} = \sum_{l=1}^{q} \frac{\partial \left( \alpha_l^T M \alpha_l - q_l^T \alpha_l \right)}{\partial \sigma}
 = \frac{1}{2n}\sum_{i=1}^{n}\sum_{l=1}^{q} \left[ \frac{\partial (\alpha_l^T R_i \alpha_l)}{\partial \sigma} + \frac{\partial (k_i^T \alpha_l)}{\partial \sigma}\, p_i^T \alpha_l + (k_i^T \alpha_l - y_{il})\,\frac{\partial (p_i^T \alpha_l)}{\partial \sigma} \right],$$

where

$$\frac{\partial (\alpha_l^T R_i \alpha_l)}{\partial \sigma} = \frac{2\alpha_l^T W_i \left( \frac{\partial W_i^T}{\partial \sigma}\alpha_l + W_i^T \frac{\partial \alpha_l}{\partial \sigma} \right)}{\sigma^4} - \frac{4}{\sigma}\, \alpha_l^T R_i \alpha_l,$$

$\frac{\partial W_i}{\partial \sigma}$ is an $n \times p$ matrix whose $(m,j)$th entry is $\frac{\|x_m - x_i\|^2}{\sigma^3}\exp\left(-\frac{\|x_m - x_i\|^2}{2\sigma^2}\right)(x_{mj} - x_{ij})$, and

$$\frac{\partial (k_i^T \alpha_l)}{\partial \sigma} = \frac{\partial k_i^T}{\partial \sigma}\alpha_l + k_i^T \frac{\partial \alpha_l}{\partial \sigma}, \qquad
\frac{\partial (p_i^T \alpha_l)}{\partial \sigma} = \frac{\partial p_i^T}{\partial \sigma}\alpha_l + p_i^T \frac{\partial \alpha_l}{\partial \sigma},$$

where $\frac{\partial k_i}{\partial \sigma}$ is the $i$th column of $\frac{\partial K}{\partial \sigma}$, $\frac{\partial p_i}{\partial \sigma} = \sum_{j=1}^{p} \frac{\partial p_{ij}}{\partial \sigma}$, and $\frac{\partial p_{ij}}{\partial \sigma}$ is an $n \times 1$ vector whose

$m$th entry is $\frac{1}{\sigma^3}\exp\left(-\frac{\|x_m - x_i\|^2}{2\sigma^2}\right)\left[ \frac{\|x_m - x_i\|^2}{\sigma^2}\left( \frac{(x_{mj} - x_{ij})^2}{\sigma^2} - 1 \right) - \frac{4(x_{mj} - x_{ij})^2}{\sigma^2} + 2 \right]$.

Similarly, deriving with respect to the regularization parameter $\tau$ yields

$$\frac{\partial E_f}{\partial \tau} = -2\sum_{i=1}^{q} (y_i - K\alpha_i)^T K\, \frac{\partial \alpha_i}{\partial \tau},$$

where $\frac{\partial \alpha_i}{\partial \tau} = \frac{\partial (K + \tau I_n)^{-1}}{\partial \tau} y_i = -(K + \tau I_n)^{-1}(K + \tau I_n)^{-1} y_i = -(K + \tau I_n)^{-1}\alpha_i$. And,

$$\frac{\partial E_c}{\partial \tau} = \sum_{l=1}^{q} \frac{\partial \left( \alpha_l^T M \alpha_l - q_l^T \alpha_l \right)}{\partial \tau}
 = \sum_{l=1}^{q} \left( 2\alpha_l^T M\, \frac{\partial \alpha_l}{\partial \tau} - q_l^T \frac{\partial \alpha_l}{\partial \tau} \right).$$

When using the polynomial kernel, we cannot employ a gradient descent technique

for finding the optimal value of $d$, because it is discrete. Thus, we have to try

all possible discrete values of $d$ (within a given range) and select the degree yielding

the smallest error. The derivations of $E_f$ with respect to $\tau$ are the same for any

kernel, and $\frac{\partial E_c}{\partial \tau} = \sum_{l=1}^{q} \left( 2\alpha_l^T N\, \frac{\partial \alpha_l}{\partial \tau} - u_l^T \frac{\partial \alpha_l}{\partial \tau} \right)$.
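The analytic kernel gradient is easy to verify against finite differences. A short NumPy check of $\partial K/\partial\sigma = \frac{1}{\sigma^3} K \circ D$ (toy data and step size chosen for illustration) is:

import numpy as np

def rbf_K(X, sigma):
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-D / (2 * sigma ** 2)), D

rng = np.random.default_rng(0)
X, sigma, h = rng.normal(size=(5, 3)), 1.3, 1e-6
K, D = rbf_K(X, sigma)
dK_analytic = K * D / sigma ** 3
dK_numeric = (rbf_K(X, sigma + h)[0] - rbf_K(X, sigma - h)[0]) / (2 * h)
print(np.max(np.abs(dK_analytic - dK_numeric)))   # agreement to roughly 1e-9 or better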

3.4.2 Kernel Principal Component Regression

Solving an overdetermined set of equations is a general problem in pattern recog-

nition. The problem is well studied when there are no collinearities (i.e., close to

linear relationships among variables), but special algorithms are needed to deal with

them. Principal Component Regression (PCR) is a regression approach designed to

86
deal with collinearities in the exploratory variables. Instead of using the original pre-

dictor variables, a subset of principal components of these are selected. By deleting

the principal components with small variances, a more stable estimate of the coef-

ficient {wi }i=1,...,q can be obtained. In this way, the large variances of {wi }i=1,...,q ,

which were caused by multicollinearities, will be greatly reduced. More formally,

$$w_i = \sum_{j=1}^{m} \frac{1}{l_j}\, a_j a_j^T X y_i, \quad i = 1, \ldots, q, \qquad (3.34)$$

where $a_i$ is the eigenvector of the covariance matrix associated to the $i$th largest

eigenvalue $l_i$.

The above formulation can once again be calculated in the kernel space as

$$\alpha_i = \sum_{j=1}^{m} \frac{1}{l_j}\, v_j v_j^T y_i, \quad i = 1, \ldots, q, \qquad (3.35)$$

where $v_i$ is the eigenvector of the centered kernel matrix $\tilde{K}$ associated to the $i$th largest

eigenvalue $l_i$. This algorithm is known as Kernel Principal Component Regression

(KPCR).
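A NumPy sketch of (3.35), selecting the number of components through the variance fraction r described below, is given here; the function name, the centering step, and the eigenvalue cutoff are illustrative choices.

import numpy as np

def kpcr_fit(K, Y, r=0.95):
    """KPCR coefficients, eq. (3.35), keeping enough eigenpairs to explain a fraction r of the variance."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    K_c = J @ K @ J                                  # centered kernel matrix
    lam, V = np.linalg.eigh(K_c)
    lam, V = lam[::-1], V[:, ::-1]                   # eigenvalues in decreasing order
    keep = lam > 1e-10
    m = np.searchsorted(np.cumsum(lam[keep]) / lam[keep].sum(), r) + 1
    V_m, lam_m = V[:, :m], lam[:m]
    return V_m @ np.diag(1.0 / lam_m) @ V_m.T @ Y    # alpha, one column per output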

In KPCR, we need to optimize two parameters: the kernel parameter $\sigma$ and the

number of eigenvectors $m$ we want to keep. Since $m$ is discrete, the cost function

with respect to $m$ is non-differentiable, and testing all possible values of $m$ is compu-

tationally expensive, because the range of $m$ depends on the size of the training

set. Here, we present an alternative approach to select the optimal subset. The basic

idea is to use the percentage of the variance $r$ to determine the number of principal

components, $r = \sum_{i=1}^{m} l_i \,/\, \sum_{i=1}^{t} l_i$, where $t$ is the rank of $\tilde{K}$. Note that $r$ can change continuously

(from 0 to 1) and can thus be incorporated in a gradient descent framework.
Since KPCR differs from KRR only in the solution vectors $\{\alpha_i\}_{i=1,\ldots,q}$, we need to derive

$\frac{\partial \alpha_i}{\partial \sigma}$ and $\frac{\partial \alpha_i}{\partial r}$. The derivative with respect to $\sigma$ is given by

$$\frac{\partial \alpha_i}{\partial \sigma} = \sum_{j=1}^{m} \frac{\partial\, \frac{1}{l_j} v_j v_j^T}{\partial \sigma}\, y_i
 = \sum_{j=1}^{m} \left( -\frac{1}{l_j^2}\frac{\partial l_j}{\partial \sigma}\, v_j v_j^T + \frac{1}{l_j}\frac{\partial v_j}{\partial \sigma} v_j^T + \frac{1}{l_j} v_j \frac{\partial v_j^T}{\partial \sigma} \right) y_i,$$

where $\frac{\partial l_j}{\partial \sigma} = v_j^T \frac{\partial K}{\partial \sigma} v_j$, $\frac{\partial v_j}{\partial \sigma} = (K - l_j I_n)^{+}\frac{\partial K}{\partial \sigma} v_j$ [59], and $A^{+}$ is the pseudoinverse of

the matrix $A$.

The partial derivative with respect to $r$ cannot be given in closed form, because an explicit

definition of $\alpha_i$ as a function of $r$ does not exist. We resolve this issue by deriving an

approximation to $\frac{\partial \alpha_i}{\partial r}$ using a Taylor expansion. That is,

$$\alpha_i(r + \Delta r) = \alpha_i(r) + \Delta r\, \alpha_i'(r) + \frac{(\Delta r)^2}{2!}\alpha_i''(r) + \frac{(\Delta r)^3}{3!}\alpha_i'''(r) + O(\Delta r^4),$$
$$\alpha_i(r - \Delta r) = \alpha_i(r) - \Delta r\, \alpha_i'(r) + \frac{(\Delta r)^2}{2!}\alpha_i''(r) - \frac{(\Delta r)^3}{3!}\alpha_i'''(r) + O(\Delta r^4).$$

Combining the two equations above, we have

$$\alpha_i'(r) = \frac{\alpha_i(r + \Delta r) - \alpha_i(r - \Delta r)}{2\Delta r} + O(\Delta r^2).$$

Therefore, we can write

$$\frac{\partial \alpha_i}{\partial r} \approx \frac{\alpha_i(r + \Delta r) - \alpha_i(r - \Delta r)}{2\Delta r} = \frac{\sum_{j=m_1+1}^{m_2} \frac{1}{l_j} v_j v_j^T y_i}{2\Delta r},$$

where $m_1$ and $m_2$ are selected such that $\sum_{i=1}^{m_1} l_i / \sum_{i=1}^{t} l_i \le r - \Delta r < \sum_{i=1}^{m_1+1} l_i / \sum_{i=1}^{t} l_i$

and $\sum_{i=1}^{m_2} l_i / \sum_{i=1}^{t} l_i \le r + \Delta r < \sum_{i=1}^{m_2+1} l_i / \sum_{i=1}^{t} l_i$.
Table 3.1: Results for KRR. Mean RMSE and standard deviation (in parentheses).

Kernel RBF Polynomial


Data set/Method Modified ε-constraint ε-constraint CV GCV Modified ε-constraint ε-constraint CV GCV
Housing 2.89*(0.77) 3.01(0.78) 3.25(0.84) 4.01(1.01) 3.71(0.87) 4.38(0.99) 4.24(1.03) 8.67(6.78)
Mpg 2.51*(0.52) 2.59(0.57) 2.72(0.40) 2.61(0.52) 2.82(0.45) 3.25(0.58) 3.24(0.57) 3.21(0.80)
Slump 6.62*(1.49) 7.36(2.29) 6.70(1.53) 22.1(8.95) 7.09(1.22) 8.85(2.05) 9.86(1.53) 7.20(1.77)
Price 2.21*(0.90) 2.73(1.54) 2.42(0.90) 8.88(5.43) 3.08(1.20) 3.29(1.50) 4.01(1.48) 3.41(1.5)
Diabetes 0.55*(0.23) 0.72(0.33) 0.57(0.19) 0.88(0.31) 0.52*(0.17) 0.60(0.20) 2.31(0.87) 0.62(0.33)
Wdbc 31.46*(1.59) 32.15*(4.86) 31.50*(4.37) 50.30(9.13) 34.11(4.23) 35.12(5.21) 46.61(6.89) 32.04*(4.35)
Servo 0.51*(0.29) 0.56(0.30) 0.59(0.32) 0.81(0.52) 0.70(0.25) 0.70(0.25) 0.75(0.25) 0.65(0.27)
Puma-8nm 1.44*(0.02) 1.51(0.03) 2.42(0.05) 1.44*(0.03) 1.42*(0.02) 1.89(0.04) 1.89(0.04) 1.46(0.02)
Puma-8nh 3.65(0.03) 3.64(0.03) 3.98(0.06) 3.56*(0.04) 5.08(1.26) 5.28(0.19) 4.11(0.14) 3.61*(0.06)
Puma-8fm 1.13*(0.01) 1.19(0.02) 1.19(0.01) 1.14*(0.02) 1.27(0.01) 1.37(0.09) 1.29(0.005) 1.27(0.01)
Puma-8fh 3.23*(0.01) 3.45(0.02) 3.23*(0.01) 3.23*(0.01) 3.78(0.16) 4.86(0.14) 3.23(0.01) 3.24*(0.02)
Kin-8nm 0.11*(0.002) 0.15(0.003) 0.16(0.002) 0.19(0.02) 0.18(0.0008) 0.24(0.03) 0.22(0.002) 0.19(0.01)
Kin-8nh 0.18*(0.001) 0.18(0.002) 0.19(0.002) 0.18(0.002) 0.20(0.002) 0.29(0.006) 0.24(0.003) 0.22(0.003)
Kin-8fm 0.016(0.002) 0.016(0.03) 0.013*(0.0001) 0.339(0.202) 0.013*(0.0001) 0.02(0.0001) 0.16(0.003) 0.015(0.0001)
Kin-8fh 0.07(0.002) 0.061(0.002) 0.043*(0.0002) 0.043*(0.0002) 0.046(0.0002) 0.046(0.0002) 0.16(0.003) 0.050(0.0003)
In each kernel, the best result is in bold. The symbol * is used to indicate the top result
over all methods and kernels.

Table 3.2: Results for KPCR. Mean RMSE and standard deviation (in parentheses).

Kernel RBF Polynomial


Data set/Method Modified ε-constraint ε-constraint CV GCV Modified ε-constraint ε-constraint CV GCV
Housing 4.04*(0.88) 4.56(0.67) 9.14(1.10) 11.99(6.89) 8.45(1.72) 9.12(2.30) 6.05(0.95) 9.37(1.77)
Mpg 3.00*(0.58) 4.64(0.82) 7.71(0.90) 3.64(1.63) 7.30(0.81) 7.82(1.54) 5.92(1.00) 8.16(1.78)
Slump 6.39*(1.53) 7.55(1.68) 9.28(1.94) 7.64(1.42) 7.68(1.88) 8.15(2.11) 8.48(2.80) 9.49(3.00)
Price 3.90*(2.16) 4.67(2.15) 12.62(2.02) 9.78(2.98) 6.06(1.93) 6.27(2.29) 5.79(1.49) 6.61(1.61)
Diabetes 0.76*(0.33) 0.96(0.43) 0.99(0.53) 0.74*(0.34) 1.01(1.47) 0.73*(0.80) 1.85(1.92) 1.23(1.31)
Wdbc 30.66*(4.71) 35.32(5.87) 33.5(4.53) 43.53(7.05) 34.47(10.27) 56.68(7.71) 47.21(13.44) 41.13(14.89)
Servo 0.71*(0.30) 1.35(0.33) 1.41(0.34) 1.29(0.40) 1.13(0.25) 1.11(0.25) 0.74(0.24) 0.81(0.24)
Puma-8nm 3.69(0.02) 3.66(0.02) 2.42(0.05) 1.75*(0.07) 3.71(0.32) 4.12(0.25) 4.13(0.53) 4.15(0.70)
Puma-8nh 4.39(0.04) 4.39(0.02) 4.56(0.13) 3.65*(0.08) 4.58(0.29) 4.84(0.22) 4.56(0.16) 5.59(0.58)
Puma-8fm 1.28*(0.05) 1.73(0.77) 4.04(1.13) 1.26*(0.01) 1.29*(0.005) 1.46(0.36) 1.56(0.61) 1.81(0.44)
Puma-8fh 3.22*(0.01) 3.33*(0.28) 3.49(0.08) 3.26*(0.07) 3.75(0.24) 3.92(0.41) 3.99(0.39) 5.04(0.74)
Kin-8nm 0.19*(0.01) 0.19*(0.01) 0.22(0.02) 0.22(0.01) 0.22(0.04) 0.21(0.03) 0.26(0.05) 0.30(0.07)
Kin-8nh 0.21*(0.007) 0.21*(0.01) 0.23(0.01) 0.24(0.01) 0.25*(0.05) 0.30(0.09) 0.27(0.05) 0.33(0.07)
Kin-8fm 0.05(0.01) 0.06(0.04) 0.03(0.007) 0.04(0.01) 0.02*(0.0001) 0.05(0.08) 0.08(0.11) 0.10(0.13)
Kin-8fh 0.06*(0.01) 0.07(0.02) 0.05(0.006) 0.06*(0.01) 0.07*(0.07) 0.07*(0.07) 0.12(0.12) 0.12(0.12)

3.5 Experimental results

In this section, we will use the Pareto-optimal criterion derived in this chapter to

select the appropriate kernel parameters of KRR and KPCR. Comparisons with the

state of the art as well as the alternative criteria (i.e., sum and product) defined in

the preceding section are provided.

3.5.1 Standard data-sets

We select fifteen data-sets from the UCI machine learning databases [7] and

the DELVE collections [29]. Specifically, these databases include the following sets

(in parenthesis we show the number of samples/number of dimensions): Boston

housing (506/14), auto mpg (398/8), slump(103/8), price(159/16), diabetes(43/3),

wdbc(194/33), servo(167/5), puma-8nm (8192/9), puma-8nh (8192/9), puma-8fm

(8192/9), puma-8fh (8192/9), kin-8nm (8192/9), kin-8nh (8192/9), kin-8fm (8192/9)

and kin-8fh (8192/9). The Boston housing data-set was collected by the U.S. Census

Service and describes the housing information in Boston, MA. The task is to predict

the median value of a home. The auto mpg set details fuel consumption predicted

in terms of 3 discrete and 4 continuous attributes. In the slump set, the concrete

slump is predicted by 7 different ingredients. The price data-set requires predicting

the price of a car based on 15 attributes. In the diabetes set, the goal is to predict the

level of the serum C-peptide. In the Wisconsin Diagnostic Breast Cancer (wdbc) set,

the time of the recurrence of breast cancer is predicted based on 32 measurements

of the patients. The servo set concerns a robot control problem. The rise time of a

servomechanism is predicted based on two gain settings and two choices of mechan-

ical linkages. The task in the Pumadyn sets is to predict angular acceleration from a

simulation of the dynamics of a robot arm. And, the Kin set requires us to predict

the distance of the end-effector from a target in a simulation of the forward dynamics

of an 8 link all-revolute robot arm. There are different scenarios in both Pumadyn

and Kin data-sets according to the degree of non-linearity (fairly-linear or nonlinear)

and the amount of noise (moderate or high).

To test our approach, for each data-set, we generate five random permutations and

conduct 10-fold cross-validation on each one. The mean and the standard deviations

are reported. In the experiments, we use the root mean squared error (RMSE) as our

measure of the deviation between the true response $y_i$ and the predicted response $\hat{y}_i$,

i.e., $\mathrm{RMSE} = \left[ n^{-1}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \right]^{1/2}$.

When using the ε-constraint criterion, we employ the interior-point method of

[47]. Recall that in our proposed modified ε-constraint criterion, we also need to

select a small scalar $s$. In all our experiments, $s = 10^{-3}$.

We compare our approaches to the two typical criteria used in the literature, Cross-

Validation (CV) and Generalized Cross-Validation (GCV) [37, 96]. In particular, we

employ a 10-fold CV. The kernel parameter $\sigma$ of the RBF is searched in the range

$[\mu - 2\delta, \mu + 2\delta]$, where $\mu$ and $\delta$ are the mean and standard deviation of the distances

between all pairwise training samples. In the polynomial kernel, the degree is tested in

the range of 1 to 6. The regularization parameter $\tau$ in KRR is selected among the set

$\{10^{-5}, \ldots, 10^{4}\}$, and the percentage of variance $r$ in KPCR is searched in the range

$[0.8, 1]$. Moreover, we compare our modified ε-constraint approach with the original

ε-constraint method.

Table 3.1 shows the regression results of KRR using both the RBF and the poly-

nomial kernels. A two-sided paired Wilcoxon signed rank test is used to check sta-

tistical significance. The error in bold is significantly smaller than the others at

significance level 0.05. We see that regardless of the kernel used, the proposed mod-

ified ε-constraint approach consistently provides the smallest RMSE. We also note

that the modified ε-constraint approach obtains smaller RMSE than the ε-constraint

method.

Table 3.2 shows the regression results of KPCR using the RBF and polynomial

kernels. Once more, the proposed approach generally outperforms the others. Ad-

ditionally, as in KRR, the modified ε-constraint approach generally yields the best

results.

A major advantage of the proposed approach over CV is that it uses all the

training data for training. In contrast, CV needs to use part of the training data for

verification purposes. This limits the amount of training data used to fit the function

to the data.

3.5.2 Comparison with the state of the art

We now provide a comparison with the methods available in the literature and

typically employed in the above databases. Specifically, we compare our results with

Support Vector Regression (SVR) [93] with the RBF and polynomial kernels, Multiple

Kernel Learning in SVR (MKL-SVR) [76], and Gaussian Processes for Regression

(GPR) [104]. In SVR, the parameters are selected using CV. In MKL-SVR, we

employ three kernel functions: the RBF, the polynomial and the Laplacian, defined as

$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|}{\beta}\right)$. The RBF kernel parameter is set to be the mean of the

Table 3.3: Mean and standard deviation of RMSE of different methods.
Data set/Method Modified ε-constraint SVR (RBF) SVR (poly) MKL-SVR GPR
Housing 2.89(0.77) 3.45(1.04) 5.66(1.88) 3.34(0.70) 3.05(0.82)
Mpg 2.51(0.52) 2.69(0.60) 4.03(0.96) 2.67(0.61) 2.64(0.50)
Slump 6.62(1.49) 6.77(1.90) 8.37(2.86) 6.90(1.41) 6.88(1.51)
Price 2.21(0.90) 2.40(0.84) 3.72(1.55) 2.51(0.91) 11.2(2.26)
Diabetes 0.55(0.23) 0.68(0.31) 0.78(0.39) 0.65(0.35) 0.59(0.20)
Wdbc 31.46(1.59) 32.08(4.76) 44.1(9.87) 32.20(4.65) 31.60(4.3)
Servo 0.51(0.29) 0.61(0.35) 1.37(0.41) 0.60(0.36) 0.57(0.30)
Puma-8nm 1.44(0.02) 1.44(0.03) 3.35(0.11) 1.51(0.02) 1.47(0.03)
Puma-8nh 3.65(0.03) 3.67(0.06) 4.55(0.07) 3.78(0.05) 3.65(0.03)
Puma-8fm 1.13(0.01) 1.17(0.02) 2.04(0.05) 1.21(0.03) 1.17(0.02)
Puma-8fh 3.23(0.01) 3.24(0.02) 3.84(0.06) 3.35(0.05) 3.23(0.01)
Kin-8nm 0.11(0.002) 0.12(0.002) 0.21(0.003) 0.16(0.03) 0.12(0.002)
Kin-8nh 0.18(0.001) 0.19(0.003) 0.23(0.01) 0.20(0.002) 0.18(0.002)
Kin-8fm 0.016(0.002) 0.043(0.002) 0.048(0.001) 0.045(0.002) 0.013(0.00009)
Kin-8fh 0.07(0.002) 0.047(0.0009) 0.06(0.006) 0.05(0.001) 0.043(0.0007)

Table 3.4: Comparison of our results with the state of the art.
Housing Mpg Slump Price Diabetes servo Puma-8nm
Best 3.46(0.93) 2.67(0.61) 6.79(1.89) 2.62(0.87) 0.68(0.25) 0.59(0.30) 1.47(0.03)
Ours 2.89(0.77) 2.51(0.50) 6.62(1.49) 2.21(0.90) 0.55(0.23) 0.51(0.29) 1.44(0.02)
Puma-8nh Puma-8fm Puma-8fh Kin-8nm Kin-8nh Kin-8fm Kin-8fh
Best 3.65(0.03) 1.17(0.02) 3.23(0.01) 0.12(0.002) 0.18(0.002) 0.013(0.00009) 0.043(0.0007)
Ours 3.65(0.03) 1.13(0.01) 3.23(0.01) 0.11(0.002) 0.18(0.002) 0.016(0.002) 0.07(0.002)

distances between all pairwise training samples; the degree of the polynomial kernel is
set to 2; and $\beta$ in the Laplacian kernel is set as $\beta = \frac{2}{n(n+1)}\sum_{i=1}^{n}\sum_{j=i}^{n} \|x_i - x_j\|$, where

n is the number of training samples. MOSEK [3] is used to solve the quadratically

constrained programming problems to get the combinational coefficients of the kernel

matrices. In GPR, the hyperparameters of the mean and covariance functions are

determined by minimizing the negative log marginal likelihood of the data.
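For reference, the three base kernel matrices used in the MKL-SVR comparison can be formed as follows; this is a sketch of the kernel construction only, with β written explicitly, and the MOSEK-based combination step is not shown.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def mkl_base_kernels(X):
    """RBF, polynomial (degree 2), and Laplacian base kernels with the settings described above."""
    d = squareform(pdist(X))                    # pairwise Euclidean distances
    beta = d[np.triu_indices_from(d)].mean()    # 2/(n(n+1)) * sum_{i} sum_{j>=i} ||x_i - x_j||
    K_rbf = np.exp(-d ** 2 / (2 * beta ** 2))   # RBF width set to the mean pairwise distance
    K_poly = (X @ X.T + 1) ** 2
    K_lap = np.exp(-d / beta)
    return K_rbf, K_poly, K_lap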

We compare the results given by the above algorithms with those obtained by our

approach applied to KRR and using the RBF kernel, because this method tends to

Table 3.5: Regression performance with alternative optimization criteria.
Method KRR (RBF) KRR (polynomial) KPCR (RBF) KPCR (polynomial)
Data set Ours Sum Product Ours Sum Product Ours Sum Product Ours Sum Product
Housing 2.89(0.77) 3.06(0.78) 3.30(0.85) 3.71(0.87) 4.75(0.89) 4.66(1.75) 4.04(0.88) 9.46(5.73) 6.48(3.85) 8.45(1.72) 5.56(6.77) 4.97(3.98)
Mpg 2.51(0.52) 2.63(0.50) 2.64(0.49) 2.82(0.45) 4.34(2.51) 18.04(2.59) 3.00(0.58) 4.56(0.83) 4.25(0.69) 7.30(0.81) 5.48(6.04) 4.29(3.21)
Slump 6.62(1.49) 6.87(1.51) 6.85(1.51) 7.09(1.22) 8.03(1.78) 13.17(2.84) 6.39(1.53) 7.65(1.86) 7.64(1.87) 7.68(1.88) 14.70(9.04) 16.79(9.10)
Price 2.21(0.90) 2.73(1.45) 2.76(1.46) 3.08(1.20) 2.72(1.11) 3.10(1.11) 3.90(2.16) 4.17(2.85) 4.17(2.86) 6.06(1.93) 8.76(1.99) 13.43(20.9)
Diabetes 0.55(0.23) 0.66(0.29) 0.75(0.33) 0.52(0.17) 0.60(0.21) 3.47(2.15) 0.76(0.33) 0.86(0.45) 0.86(0.43) 1.01(1.47) 1.09(1.56) 0.65(0.22)
Wdbc 31.46(1.59) 47.99(7.01) 48.60(8.98) 34.11(4.23) 51.31(16.98) 64.29(73.02) 30.66(4.71) 34.01(4.86) 38.91(5.31) 34.47(10.27) 56.61(19.90) 56.58(32.64)
Servo 0.51(0.29) 0.60(0.31) 0.57(0.30) 0.70(0.25) 0.94(0.36) 1.33(0.40) 0.71(0.30) 0.83(0.50) 0.86(0.49) 1.13(0.25) 0.70(0.27) 0.91(0.33)
Puma-8nm 1.44(0.02) 1.50(0.03) 1.50(0.03) 1.42(0.02) 3.40(0.59) 3.89(0.04) 3.69(0.02) 2.25(0.38) 2.37(0.53) 3.71(0.32) 8.50(0.87) 6.96(3.00)
Puma-8nh 3.65(0.03) 3.80(0.03) 3.86(0.04) 5.08(1.26) 4.36(0.29) 4.54(0.03) 4.39(0.04) 3.90(0.64) 3.97(0.65) 4.58(0.29) 13.82(3.81) 11.27(5.41)
Puma-8fm 1.13(0.01) 1.18(0.01) 1.18(0.01) 1.27(0.01) 1.52(0.48) 2.79(0.12) 1.28(0.05) 2.58(0.97) 2.09(0.84) 1.29(0.005) 6.98(1.23) 3.88(2.81)
Puma-8fh 3.23(0.01) 3.28(0.02) 3.28(0.02) 3.78(0.16) 3.81(0.38) 3.79(0.08) 3.22(0.01) 3.40(0.38) 3.30(0.12) 3.75(0.24) 9.78(5.64) 7.82(5.44)
Kin-8nm 0.11(0.002) 0.12(0.008) 0.13(0.01) 0.18(0.0008) 0.21(0.02) 0.68(0.35) 0.19(0.01) 0.21(0.02) 0.22(0.02) 0.22(0.04) 0.27(0.24) 0.26(0.25)
Kin-8nh 0.18(0.001) 0.19(0.002) 0.19(0.002) 0.20(0.002) 0.22(0.005) 0.42(0.27) 0.21(0.007) 0.22(0.002) 0.23(0.01) 0.25(0.005) 0.50(0.47) 0.29(0.01)
Kin-8fm 0.016(0.002) 0.020(0.0005) 0.020(0.0005) 0.013(0.0001) 0.020(0.0001) 0.57(0.22) 0.05(0.01) 0.07(0.03) 0.06(0.01) 0.02(0.0001) 0.11(0.29) 0.03(0.01)
Kin-8fh 0.07(0.002) 0.05(0.0007) 0.046(0.0005) 0.046(0.0002) 0.05(0.0001) 0.75(0.30) 0.06(0.01) 0.08(0.03) 0.08(0.02) 0.07(0.07) 0.13(0.25) 0.06(0.02)

yield more favorable results. The comparisons are shown in Table 3.3. Note that our

approach generally yields smaller RMSE.

Furthermore, for each of the data-sets described above, we provide a comparison

between our results and the best results found in the literature. For the Boston

housing data-set, [91] reports the best fits with Relevance Vector Machine (RVM);

for the Auto mpg data-set, the best result is obtained by MKL-SVR [76]; for the

Slump data, [22] proposes a k nearest neighbor based regression method and shows

its superiority over others; for the price data-set, [100] reports the best result with

pace regression; the Diabetes data-set is used in [24] and the best result is obtained

using Least Angle Regression; for the servo data-set, [26] shows that regression with

random forests gets best results; and for the last eight data-sets, Gaussian processes

for regression trained with a maximum-a-posteriori approach is generally considered

to provide state of the art results [103]. The comparison across all the data-sets is

given in Table 3.4. We see that our approaches provide better or comparable results

to the top results described in the literature but with the main advantage that a

single algorithm is employed in all data-sets.

Table 3.6: Comparison with L2 norm.
Method KRRR KRRP PCRR PCRP
Data set Ours L2 norm Ours L2 norm Ours L2 norm Ours L2 norm
Housing 2.89(0.77) 3.45(0.95) 3.71(0.87) 4.96(0.92) 4.04(0.88) 4.36(0.96) 8.45(1.72) 7.40(1.72)
Mpg 2.51(0.52) 3.09(0.51) 2.82(0.45) 4.19(2.23) 3.00(0.58) 3.45(0.75) 7.30(0.81) 7.42(1.29)
Slump 6.62(1.49) 6.98(1.48) 7.09(1.22) 14.97(2.23) 6.39(1.53) 6.43(1.47) 7.68(1.88) 8.12(2.08)
Price 2.21(0.90) 2.81(1.21) 3.08(1.20) 2.45(3.77) 2.35(1.04) 2.73(1.31) 6.06(1.93) 5.88(1.73)
Diabetes 0.55(0.23) 0.68(0.25) 0.52(0.17) 0.78(0.20) 0.76(0.33) 0.87(0.43) 1.01(1.47) 0.94(1.40)
Wdbc 31.46(1.59) 32.10(4.56) 34.11(4.23) 42.69(13.41) 30.66(4.71) 30.69(4.66) 34.47(10.27) 45.79(15.69)
Servo 0.51(0.29) 0.90(0.31) 0.70(0.25) 0.96(0.34) 0.71(0.30) 0.73(0.31) 1.13(0.25) 1.03(0.25)
Puma-8nm 1.44(0.02) 1.47(0.03) 1.42(0.02) 3.84(0.04) 3.69(0.02) 3.37(0.04) 3.71(0.32) 4.21(0.16)
Puma-8nh 3.65(0.03) 3.75(0.03) 5.08(1.26) 4.66(0.06) 4.39(0.04) 4.19(0.14) 4.58(0.29) 4.61(0.31)
Puma-8fm 1.13(0.01) 1.23(0.01) 1.27(0.01) 1.63(0.49) 1.28(0.05) 1.26(0.003) 1.29(0.005) 1.58(0.64)
Puma-8fh 3.23(0.01) 3.23(0.01) 3.78(0.16) 4.06(0.03) 3.22(0.01) 3.30(0.12) 3.75(0.24) 3.97(0.52)
Kin-8nm 0.11(0.002) 0.17(0.001) 0.18(0.0008) 0.21(0.03) 0.19(0.01) 0.16(0.03) 0.22(0.04) 0.22(0.03)
Kin-8nh 0.18(0.001) 0.20(0.001) 0.20(0.002) 0.26(0.007) 0.21(0.007) 0.21(0.002) 0.25(0.005) 0.29(0.09)
Kin-8fm 0.016(0.002) 0.020(0.0003) 0.013(0.0001) 0.024(0.0005) 0.05(0.01) 0.03(0.003) 0.02(0.0001) 0.05(0.08)
Kin-8fh 0.07(0.002) 0.06(0.0007) 0.046(0.0002) 0.067(0.0005) 0.06(0.01) 0.06(0.004) 0.07(0.07) 0.05(0.003)

3.5.3 Alternative Optimizations

In Section 3.3.4, we presented two alternatives for combining different objective

functions: the sum and the product criteria. Here we provide a comparison of these

criteria and the approach derived in this chapter. In particular, we combine model

fit Ef and model complexity Ec via the summation and product in KRR and KPCR.

The regularization term in (3.29) and in (3.31) is selected by 5-fold CV. Table

3.5 shows the corresponding regression results. In this table, A_R and A_P denote the method A with an RBF and a polynomial kernel, respectively. We see that these two

alternative criteria generally perform worse than the Pareto-optimal based approach.

3.5.4 Comparison with the L2 norm

We give a comparison between our complexity measure Ec and the commonly used

L2 norm. The results are shown in Table 3.6. We see that the proposed complexity

measure generally outperforms the L2 norm in penalizing the regression function.

3.5.5 Age estimation

In the last two sections we want to test the derived approach on two classical

applications: age estimation from faces and weather prediction.

The process of aging can cause significant changes in human facial appearances.

We used the FG-NET aging database described in [2] to model these changes. This

data-set contains 1,002 face images of 82 subjects at different ages. The age ranges

from 0 to 69. Face images include changes in illumination, pose, expression and

occlusion (e.g., glasses and beards). We warp all images to a standard size and

constant position for mouth and eyes as in [60]. All the pictures are warped to a

common size of 60 × 60 pixels and converted to 8-bit graylevel images. Warped

images of one individual are shown in Figure 3.4. We represent each image as a

vector concatenating all the pixels of the image, i.e., the appearance-based feature

representation.

We generate five random divisions of the data, each with 800 images for training

and 202 for testing. The mean absolute errors (MAE) are in Table 3.7. We can see

that the modified ε-constraint method outperforms the other algorithms. In [115],

the authors represent the images using a set of highly redundant Haar-like features

and select relevant features using a boosting method. We implemented this method

using the same five divisions of the data. Our approach is slightly better using a

simpler appearance-based representation.

Figure 3.4: Sample images showing the same person at different ages.

Table 3.7: MAE of the proposed approach and the state of the art in age estimation.
Modified ε-constraint CV GCV SVRrbf SVRpol MKL-SVR GPR [115]
MAE 5.85 6.59 13.83 6.46 6.95 7.18 15.46 5.97

3.5.6 Weather prediction

The weather data of the University of Cambridge [102] is used in this experiment.

The maximum temperature of a day is predicted based on several parameters mea-

sured every hour during the day. These parameters include pressure, humidity, dew

point (i.e., the temperature at which a parcel of humid air must be cooled for it to

condense), wind speed (in knots), sunshine hours and rainfall. We use the data in a period of

five years (2005-2009) for training and the data between January and July of the year

2010 for testing. This corresponds to 1,701 training samples and 210 testing samples.

The results are in Table 3.8. In [77], the authors employed support vector regression

and reported state of the art results. Our experiment shows that our approach performs

better than their algorithm. The predictions obtained from the modified ε-constraint

Table 3.8: RMSE of several approaches applied to weather prediction.
Modified ε-constraint CV GCV SVRrbf SVRpol MKL-SVR GPR
RMSE 0.81 0.83 0.90 0.87 0.95 1.07 2.53

approach are also plotted in Figure 3.5. We observe that our approach can provide

the prediction of the daily maximum temperature with high accuracy.

3.6 Conclusions

Non-linear regression is a fundamental problem in machine learning and pattern

recognition with multiple applications in science and engineering. Many approaches

have been proposed for linear regressions, but their non-linear extensions are known

to present several limitations. A major limitation is the lack of regularization of the

regressor. Without proper regularization, the complexity of the estimated function

(e.g., the degree of the polynomial describing the function) could increase rapidly,

yielding poor generalizations on the unseen testing set [74]. To resolve this prob-

lem, we have derived a roughness penalty that measures the degree of change (of the

regressed function) in the kernel space. This measure can then be used to obtain esti-

mates that (in general) generalize better to the unseen testing set. However, to achieve

this, the newly derived objective function needs to be combined with the classical one

measuring its fitness (i.e., how well the function estimates the sample vectors). Clas-

sical solutions would be to use the sum or product of the two objective functions [113].

However, we have shown that these solutions do not generally yield desirable results

in kernel methods in regression. To resolve this issue, we have proposed a multiobjective

[Plot: true and predicted daily maximum temperature (y-axis: Max temperature, x-axis: Days).]

Figure 3.5: This figure plots the estimated (lighter dashed curve) and actual (darker dashed
curve) maximum daily temperature for a period of more than 200 days. The estimated
results are given by the algorithm proposed in this chapter.

optimization approach based on the idea of Pareto-optimality. In this MOP frame-

work, we have derived a novel method: the modified ε-constraint approach. While

the original ε-constraint method cannot guarantee Pareto-optimal solutions, we have

proven that the derived modified version does. Extensive evaluations with a large

variety of databases have shown that the proposed modified ε-constraint approach

yields better generalizations than previously proposed algorithms.

The other major contribution of the chapter has been to show how we can use the

derived approach for optimizing the kernel parameters. In any kernel method, one

always has to optimize the parameters of the kernel mapping function. The classical

approach for this task is CV. This technique suffers from two main problems. First,

it is computationally expensive. Second, and arguably most important, it cannot use

the entire sample set for training, because part of it is employed as a validation set.

But, we know that (in general) the larger the training set, the better. Our proposed

MOP framework is ideal for optimizing the kernel parameters, because it yields differentiable

objective functions that can be minimized with standard gradient descent techniques.

We have provided extensive comparisons of the proposed approach against CV

and GCV and the other state of the art techniques in kernel methods in regression.

We have also compared our results to those obtained with the sum and product

criteria. And, we have compared our results to the best fits found in the literature for

each of the databases. In all cases, these comparisons demonstrate that the proposed

approach yields fits that generalize better to the unseen testing sets.

CHAPTER 4

LOCAL DENSITY ADAPTIVE KERNELS

4.1 Introduction

The performance of the kernel methods greatly depends on the selection of the

kernel functions. An appropriate kernel function can lead to a substantial improve-

ment in the generalization ability of the learning approaches [69, 105, 10, 19]. Ideally,

the choice of the kernel function is based on the prior knowledge of the problem do-

main. Unfortunately, in general, we do not have prior knowledge on the data, and

thus have no clue on which kernel to use.

One of the most commonly used kernels in the literature is the Radial Basis Function (RBF), defined as k(xi, xj) = exp(-||xi - xj||²/σ), where σ is a kernel parameter.

In this kernel, data sample evaluation is equivalent to the likelihood calculation based

on Parzen windows [73, 25], which is a non-parametric density estimator. The Parzen

window size (i.e., the kernel parameter σ) significantly affects the algorithm's perfor-

mance. This parameter controls the size of the neighborhood centered at the point

that is being evaluated. Estimates with too large a σ will suffer from oversmoothing (where the real underlying structure is obscured), while too small a σ will lead to a

wiggly estimate (which has too much statistical variability). It is important to note

that an important assumption associated with the use of a fixed σ is that the same


Figure 4.1: A two class example. Each class is represented by a mixture of two Gaussians
with different covariance matrices. The RBF and the proposed Local-density Adaptive (LA)
kernels are evaluated at the four marked points. (a) Density estimation in the RBF
kernel uses a fixed window, illustrated by black circles. Note that this fixed window cannot
capture different local densities. (b) Density estimation with the proposed LA kernel.

Gaussian distribution is imposed on the neighborhood of every data sample. This

means that the use of a fixed-shape kernel is only reasonable for evenly distributed

data.

However, in practice, the data is usually drawn from a complex distribution where

the local regions have distinct densities. In such cases, a kernel with a fixed shape

such as the RBF kernel will not perform well because it cannot adapt to local changes.

This problem is illustrated in Figure 4.1(a). In this figure, we see that the RBF kernel

parameter σ would fit some local regions well, but would not be appropriate for other

local regions with distinct densities. In these cases, the well-known overfitting and

underfitting problems [42] occur.

A solution to this problem is to vary the kernel bandwidth of the Parzen density

estimate based on local densities. Some approaches have been proposed in the density

estimation literature. One well-known method is the k-nearest neighbor estimate [56],

where the density is estimated by varying the window size to accommodate k-nearest

samples of a given feature vector.

A related class of approaches is called adaptive kernel estimate [9, 89, 48] which

explicitly modifies the window size according to the local data distributions. These

approaches have been shown to provide improved performance in density estimation.

However, these methods cannot be directly used in most kernel-based approaches

for classification, because the resulting kernel is not guaranteed to be a Mercer kernel

[84], i.e., the corresponding kernel matrix is not positive semi-definite. This will

indeed lead to several significant problems. First, a kernel function which is not

positive semi-definite will not induce a reproducing kernel Hilbert space [84]. If the

inner product is not well defined, then the kernel trick cannot be used. Second, in

Support Vector Machines (SVM) [92], the geometric interpretation (i.e., maximizing

the margin) is only available in the case of positive semi-definite and conditionally

positive semi-definite functions [82]. Also, in such cases, the solution is unique since

the optimization problem in SVM is convex.

This chapter proposes a new class of kernels called Local-density Adaptive (LA)

kernels, which are guaranteed to be Mercer kernels. Thus, our kernels can be directly

used in any kernel-based approaches such as Kernel Discriminant Analysis (KDA) [67,

5], Kernel Principal Component Analysis (KPCA) [83, 71] and Kernel SVM (KSVM)

for nonlinear feature extraction and classification. The similarity of the pairwise

samples defined by LA kernels is constrained by the local density information, which

is calculated based on a weighted local variance measure. Thus, our kernels can

adaptively fit the local shape of the data while evaluating the sample similarities.

4.2 Local Density Adaptive Kernels

4.2.1 Motivation

When a kernel-based approach is employed in learning, a specific kernel function

must be selected. The Radial Basis Function (RBF) kernel is a popular choice. The

kernel parameter σ in this kernel is fixed for the entire data. Instead of using a single σ for the estimate, it is also possible to represent the distribution using a

diagonal matrix with each diagonal entry measuring the variance of each dimension, i.e., k(xi, xj) = exp(-Σ_{l=1}^{p} (xli - xlj)²/σl), where xli is the l-th dimension of sample xi and p

is the dimension of the input space. Alternatively, we can use a full covariance matrix

M, i.e., k(xi, xj) = exp(-δ(xi - xj)^T M^{-1}(xi - xj)), where δ is a scaling parameter.

This is known as the Mahalanobis kernel [1].

It is important to note that the evaluation in the above kernels assumes the data

is Gaussian distributed with fixed variance over the entire feature space. The key

idea of this chapter is to build a kernel which can automatically vary its shape (i.e.,

local variance) to adapt to local data structures.

A possible approach would be to adopt the local covariance matrix, which char-

acterizes the local structure of the data. Thus, a possible kernel function can be

formally given as,

k(xi, xj) = exp(-(xi - xj)^T Σij^{-1} (xi - xj)),   (4.1)

where Σij = (Σi + Σj)/2, and Σi and Σj are the local covariance matrices centered on the samples xi and xj, respectively. Σij is a pooled covariance matrix which

characterizes the local density information in the neighborhoods of xi and xj . The

estimation of a local covariance matrix Σi centered on xi can be obtained from the

k-nearest neighbors of xi .

Eq. (4.1) seems a reasonable kernel function, since the likelihood calculation is

now given by the local distribution. However, this function is not a Mercer kernel.

Note that if a kernel function k(xi , xj ) is a Mercer kernel, there exists a mapping

function φ(·) : R^p → F such that k(xi, xj) = φ(xi)^T φ(xj). The kernel function in

(4.1) can be rewritten as

k(xi, xj) = exp(-(xi - xj)^T Aij^T Aij (xi - xj))
         = exp(-(Aij xi - Aij xj)^T (Aij xi - Aij xj))
         = exp(-(zi - zj)^T (zi - zj)),   (4.2)

where Σij^{-1} = Aij^T Aij and zi = Aij xi. Since (4.2) is an RBF kernel w.r.t. z, there exists a mapping function φ(·) : R^p → F and

k(xi, xj) = φ(zi)^T φ(zj)
         = φ(Aij xi)^T φ(Aij xj)
         = φij(xi)^T φij(xj),   (4.3)

where φij(x) = φ(Aij x). Since φij(·) is dependent on the samples in the input space, there does not exist a unique mapping φ for the kernel function in (4.1). This implies that (4.1) is not a Mercer kernel.

4.2.2 Defining Mercer kernels

Our goal is to derive a Mercer kernel which calculates the likelihood from the

density estimation. Such a kernel k(xi , xj ) can be designed as a multiplication of two

kernel functions k1 (xi , xj ) and k2 (xi , xj ), i.e.,

k(xi , xj ) = k1 (xi , xj )k2 (xi , xj ). (4.4)

If k1 and k2 are both Mercer kernels, then k is also a Mercer kernel [84]. k1 can be

selected to be a likelihood evaluation kernel, such as the RBF or Mahalanobis kernel.

Then we need to build k2 , which measures the local density. To derive k2 , let us start

by presenting an important result.

Theorem 8. A kernel function κ(xi, xj) = q(xi)q(xj) is a Mercer kernel, if q(x) is

a non-negative function on x.

Proof. Let q = (q(x1), q(x2), . . . , q(xn))^T be an n × 1 vector, with n the number of samples. Then the kernel matrix K can be written as K = qq^T. Thus, for any α ∈ R^n,

α^T K α = α^T q q^T α = (α^T q)² ≥ 0.

This means that the kernel matrix K is positive semi-definite. And, hence, the kernel κ(xi, xj) is a Mercer kernel.

We could thus define k2 as k2(xi, xj) = ρ(xi)ρ(xj), where ρ(x) ≥ 0 for all x. ρ(x)

should be designed to reflect the density information in the neighborhood of x. One

way to achieve this is to measure the variance of the data in the neighborhood of x.

Formally,

ρ(x) = (1/k) Σ_{i=1}^{k} ||xi - x||².   (4.5)

Eq. (4.5) measures the local variance in the neighborhood of x, characterized by

the k-nearest neighbors of x. This means the local variance information is calculated

106
only from the k samples which are closest to x; the influence of the other samples is not considered. More generally, ρ(x) can be defined as

ρ(x) = Σ_{i=1}^{n} hx(xi) ||xi - x||² / Σ_{i=1}^{n} hx(xi),   (4.6)

where hx (xi ) is a weighting function (i.e., a kernel) which depends on x. We note

that (4.5) is a special case of (4.6). To see this, we first denote Nk (x) as the set of

samples that are the k nearest neighbors of x. Then, a uniform kernel hx (.) is defined

as

hx(xi) = 1/m if xi ∈ Nk(x), and 0 otherwise,
where m is a normalizing factor that ensures the kernel integrates to 1. This makes

(4.6) equivalent to (4.5).

Alternatively, we can incorporate the influences of all the samples in the input

space, as the soft neighborhood used in kernel regression [70, 42]. The weight of

each sample xi is calculated based on its distance from x. In this chapter, we adopt the Gaussian kernel, hx(xi) = (1/(√(2π) η)) exp(-||xi - x||²/(2η²)), where η is a scaling parameter.

Therefore, our local variance measure for sample xi is formally defined as

ρ(xi) = [ Σ_{j=1, j≠i}^{n} exp(-||xj - xi||²/(2η²)) ||xj - xi||² ] / [ Σ_{j=1, j≠i}^{n} exp(-||xj - xi||²/(2η²)) ]
      = [ Σ_{j=1}^{n} exp(-||xj - xi||²/(2η²)) ||xj - xi||² ] / [ Σ_{j=1}^{n} exp(-||xj - xi||²/(2η²)) ].   (4.7)

Note that (4.7) can be rewritten as ρ(xi) = tr(Σxi), where tr(·) is the trace of a matrix, and Σxi is the local covariance matrix

Σxi = [ Σ_{j=1, j≠i}^{n} exp(-||xj - xi||²/(2η²)) (xj - xi)(xj - xi)^T ] / [ Σ_{j=1, j≠i}^{n} exp(-||xj - xi||²/(2η²)) ].

The equation above shows the relationship of (4.7) and the local covariance ma-

trices, which encode the information of local distributions. To demonstrate that (4.7)


Figure 4.2: This figure illustrates how the local variance measurement given by (4.7) is
used. The vertical axis represents the magnitude of the variance around each sample.

can appropriately measure the local density information, we calculate the local vari-

ances of the data in Figure 4.1 using (4.7). The results are shown in Figure 4.2. The

vertical axis represents the local variance around each sample. We see that this local variance

measure effectively captures the local density information. The local variances are

smaller for the samples in the high density regions, and larger for the samples in the

low density regions.

It now seems that (4.7) can be readily used in our LA kernel approach. However,

a limitation of (4.7) is that it is dependent on the scale of the data, since it is related

to the distances of pairwise samples. For instance, if we apply (4.7) to a large-scale

data-set, the resulting kernel matrix could have very large values in each entry, which

would lead to numerical problems. Thus, an appropriate normalization procedure

should be added. One way to solve this is to normalize (4.7) with the average of the

local variances about each sample, i.e.,

ρs(xi) = ρ(xi) / ( (1/n) Σ_{i=1}^{n} ρ(xi) ),   (4.8)

where ρs(xi) is the scale-free local variance measure.

Combining the above results, we can define our proposed LA kernel function

k(xi , xj ) as

k(xi, xj) = ρs(xi) k1(xi, xj) ρs(xj).   (4.9)

Recall that k1 (xi , xj ) can be any likelihood evaluation kernel function with a fixed

shape such as the RBF kernel or the Mahalanobis kernel.
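To make the construction concrete, the following Python sketch (using numpy; the function and variable names are ours, and the base kernel and its parameter are illustrative choices) computes the LA kernel matrix of (4.9) on a training set, using the soft-neighborhood local variance of (4.7), its normalization (4.8), and the RBF as k1:

import numpy as np

def la_kernel(X, sigma, eta):
    # X: n x d matrix of training samples.
    sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # ||xi - xj||^2
    # Local variance of Eq. (4.7): Gaussian-weighted average of the
    # squared distances around each sample.
    w = np.exp(-sqdist / (2.0 * eta ** 2))
    rho = (w * sqdist).sum(axis=1) / w.sum(axis=1)
    # Scale-free measure of Eq. (4.8).
    rho_s = rho / rho.mean()
    # Base kernel k1: here the RBF kernel with parameter sigma.
    k1 = np.exp(-sqdist / sigma)
    # LA kernel of Eq. (4.9): k(xi, xj) = rho_s(xi) k1(xi, xj) rho_s(xj).
    return rho_s[:, None] * k1 * rho_s[None, :]

The same rho_s values can be combined with any other fixed-shape base kernel, e.g., the Mahalanobis kernel, simply by substituting it for k1.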

Note that the kernel defined in (4.9) falls into the class of conformal kernels [84],

which define a conformal transformation preserving the angles in the kernel space.

Wu and Amari [107] use a conformal kernel to increase the influence of the samples

located around the decision boundary in an attempt to improve SVM classification.

This conformal kernel is modified in [106] to adaptively address the class imbalance

problem in SVM. Later, Gonen and Alpaydin [38] extend conformal kernels to multiple

kernel learning. In the present work, we have derived a completely different conformal

function ρs(x) which encodes the local density information such that the kernel can

adaptively vary its shape to fit the local data.

4.2.3 Window size

Our kernel function in (4.9) calculates the similarity of pairwise samples based

on the likelihood of the local densities. This is equivalent to evaluating the local

likelihood using windows of different sizes. A large-size window is used for the regions

where samples are distributed sparsely, while a small-size window is applied to the

regions where the data density is high. An advantage of the proposed kernel function

is that it can achieve this goal without changing the window size explicitly. To see

this, consider the case where the neighborhood around sample xi is sparse. That

means the local variance of xi is very large, yielding a large ρs(xi). When a kernel function such as the RBF is multiplied by ρs(xi), the resultant likelihood becomes large, which is equivalent to using a large-size window (a large σ in the RBF case).

The case for a high density region can be similarly observed. Therefore, our kernel

can adaptively change the window size of neighborhoods with different densities in

an implicit way.

Moreover, note that a fixed-shape kernel function such as the RBF kernel is a

special case of our kernel, where the local variance measure ρs(xi) is a constant for

every sample xi . Thus the function does not need to incorporate information of the

local density.

4.2.4 A case study

We provide a case study with the purpose of demonstrating the utility and advan-

tages of the newly derived kernel. We employ the RBF function in k1 (xi , xj ), since

we want to make a comparison between the proposed kernel and the RBF.

We generated a set of 500 samples for each of the two classes in the XOR problem,

Figure 4.3 (a). Each class is represented by a mixture of two Gaussians, i.e., two

subclasses per class. The means of these 4 subclasses are designed so that the data is

distributed in a XOR fashion, Figure 4.3 (a). In each class, the covariance matrices

of each subclass have different scales, controlled by a factor c such that Si2 = cSi1 ,

where Sij denotes the covariance matrix of the j th subclass in class i. The larger c is,


Figure 4.3: (a) A case study with synthetic data simulating the classical XOR problem.
(b) classification accuracies of the proposed LA and RBF kernels under different covariance
factors c. The proposed kernel obtains higher classification accuracies than the RBF as c
increases.

the more different the two covariance matrices become. Thus, this data-set allows us to evaluate the kernels under different conditions where the local

regions have different densities.
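A minimal sketch of how such a two-class XOR data-set can be generated (Python with numpy; the subclass means and base covariance below are illustrative values, not the exact ones used in our experiments) is:

import numpy as np

def xor_data(n_per_subclass=250, c=10.0, seed=0):
    rng = np.random.default_rng(seed)
    S1 = np.array([[0.5, 0.0], [0.0, 0.5]])   # base covariance S_i1
    S2 = c * S1                               # S_i2 = c * S_i1
    # Subclass means arranged in an XOR fashion.
    means = {0: [(-2.0, 2.0), (2.0, -2.0)],   # class 0
             1: [(2.0, 2.0), (-2.0, -2.0)]}   # class 1
    X, y = [], []
    for label, (m1, m2) in means.items():
        X.append(rng.multivariate_normal(m1, S1, n_per_subclass))
        X.append(rng.multivariate_normal(m2, S2, n_per_subclass))
        y += [label] * (2 * n_per_subclass)
    return np.vstack(X), np.array(y)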

We let KSVM be our classifier. The kernel parameters in the RBF kernel and

the proposed LA kernel are tuned using 5-fold cross-validation (CV). We then calcu-

late the classification accuracies using an independent test set drawn from the same

distributions.

We plot the classification accuracies with respect to different covariance factors c

in Figure 4.3 (b). We see that as c increases, the RBF results degrade rapidly, whereas

those of the LA kernel do not. This is because, as the local regions become more different in density, the LA kernel adapts to these density differences. We demonstrate

the utility of the proposed LA kernel using a variety of data-sets in Section 4.4.

4.3 Kernel Parameter Selection

Now that we have derived the LA kernel, the next question to answer is how to

select the kernel parameters. Given the kernel function, the success of the kernel

approach greatly depends on the selection of its kernel parameters. Next, we present

two mechanisms to achieve this goal.

4.3.1 k-fold cross-validation

A commonly used criterion to do parameter selection is k-fold cross-validation

(CV) [42]. The training data is first partitioned into k parts. k-1 parts are used for

training, the other for validation. This process is repeated k times for each possible

value of the parameters. The parameters leading to the largest average validation

accuracy are selected.
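For illustration, a grid search over a single kernel parameter with k-fold CV could be sketched as follows (Python; train_and_score is a hypothetical callback that trains the kernel method on one split and returns its validation accuracy):

import numpy as np

def select_parameter(X, y, candidates, train_and_score, k=5, seed=0):
    # X, y: numpy arrays with the training samples and their labels.
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    best_param, best_acc = None, -np.inf
    for param in candidates:
        accs = []
        for i in range(k):
            val = folds[i]                                       # validation part
            trn = np.hstack([folds[j] for j in range(k) if j != i])
            accs.append(train_and_score(X[trn], y[trn], X[val], y[val], param))
        if np.mean(accs) > best_acc:
            best_param, best_acc = param, np.mean(accs)
    return best_param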

4.3.2 Kernel Bayes accuracy criterion

A first major problem with CV is its complexity. The training process has to be

repeated k times and the parameters are selected based on an exhaustive grid search.

When it is applied to a large-scale database, it becomes very time consuming, which

limits its use in practice. The second major problem of CV is that only part of the

training data is used to estimate the model parameters. In general, one wishes to use

the largest possible number of training samples in search of better generalizations [63].

Here, we explore the KBA criterion [114] as in (5.7). It is an efficient approximation

of the Bayes classification accuracy in the kernel space.

We use this criterion to determine the optimal kernel parameters, i.e.,

σ*, η* = arg max_{σ, η} J(σ, η).   (4.10)

4.4 Experimental Results

We present the results on various classification problems. We provide results with

k1 (xi , xj ) in (4.9) equal to the RBF and the Mahalanobis kernels, which we denote

LAR and LAM , respectively. We provide a comparison of the performance of our LA

kernels with the classical use of RBF and the Mahalanobis (denoted MA).

We apply kernel learning to three well-known approaches: KSVM, KDA and

KSDA [113]. The parameters of these kernels, as well as the number of subclasses of each class in KSDA, are selected using the two criteria defined above:

5-fold CV and KBA. The regularization parameter in KSVM is also selected using

CV. In KDA and KSDA, we employ the nearest mean (NM) and the nearest neighbor

(NN) classifiers, denoted as KDA_NM, KDA_NN, KSDA_NM and KSDA_NN, respectively.

4.4.1 UCI benchmark data-sets

We apply the derived LA kernels to seven benchmark data-sets from the UCI

repository [7]. In the Monks problem, the goal is to discriminate two distinct postures

of a robot. Monk 1, 2, and 3 denote three different cases in this task. The NIH Pima

data-set is used for the detection of diabetes from eight measurements. In the BUPA

set liver disorders are detected from a blood test. The task in the Breast Cancer

data-set is to distinguish two classes: no-recurrence and recurrence. And, the goal of

the image segmentation data-set is to classify seven outdoor object categories from a

set of 3 × 3 image patches.

The classification results of these data-sets using CV are presented in Table 4.1.

In KDA and KSDA, the proposed kernels generally outperform the RBF and the

Mahalanobis kernels, regardless of the classifiers used in the reduced space. A similar

Table 4.1: Recognition rates (%) with CV in UCI data-sets.
Method KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
Data set LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
Monk 1 87.7 83.6 90.3 83.1 90.3 83.1 88.2 88.0 88.4 87.5
Monk 2 85.2 82.6 72.5 70.1 74.5 70.1 73.8 74.5 74.1 75.7
Monk 3 97.0 93.5 91.7 82.4 92.1 82.4 94.0 94.0 94.2 89.8
Pima 78.6 79.2 80.4 72.6 72.6 72.6 79.2 77.4 77.4 74.4
Liver 71.0 68.1 66.7 69.6 63.8 69.6 69.6 66.7 65.2 65.2
B. Cancer 72.8 70.1 68.8 67.5 68.8 66.2 68.8 59.7 66.2 71.4
Image-seg 93.3 91.2 93.1 90.7 94.1 93.0 93.1 90.7 94.1 93.0
Data set LAM MA LAM MA LAM MA LAM MA LAM MA
Monk 1 89.6 82.6 84.5 81.0 85.2 81.9 85.0 81.0 85.0 81.9
Monk 2 83.6 82.4 71.5 73.8 75.5 77.8 79.6 78.9 81.3 81.3
Monk 3 94.2 93.1 94.4 93.1 92.6 91.7 94.4 93.1 92.6 94.0
Pima 76.2 76.2 79.8 78.6 76.2 75.0 78.6 76.8 76.2 72.6
Liver 73.9 72.5 71.1 68.1 71.1 68.1 71.1 66.7 71.1 63.8
B. Cancer 74.0 68.8 66.2 72.7 68.8 62.2 66.2 63.6 68.8 65.0
Image-seg 92.2 93.4 92.1 91.5 92.4 91.8 91.1 90.9 91.1 90.6
The higher classification accuracies are bolded.

observation can be made in KSVM, where the proposed LA kernels provide higher

classification accuracies. The results with the KBA criterion are shown in Table 4.2.

We see that although the kernel parameters are now selected using a different criterion,

the proposed kernels still outperform classical kernels in most of the data-sets.

4.4.2 Image databases

To further demonstrate the utility of the proposed LA kernels in real-world ap-

plications, we apply them to two image databases. The first database we will use

is the ETH-80 [53]. This database is described in the previous chapters. We adopt

the typical leave-one-object-out test, i.e., the 41 images of one of the 80 objects are

Table 4.2: Recognition rates (%) with KBA criterion in UCI data-sets.
Method KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
Data set LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
Monk 1 94.7 94.0 86.8 87.3 87.5 87.3 87.0 88.0 88.1 89.4
Monk 2 82.2 79.9 78.7 78.2 79.2 82.9 79.2 78.2 80.6 82.9
Monk 3 96.5 95.1 94.9 92.6 93.1 92.6 96.3 94.0 95.1 94.7
Pima 77.4 81.5 81.6 76.2 78.6 76.2 81.6 78.6 78.6 73.2
Liver 72.5 68.1 65.2 65.2 60.9 63.7 69.6 65.2 66.7 59.4
B. Cancer 72.7 70.1 67.5 66.2 62.3 61.0 67.5 63.6 66.2 64.9
Image-seg 91.3 91.3 88.0 92.0 93.0 94.1 90.2 92.0 93.2 92.9
Data set LAM MA LAM MA LAM MA LAM MA LAM MA
Monk 1 85.2 84.7 82.9 85.2 83.1 85.2 85.0 84.0 83.8 83.6
Monk 2 83.1 83.8 82.2 83.3 82.6 83.6 80.6 79.0 83.1 81.5
Monk 3 94.0 92.8 93.3 91.4 92.1 91.4 94.7 93.1 94.7 93.3
Pima 82.1 76.8 81.0 76.2 74.4 70.8 78.6 78.0 72.6 76.8
Liver 73.9 71.0 69.6 68.1 69.6 68.1 69.6 68.1 69.6 68.1
B. Cancer 70.1 68.8 63.6 62.3 66.2 62.3 66.2 67.5 67.5 61.0
Image-seg 91.5 89.0 92.1 90.0 92.7 90.1 91.5 90.0 92.3 90.1

used for testing and the images of the rest of the objects are used for training. This

process is repeated 80 times and the average recognition rate is reported.

The results are shown in Tables 4.3 and 4.4. We see that our kernels generally out-

perform the RBF and the Mahalanobis kernels. Note that there is a big improvement

in KDA and KSDA in Table 4.3.

We also use the CMU PIE face database [86]. This database contains 68 subjects

with a total of 41,368 images. The face images were obtained under varying pose,

illumination and expression. We select the five near-frontal poses (C05, C07, C09,

C27, C29) and use all the images under different illuminations and expressions -

around 170 images for each person.

Table 4.3: Recognition rates (%) with CV in ETH-80.
KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
83.6 81.8 80.4 71.6 80.2 71.6 80.4 71.6 80.2 71.6
LAM MA LAM MA LAM MA LAM MA LAM MA
77.0 74.6 76.6 70.2 77.0 70.4 76.6 70.2 77.0 70.4

Table 4.4: Recognition rates (%) with KBA criterion in ETH-80.


KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
82.0 81.6 81.2 84.6 81.6 84.6 81.2 84.6 81.6 84.6
LAM MA LAM MA LAM MA LAM MA LAM MA
75.0 73.6 77.8 71.5 76.6 70.8 77.8 71.5 76.6 70.8

Figure 4.4: Shown here are sample images from the PIE data-set.

Table 4.5: Recognition rates (%) with CV in PIE database.
Method KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
N LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
5 72.5 69.7 75.3 72.6 75.9 73.1 75.3 72.6 75.9 73.1
20 89.2 87.6 93.4 92.1 94.2 92.4 93.3 84.8 94.3 84.8
40 94.7 93.2 96.3 94.6 96.7 94.9 96.5 92.6 96.8 92.6
60 96.6 95.5 97.3 96.1 97.4 96.5 97.3 96.4 97.4 96.4
80 98.4 98.0 97.7 96.8 98.0 97.2 97.7 96.5 98.0 96.4
N LAM MA LAM MA LAM MA LAM MA LAM MA
5 73.6 71.2 66.9 61.0 66.8 61.0 66.9 61.0 66.8 61.0
20 89.6 88.5 89.8 85.8 89.7 85.7 88.0 83.3 87.9 83.1
40 93.4 92.7 91.2 89.6 91.4 89.6 92.5 90.7 92.3 90.6
60 95.8 94.9 93.3 91.2 93.3 91.5 93.3 91.2 93.3 91.5
80 97.8 96.7 95.7 93.4 95.9 93.2 95.7 93.4 95.9 93.2

All the face images were aligned, cropped and resized to a standard size of 32 × 32

pixels. Some sample images are shown in Figure 4.4. For each individual, we randomly

selected N (N=5, 20, 40, 60, 80) images for training and used the rest for testing.

The comparative results obtained from KSVM, KDA and KSDA are shown in Tables

4.5 and 4.6. The LA kernel consistently achieves better recognition performance than

the RBF and the Mahalanobis kernels. Again, this illustrates the effectiveness of the

proposed approach.

4.5 Conclusions

The selection of a kernel function is a main issue in kernel-based learning. An ap-

propriately selected kernel function greatly increases the generalization performance

of the learning approach. This chapter proposes a class of density adaptive Mercer

kernels which evaluate the sample similarity by taking into account the local data

Table 4.6: Recognition rates (%) with KBA criterion in PIE database.
Method KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
N LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
5 73.6 70.1 75.2 72.6 75.8 73.0 75.2 72.6 75.8 73.0
20 88.7 81.7 91.8 84.8 92.0 84.8 92.0 72.1 92.1 72.1
40 94.7 91.4 95.2 93.2 95.9 93.3 96.1 93.2 96.2 93.3
60 96.7 95.0 97.0 95.5 96.9 95.5 97.3 95.5 97.4 95.5
80 97.6 96.9 97.5 96.6 97.7 96.6 97.7 96.6 98.0 96.6
N LAM MA LAM MA LAM MA LAM MA LAM MA
5 65.2 60.6 61.5 57.1 61.4 57.1 64.9 59.6 64.8 59.6
20 91.2 80.9 88.3 84.0 88.2 84.0 84.7 79.3 84.7 79.3
40 95.3 92.5 93.3 91.6 93.2 91.5 93.3 91.6 93.2 91.5
60 96.4 93.9 94.5 92.8 94.5 92.8 94.5 92.8 94.5 92.8
80 98.0 96.5 96.6 93.4 96.7 93.5 96.6 93.4 96.7 93.5

density. While the commonly used kernels such as the RBF and the Mahalanobis

kernels evaluate the entire data using a fixed window, the kernels derived in this

chapter can automatically adjust their window size to adapt to local regions with

different densities. This enables them to effectively handle data with multiple distri-

bution forms. The proposed LA kernel approach was successfully applied to KSVM,

KDA and KSDA and shown to yield higher classification accuracies than classical

options, e.g., RBF and Mahalanobis kernels.

CHAPTER 5

KERNEL MATRIX LEARNING WITH GENETIC ALGORITHMS

5.1 Introduction

Thus far, we have focused on the model selection problem where the kernel pa-

rameters are learned given a known kernel function. In many applications, however,

we do not have prior knowledge of the data. Thus, we do not know which kernel

function may perform better. A major open problem in kernel learning is to define

algorithms that find the kernel mapping function best suited to most problems. Ide-

ally, we want to find an appropriate kernel mapping without having to pre-specify

the kernel function (such as the typically employed RBF kernel).

Instead of learning the kernel parameters of a given kernel function, one could

try to directly learn the kernel matrix. Multiple kernel learning attempts to do just

that by combining a set of known kernel maps. For example, Cristianini et al. [18]

represent the kernel matrix as a linear combination of several pre-defined kernels.

The coefficients determining how to combine the kernels are learned by aligning the

matrices with a target label matrix. Other authors, [51, 4, 49, 111] employ convex

optimization techniques within the context of Support Vector Machines (SVM) and

Kernel Discriminant Analysis (KDA). [50] proposes to learn the kernel matrix by

penalizing an Lp norm of the combination coefficients, leading to a more general

framework of multiple kernel learning. And, Crammer et al. [15] propose a boosting

approach based on the exponential and logarithmic loss. Finally, several nonlinear

combinations of kernels have also been recently defined [95, 14].

The multiple kernel learning approach just described, however, suffers from two

main limitations. First, an explicit formulation to combine different kernels has to

be pre-specified. As it is common, some methods work best in one application while

others outperform them in different settings. Second, the kernel matrix can only be

searched within the space defined by these pre-specified functions. If the kernels and

their parameters are not appropriately specified, the learned kernel matrix will not

perform well in classification.

In this chapter, we derive an approach that overcomes these difficulties. Our ap-

proach borrows ideas from Genetic Algorithms (GA) to modify a large set (population)

of randomly initialized kernel matrices to optimize the metric induced by the kernel

mapping without the need to know the underlying kernel function. By doing so, we

also avoid the need to combine or optimize over several possible (or known) kernel

matrices.

Key to our approach is the definition of several novel operators in GA. The two

classical operators used in the literature are crossover and mutation. The former,

combines two or more individuals of the current population to generate an individual

of the next generation (called offspring). The second operator in GA adds random

mutations to existing individuals. These two procedures are however not sufficient to

efficiently search vast spaces [68], such as the one defining all possible kernel matrices.

In the present work, we derive three additional GA operators to facilitate this search.

One of the new operators emulates gene transposition. Consider the genome of a

species. Transposons are chunks of DNA that can move from one part of this genome

to another. This process was first described by Nobel Laureate Barbara McClintock

[64], when she noticed that the color changing pattern seen in corn is not random.

This effect was originally referred to as jumping genes. A typical gene transposition

is given by the cut-and-paste transposon. Here, enzymes cut a section of the DNA

and then insert it elsewhere. In our case, each genome describes a kernel matrix. A

cut-and-paste transposon will move a section of the matrix to another. As a result,

the classification function seen in one area of the feature space will now be applied

to another section of the space. If this results in a lower classification error rate, the

new matrix will be preferred over the old ones.

Another typical operator is insertion [81]. A typical case is that of viruses.

Lacking a reproductive system, viruses need to insert their genome into that of the

invaded cell for replication. By doing so, gene coding and non-coding sections of the

host genome can vary. In our case, the insertion of a new section in the matrix could

resolve misclassification in a localized section of the feature space.

Our third operator is deletion. In living organisms, sections of the genome may

be deleted during meiosis [81]. In our case, deletion of a section of the matrix could

rearrange the classifiers (i.e., norm defined by the kernel) in a positive way.

The GA operators defined above facilitate the search through a vast domain, thus

addressing the problem of multiple kernel learning listed above. After the matrices

of the current population have been modified to create the offsprings, we eliminate

those yielding the worst sample classification accuracies. The process is iterated until

convergence.

A problem with approaches that directly learn the kernel matrix (with no known

associated kernel function) is that they lack the capacity to map the test samples

to the kernel space. A common solution is to employ transductive learning [33].

Here, the testing data is used in combination with the training samples to resolve

the problem. Each time the testing data changes, the algorithm will compute a new

kernel matrix which can be used for both the training and testing sets. This approach

is computationally expensive and is not guaranteed to provide good results on the

test data because the kernel matrix has not been optimized for them.

To resolve these problems, we derive a regression-based method which estimates

the kernel values encoding the similarity between the training and testing samples

allowing us to map any new test sample. This eliminates the need of having to

relearn the kernel matrix each time a new test sample is to be classified. Our solution

is equivalent to estimating the underlying function represented by the learned kernel

matrix. We show that this approach yields superior results to transductive learning

since it directly represents the learned function rather than the training sample alone.

The rest of the chapter is organized as follows. Section 5.2 introduces the nuts

and bolts of the proposed genetic algorithm search. Section 5.3 derives the non-

linear regression learning of the underlying function defined by the kernel solution

for its application in classification. Section 5.4 does the same for regression. Section

5.5 provides comparative results with state-of-the-art algorithms. Conclusions are

presented in Section 5.6.

5.2 Learning with Genetic Algorithms

We start with a collection of p kernel matrices generated at random, {K1 , . . . , Kp }.

The current population is iteratively modified using a genetic-based algorithm until

convergence.

5.2.1 Feature representation

Genetic Algorithms (GA) constitute a set of tools that are well suited for solving

mathematical optimization problems in large spaces where there are multiple local

minima and no clear indication of how to find them [36]. This is especially practical

when the search space is so vast that, despite computational improvements, one would

require years (and potentially centuries) to solve the problems if a reasonable area of

the search space were to be explored.

In GA, we start with a set of genomes, each representing an individual. This

set of individuals is called the population. The first key step in GA is to define an

appropriate coding of the problem data as a genome. The most typical coding is

a feature vector with each element defining one of the parameters (or features or

variables) that play a role in our optimization problem. In this representation, each

entry in the feature vector codes for a directly relevant variable in the optimization

problem, Fig. 5.1(a).

In contrast to the classical coding approach described in the preceding paragraph,

we include non-coding segments in the feature vector (i.e., genome). As any biological

systems, the coding and non-coding segments alternate one another, Fig. 5.1(b). The

coding segments will be referred as genes (because they code for the kernel matrix

K which is our end result or outcome). This emulates the coding seen in actual


Figure 5.1: (a) The classical feature representation. Each entry in the feature vector codes
for a relevant variable in the optimization problem. (b) The proposed feature representation.
Each individual in the population is represented as a feature vector with coding and non-
coding segments. The lower case letters represent the coding (or gene) sequence used for
the calculation of the fitness function. Consecutive N labels indicate non-coding DNA.

biological systems. The elements defining the gene sequences are obtained from the

elements of L, with K = L^T L, where K is a kernel matrix, whereas the values of

the non-coding DNA sequences are generated at random. Each gene is preceded by a

fixed sequence (or gene marker). This specifies where each of the genes starts in the

genome. This is the typical approach used by cells in biology.

To reconstruct a kernel matrix from an individual (genome), we work as follows.

First we identify the positions of the gene markers, indicating where each coding DNA

sequence starts. Since the genes are of a specified length, they can be easily read,

concatenated and reshaped back to L. The kernel matrix K is then given by L^T L.
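A sketch of this decoding step is given below (Python; the marker pattern, the gene length and the assumption that L is reshaped to an n × n matrix are illustrative choices, not fixed by the method):

import numpy as np

def decode_genome(genome, marker, gene_len, n):
    # genome: 1-D array with alternating non-coding and coding segments,
    # each gene preceded by the fixed marker sequence.
    m = len(marker)
    genes, i = [], 0
    while i + m + gene_len <= len(genome):
        if np.array_equal(genome[i:i + m], marker):
            genes.append(genome[i + m:i + m + gene_len])   # read one gene
            i += m + gene_len
        else:
            i += 1                                          # skip non-coding DNA
    L = np.concatenate(genes)[:n * n].reshape(n, n)         # reshape back to L
    return L.T @ L                                          # K = L^T L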

The genome representation defined in this section will allow us to derive novel

operators, such as transposition, deletion and insertion. This is so because we can now

make use of the non-coding sections of the genome to address some of the limitations

of earlier operators. We discuss this in the sections to follow.

5.2.2 Basic operators

Most GA use two major operators: crossover and mutation. In crossover, two individuals, ui^[t] and uj^[t], of the current population (i.e., two kernel matrices in our case) are selected at random. Here, ui^[t] = (ui1, . . . , uiq)^T ∈ R^q, t specifies the iteration or population cycle, and i, j ∈ {1, . . . , p}, with p the number of individuals in the population. An integer r ∈ [1, q] is selected at random. Two offspring ui^[t+1] and uj^[t+1] (i.e., two individuals of the new generation) are obtained as

ui^[t+1] = (ui1, . . . , uir, uj,r+1, . . . , ujq)^T
uj^[t+1] = (uj1, . . . , ujr, ui,r+1, . . . , uiq)^T.   (5.1)

By combining two existing (good) solutions, we construct alternative kernel matrices

from a distant area of the search space, which could yield even higher classification

rates. While one of the matrices (say, ui ) helps classify samples in a region of the

feature space, the other matrix could be instrumental in the classification of the

samples in the rest of the feature space.

The mutation procedure is meant to add random jumps within the search space

which are unlikely to occur with crossover or gradient descent techniques. Some

mutations will add small changes, with the aim to jump over a local minimum. Other

mutations will add large changes, moving the search to a completely different region

of the search space. The mutation operation works as follows. An individual from the

Figure 5.2: This figure illustrates the copy-and-paste transposition.

current population is selected at random, uk^[t]. A number s of its entries are randomly selected, with s = q · pm, where pm is the mutation rate. Each of these entries uk^[t](li) is replaced by a random number bi as follows,

uk^[t](li) = bi,  li ∈ M,  i = 1, ..., s,   (5.2)

where M is the set containing the indices of the s selected entries. The mutation value used in the above equation is bounded by the minimum and maximum of all the entries of uk^[t].
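A sketch of these two basic operators on the vector representation (Python with numpy; the function names are ours) is:

import numpy as np

rng = np.random.default_rng()

def crossover(ui, uj):
    # Single-point crossover of Eq. (5.1).
    r = rng.integers(1, len(ui))
    return (np.concatenate([ui[:r], uj[r:]]),
            np.concatenate([uj[:r], ui[r:]]))

def mutate(u, p_m):
    # Mutation of Eq. (5.2): replace s = q * p_m random entries by random
    # values bounded by the minimum and maximum entries of u.
    u = u.copy()
    s = max(1, int(len(u) * p_m))
    idx = rng.choice(len(u), size=s, replace=False)
    u[idx] = rng.uniform(u.min(), u.max(), size=s)
    return u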

5.2.3 Transposition

While crossover and mutation are typically used in GA, nature makes use of a

large variety of tools to modify individuals in a population [81]. Here, we present

mathematical models of three of these: transposition, deletion and insertion.

As summarized earlier, transposition refers to chunks of DNA that move from

one location to another within the genome. In our search space, transposition would

apply a local norm (or classifier) to a different region of the feature space. A norm

that does not work well in one area of the space, may be what is needed in another.

We model two major transposition mechanisms. The first one is called copy-and-

paste. Here, a short sequence of DNA is copied to RNA by transcription, and then

copied back into (inserted as) DNA by reverse transcription at a new position. This is

illustrated in Figure 5.2. Due to transcription noise, the copied sequence may diverge

slightly from its former self. To model this, let v be a transposon, v = (v1, ..., vLt)^T, where Lt is its length. And, assume each entry of v is perturbed by a small Gaussian noise with a probability pv, i.e.,

v'i = vi + s zi,  i ∈ P,

where v'i is the entry after perturbation, s is the scale of the Gaussian noise, zi ∼ N(0, 1), and P is the set containing the indices of the perturbed entries. Suppose a genome u is selected and the insertion position is t; after copy-and-paste this becomes

u' = (u1, ..., ut, v'1, ..., v'Lt, ut+1, ..., uq)^T.   (5.3)

The second transposition mechanism we will model is called cut-and-paste. In

this case, a sequence of DNA is cut from its original position and inserted into a new

position of the same genome, Figure 5.3. Since this process does not involve an RNA

intermediate, it is not affected by noise. Formally, denote the cut position t' (with t' < t). Using the same notation as above, we define the new individual u' as

u' = (u1, ..., ut'-1, ut'+Lt, ..., ut, v1, ..., vLt, ut+1, ..., uq)^T.   (5.4)

The two transposition procedures described above work as follows. First, individ-

uals are selected at random at a transposition rate pt . A transposition location is

selected from a random location in the genome and used in either copy-and-paste or

cut-and-paste (at 50% each). Finally the transposon is inserted into a randomly cho-

sen position. Note that in the copy-and-paste mechanism, the length of the genome

Figure 5.3: This figure illustrates the cut-and-paste transposition.

is increased. This would not be admissible if we were using the classical feature repre-

sentation, but is not an issue when we employ the coding-non-coding model defined

in the preceding section.
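Both mechanisms can be sketched on the vector representation as follows (Python; Lt, the noise scale s and the perturbation probability p_v are as described above, and the positions are drawn at random):

import numpy as np

rng = np.random.default_rng()

def copy_and_paste(u, Lt, s, p_v):
    # Copy a transposon of length Lt, perturb it with Gaussian noise
    # (Eq. 5.3), and insert the copy at a new random position.
    start = rng.integers(0, len(u) - Lt)
    v = u[start:start + Lt].copy()
    perturb = rng.random(Lt) < p_v
    v[perturb] += s * rng.standard_normal(perturb.sum())
    t = rng.integers(0, len(u))
    return np.concatenate([u[:t + 1], v, u[t + 1:]])        # genome grows

def cut_and_paste(u, Lt):
    # Cut a transposon of length Lt and re-insert it elsewhere (Eq. 5.4).
    start = rng.integers(0, len(u) - Lt)
    v = u[start:start + Lt].copy()
    rest = np.concatenate([u[:start], u[start + Lt:]])
    t = rng.integers(0, len(rest))
    return np.concatenate([rest[:t + 1], v, rest[t + 1:]])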

5.2.4 Deletion and insertion

Next, we propose a mathematical model of deletions. In genetics, a deletion is a

type of genetic aberration in which a sequence of DNA of a genome is missing. Any

number of nucleotides can be deleted, from a single base pair to an entire piece of a

genome. In nature, deletion is generally harmful, but, in some occasions, can lead to

advantageous variations.

To model this process, we work as follows. An individual u is selected with prob-

ability (or, deletion rate) pd . A DNA segment v of u, starting at a random position

t, is chosen for deletion. More formally, denote u = (u1, ..., uq)^T, v = (v1, ..., vLd)^T, and let u' be the genome after deletion. Then,

u' = (u1, ..., ut-1, ut+Ld, ..., uq)^T.   (5.5)

The deletion length Ld is a random variable and is modeled as Ld ∼ N(μLd, σ²Ld), where μLd and σ²Ld are the mean and variance of Ld, respectively.


Figure 5.4: This figure illustrates gene deletion operation for two cases. (a) Only a non-
coding sequence is deleted. (b) A part of gene is deleted and a new gene is formed.

Note that the length of the genome is hence decreased. Since the position of

deletion is chosen at random, it is possible that only a sequence of non-coding DNA

is deleted, Figure 5.4(a). It is also possible to delete a coding part. In this latter case,

the non-coding DNA right after the deleted segment becomes the coding segment,

Figure 5.4(b).

Deletion can eliminate a local norm (or classifier) that was causing problems and

substitute this for a randomly initialized alternative that can be improved with the

other optimization tools. This procedure can be especially useful for leaving large

(close-to) flat areas of the optimization function.

Our final operator models insertions. In genetics, an insertion is a type of genetic

aberration in which a DNA sequence is inserted into a genome. A common cause

of insertions is viral infections, where viruses integrate their genome into that of the

host cell. The effect of insertion depends greatly on the location within the host's

genome.

To model insertions, we define a population of viruses Q = {q1 , ..., qr }, where qi

is an Lv × 1 vector, and r is the size of the population. The virus population is allowed

to evolve with the mutation operator from generation to generation.

Genomes u are selected at insertion rate pi . A virus qj is selected at random from

Q. A position t in u is randomly chosen and qj is inserted to u at t. The resulting

individual (after insertion) is given by

u' = (u1, ..., ut, qj1, ..., qjLv, ut+1, ..., uq)^T.   (5.6)
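Both aberrations can be sketched in the same vector representation (Python; the deletion-length statistics and the virus population Q are as defined above):

import numpy as np

rng = np.random.default_rng()

def deletion(u, mean_Ld, std_Ld):
    # Delete a random segment whose length follows N(mean_Ld, std_Ld^2),
    # as in Eq. (5.5); the genome shrinks.
    Ld = max(1, int(round(rng.normal(mean_Ld, std_Ld))))
    Ld = min(Ld, len(u) - 1)
    t = rng.integers(0, len(u) - Ld)
    return np.concatenate([u[:t], u[t + Ld:]])

def insertion(u, Q):
    # Insert a randomly chosen virus from the population Q at a random
    # position, as in Eq. (5.6); the genome grows.
    q_j = Q[rng.integers(0, len(Q))]
    t = rng.integers(0, len(u))
    return np.concatenate([u[:t + 1], q_j, u[t + 1:]])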

5.2.5 Selection criterion

The operators described above are used to generate d > p individuals. The number

of offspring d is usually twice p or larger. A selection criterion is then employed

to determine the best fitted p individuals that will survive and thus become the

member of the population at time t + 1.

The process starts with a population of p individuals generated at random and

from pre-specified kernel functions. This process combines the characteristics of differ-

ent kernel functions and introduces much needed randomness to the initial population.
[0]
The initial population set is formally defined as {K1 , . . . , K[0]
p }.

A selection criterion is then used to determine the most fitted individuals that

are to survive to the next iteration. Since our goal is classification accuracy, we

employ the Bayes accuracy criterion of [114], which is one minus the Bayes error

as calculated in the kernel space.

More formally, let X = {x11, . . . , x1n1, . . . , xCnC} be a given training set, where xij is the j-th sample in class i, ni is the number of samples in class i, and C is the total number of classes. Let φ(·) : R^l → F be a function defining the kernel map, where l is the dimension of the input space. We assume that the data has been whitened in the kernel space, and denote K as the whitened kernel matrix for the training samples, i.e., K = Φ(X)^T Φ(X), where Φ(X) = (φ(x11), . . . , φ(xini), . . . , φ(xCnC)). Then, the kernel Bayes accuracy (KBA) criterion is given by

J(Φ) = Σ_{i=1}^{C-1} Σ_{j=i+1}^{C} pi pj w(Δij) tr(Sij),   (5.7)

where pi is the prior of class i, Δij is the Mahalanobis distance in the kernel space, defined as Δij = √(1i^T Kii 1i - 2 · 1i^T Kij 1j + 1j^T Kjj 1j), Kij = Φ(Xi)^T Φ(Xj) is the subset of the kernel matrix for the samples in classes i and j, Φ(Xi) = (φ(xi1), . . . , φ(xini)), 1i is an ni × 1 vector with all elements equal to 1/ni, w(·) is a weighting function, with w(Δij) = (1/(2Δij²)) erf(Δij/(2√2)), where erf(x) = (2/√π) ∫_0^x e^(-t²) dt is the error function, and Sij is the kernelized between-class scatter matrix, with Sij = (Ki 1i - Kj 1j)(Ki 1i - Kj 1j)^T, and Ki = Φ(X)^T Φ(Xi) the subset of the kernel matrix for the samples in class i.

Optimizing (5.7) yields a kernel matrix $K^*$ corresponding to a kernel representation where the Bayes error is minimized. This is given by

$$K^* = \arg\max_{K} J(K). \qquad (5.8)$$
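The following is an illustrative evaluation of the KBA fitness in (5.7) from a whitened kernel matrix and the training labels; it is a sketch under the definitions above, not the dissertation's implementation:

```python
import numpy as np
from math import erf, sqrt

def kba_fitness(K, labels):
    """Illustrative evaluation of the kernel Bayes accuracy criterion, Eq. (5.7),
    for a whitened kernel matrix K and the training labels."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n = len(labels)
    J = 0.0
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            idx_i = np.where(labels == classes[a])[0]
            idx_j = np.where(labels == classes[b])[0]
            p_i, p_j = len(idx_i) / n, len(idx_j) / n
            one_i = np.full(len(idx_i), 1.0 / len(idx_i))   # vector 1_i
            one_j = np.full(len(idx_j), 1.0 / len(idx_j))   # vector 1_j
            # Mahalanobis distance Delta_ij between the two class means in the kernel space.
            d2 = (one_i @ K[np.ix_(idx_i, idx_i)] @ one_i
                  - 2.0 * one_i @ K[np.ix_(idx_i, idx_j)] @ one_j
                  + one_j @ K[np.ix_(idx_j, idx_j)] @ one_j)
            delta = sqrt(max(d2, 1e-12))
            w = erf(delta / (2.0 * sqrt(2.0))) / (2.0 * delta ** 2)   # weighting w(Delta_ij)
            # tr(S_ij) with S_ij = v v^T equals ||v||^2, where v = K_i 1_i - K_j 1_j.
            v = K[:, idx_i] @ one_i - K[:, idx_j] @ one_j
            J += p_i * p_j * w * float(v @ v)
    return J
```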

The kernel Bayes accuracy criterion defined in (5.7) is used to evaluate the fitness of these d offspring, i.e., $g_i = J(K_i)$, $i = 1, \ldots, d$, where $g_i$ is the fitness value of the ith genome. Then, the individuals that will form the new population are selected as

follows. First, an elitist selection strategy is applied. This means that the $p_f$ best fitted individuals are kept. Another set of $p_n$ individuals is randomly selected from the bottom 10%, i.e., the least fitted individuals. The values of $p_f$ and $p_n$ are selected to be approximately 5% of p. The first group is used to guarantee fast convergence. The second group is used to maintain diversity in the population, which may help us jump away from local minima in the future. The rest of the individuals, $p - p_f - p_n$, are

selected at random using a roulette wheel rule [36]. In the roulette wheel rule, the

probability of selecting the ith individual is given by

$$p_i = \frac{g_i}{\sum_{i=1}^{d - p_f - p_n} g_i}, \qquad (5.9)$$

where pi is the probability with which the ith individual is selected.
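A sketch of this survivor-selection step (elitism, a few least fitted individuals for diversity, and a roulette wheel over the rest); pf_count and pn_count are illustrative integer counts corresponding to $p_f$ and $p_n$, and positive fitness values (as with the KBA criterion) are assumed:

```python
import numpy as np

def select_survivors(fitness, p, pf_count, pn_count, rng=np.random.default_rng()):
    """Illustrative survivor selection: elitism, a few low-fitness individuals for
    diversity, and a roulette wheel (Eq. 5.9) over the remaining offspring."""
    fitness = np.asarray(fitness, dtype=float)       # assumes positive fitness values
    order = np.argsort(fitness)[::-1]                # best fitted first
    elite = list(order[:pf_count])                   # elitist selection (p_f best)
    bottom = order[-max(pn_count, len(fitness) // 10):]
    diverse = list(rng.choice(bottom, size=pn_count, replace=False))
    rest = np.setdiff1d(order, elite + diverse)      # candidates for the roulette wheel
    probs = fitness[rest] / fitness[rest].sum()
    wheel = list(rng.choice(rest, size=p - pf_count - pn_count, replace=False, p=probs))
    return elite + diverse + wheel                   # assumes d is comfortably larger than p
```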

The procedure described in this section is iterated until convergence. Convergence is given by

$$|g_m^{[t+1]} - g_m^{[t]}| < \epsilon, \qquad (5.10)$$

where $g_m^{[t]}$ is the maximum fitness value of the population at iteration t, and $\epsilon > 0$ is small.

To avoid problems with random initialization, we run the proposed approach

multiple times with different initial populations, then keep the solution with the

best fitted individual of the final populations. The proposed kernel matrix learning

algorithm is summarized in Algorithm 5.1.

5.3 Generalizing to Test Samples in Classification

Once we learn the optimal kernel matrix of the training data from Algorithm 5.1,

we can use it in any kernel-based approaches such as KDA [67, 5], Kernel Subclass

DA (KSDA) [113] and SVM [92]. However, the only information we have is the

kernel matrix for the training data and we do not know the corresponding explicit

Algorithm 5.1 Kernel Matrix Learning with GA
Input: Training set x_1, ..., x_n
Output: Kernel matrix K*
for i = 1 to a do
   Generate an initial population K_1, ..., K_p.
   repeat
      1. Generate new individuals with the operators in (5.1) to (5.6).
      2. Calculate the fitness value g_i of each new individual using (5.7).
      3. Select the survivors using (5.9).
   until |g_m^{[t+1]} - g_m^{[t]}| < $\epsilon$
   Output the most fitted individual, K(i).
end for
K* = arg max_i J(K(i))
Return: K*

kernel function to construct the kernel values which measure the similarity between

the training and testing samples.

This is a general problem in kernel matrix learning. A common solution is to

cast the classification problem as a transductive one. Given the labeled training and

unlabeled test samples, one generates a common kernel matrix including the two sets.

The kernel matrix is learned using an available approach, such as the one defined

in this chapter. This means we need to relearn the kernel matrix each time a new

testing sample becomes available. One could say that the learned mapping does not

generalize to new samples.

In the present section, we propose a novel solution to the above defined problem.

The idea is to estimate the underlying function represented by the learned kernel

matrix using regression. Formally, given X, i.e., a set of n training samples with known predictor vectors $y_i = (\langle x_1, x_i \rangle, \ldots, \langle x_n, x_i \rangle)^T \in \mathbb{R}^n$, where $\langle x_i, x_j \rangle$ is the (i, j)th entry in the learned kernel matrix, we want to find the function $f(\cdot)$ providing the best estimate of the true (but unknown) underlying function, where $f(x) = (f_1(x), \ldots, f_n(x))^T$, and $f_i(\cdot): \mathbb{R}^l \to \mathbb{R}$ is the ith regression function.

Let $f_i(x) = k_{x_i}(x) = \langle x_i, x \rangle$. To learn this underlying function, we need to use a non-linear approach. Kernel Ridge Regression (KRR) [42] provides the necessary flexibility and computational efficiency for this task. KRR minimizes the cost function

$$L(W) = \frac{1}{n} \sum_{i=1}^{n} \| y_i - W^T \phi(x_i) \|_2^2 + \lambda \| W \|_F^2, \qquad (5.11)$$

where $\phi(\cdot)$ is a function defining the kernel mapping, W is a projection matrix in the kernel space, $\lambda$ is a regularization parameter, $\| \cdot \|_2$ denotes the Euclidean norm of a vector and $\| \cdot \|_F$ is the Frobenius norm of a matrix.

The solution of the regressed function is given by

$$f(x) = Y (G + \lambda I_n)^{-1} g(x), \qquad (5.12)$$

where $Y = (y_1, \ldots, y_n)$ is an $n \times n$ predictor matrix, G is the Gram matrix with its (i, j)th entry defined as $G_{ij} = g(x_i, x_j)$ for some known kernel function g, $I_n$ is the $n \times n$ identity matrix, and $g(x) = (g(x_1, x), \ldots, g(x_n, x))^T$.

When a test sample z is to be classified, the corresponding prediction vector containing all the kernel values can be easily computed as

$$f(z) = (f_1(z), \ldots, f_n(z))^T, \qquad (5.13)$$

and can thus be readily used in any kernel-based approach.
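A minimal sketch of this regression-based extension, implementing (5.12)-(5.13) with an RBF kernel g (function names and parameter values are illustrative assumptions):

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF Gram matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_kernel_extension(X_train, K_learned, lam=1e-2, gamma=0.5):
    """Return a function z -> (k(x_1, z), ..., k(x_n, z)) estimated by KRR, Eqs. (5.12)-(5.13)."""
    G = rbf(X_train, X_train, gamma)                 # Gram matrix of the known kernel g
    n = G.shape[0]
    Y = K_learned                                    # predictor matrix: columns of the learned kernel
    A = Y @ np.linalg.inv(G + lam * np.eye(n))       # Y (G + lambda I_n)^(-1)
    return lambda z: A @ rbf(X_train, z.reshape(1, -1), gamma).ravel()

# Usage: f = fit_kernel_extension(X_train, K_star); f(z) then gives the estimated kernel
# values between a new test sample z and all training samples, which can be fed to KDA or SVM.
```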

5.4 Kernel Matrix Learning in Regression

Our kernel matrix learning approach is generic, since the learned kernel matrix can be plugged into any kernel-based method in settings such as classification, regression and clustering, provided that an appropriate fitness selection criterion for the GA is given. In this section, we extend our kernel learning framework to the regression problem to further demonstrate its utility.

For illustration we employ KRR [42], since it is commonly used in many applications. There are two types of parameters in KRR to be learned: the kernel matrix K and the regularization parameter $\lambda$. We use the proposed GA-based approach defined in the present work to jointly learn K and $\lambda$. The generalized cross-validation (GCV) [96] is extended to serve as the selection criterion in our GA.

GCV is used for selecting $\lambda$ in ridge regression, and can be formally written as

$$\mathrm{GCV}(\lambda) = \frac{n \| (I_n - H(\lambda)) y \|_2^2}{\left( \mathrm{tr}(I_n - H(\lambda)) \right)^2},$$

where $H(\lambda)$ is the hat matrix which projects the label y to the corresponding predicted label $\hat{y}$, i.e.,

$$\hat{y} = H(\lambda)\, y.$$

In KRR, the predicted labels $\hat{y}$ for the training data can be obtained by

$$\hat{y} = K (K + \lambda I_n)^{-1} y = H(K, \lambda)\, y.$$

We can thus optimize both K and $\lambda$ by minimizing

$$\mathrm{GCV}(K, \lambda) = \frac{n \| (I_n - H(K, \lambda)) y \|_2^2}{\left( \mathrm{tr}(I_n - H(K, \lambda)) \right)^2}. \qquad (5.14)$$

In order to jointly learn both K and $\lambda$, the value of $\lambda$ is added at the end of each genome (i.e., as a new allele). In this way, the GA operations and selection do not need to be modified.
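A sketch of the GCV fitness (5.14) for a candidate pair (K, $\lambda$), written with plain NumPy (illustrative only):

```python
import numpy as np

def gcv_fitness(K, y, lam):
    """Generalized cross-validation score of Eq. (5.14) for kernel ridge regression."""
    n = len(y)
    H = K @ np.linalg.inv(K + lam * np.eye(n))       # hat matrix H(K, lambda)
    resid = (np.eye(n) - H) @ y
    return n * float(resid @ resid) / (np.trace(np.eye(n) - H) ** 2)

# Within the GA, a smaller GCV value corresponds to a better fitted individual, so the
# fitness used for selection can be taken, for example, as -gcv_fitness(K, y, lam).
```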


Figure 5.5: (a) An XOR data classification problem. Samples shown as red triangles form one class and samples shown as blue circles form the other class. (b) This plot shows the classification accuracy over the number of generations.

5.5 Experimental Results

5.5.1 A toy example

We first present a toy example to illustrate how the kernel matrix evolves and im-

proves during the generations using our genetic-based algorithm. We consider a XOR

data classification problem, Fig. 5.5(a). The data set contains two classes, and each

class distribution is represented by a mixture of two Gaussians. An independent test-

ing set from the same class distributions is generated to test the proposed approach.

Fig. 5.5(b) demonstrates the classification accuracy as the number of generations

increases. We see that the classification accuracy gradually improves during the

generations and the algorithm converges in a few iterations.

We then illustrate how the learned kernel matrix evolves in Figure 5.6(a)-(f). In

the beginning, we observe that the sample similarity varies considerably within each class,

Figure 5.6: In this figure we show how the kernel matrix evolves. (a)-(f) illustrate the kernel matrix at the 1st, 2nd, 4th, 8th, 20th, and 30th generations, respectively.

which is due to the fact that the distance between the two clusters of the same class

is much larger than that within the same cluster. This means that the Euclidean

distance measure in the original space cannot capture the underlying sample similarity

within each class well. A good kernel matrix should indicate that the within-class

similarity is much larger than the between-class similarity. We can see in Figure 5.6

that using our algorithm, the within-class similarity gradually increases as the kernel

matrix evolves. This implies that our learned kernel matrix could induce a kernel

space where samples in the same class are as close as possible whereas samples in different classes are as far apart from each other as possible, leading to a much easier

classification problem.

To further evaluate how the kernel matrix is optimized during the generations, we adopt the kernel alignment [18] to measure how close a learned kernel matrix is to an ideal kernel matrix $K_0$, with $K_0(x_i, x_j) = 1$ if $y_i = y_j$, and 0 otherwise, where $y_i$ is the class label of $x_i$. The kernel alignment between K and $K_0$ is defined as

$$A(K, K_0) = \frac{\langle K, K_0 \rangle_F}{\sqrt{\langle K, K \rangle_F \langle K_0, K_0 \rangle_F}},$$

where $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product between two matrices, defined as $\langle K_1, K_2 \rangle_F = \sum_i \sum_j K_1(x_i, x_j) K_2(x_i, x_j)$. The higher the kernel alignment is, the more similar the

two kernel matrices are. The kernel alignment between the learned kernel matrix and

the ideal one is shown in Figure 5.7. We see that the learned kernel matrix gets closer

to the ideal kernel matrix as we have more generations.
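A one-function sketch of this alignment measure (illustrative; the ideal kernel $K_0$ is built directly from the label vector):

```python
import numpy as np

def kernel_alignment(K, labels):
    """Alignment A(K, K0) with the ideal kernel K0(x_i, x_j) = 1 if y_i = y_j and 0 otherwise."""
    labels = np.asarray(labels)
    K0 = (labels[:, None] == labels[None, :]).astype(float)
    num = np.sum(K * K0)                             # <K, K0>_F
    den = np.sqrt(np.sum(K * K) * np.sum(K0 * K0))   # sqrt(<K, K>_F <K0, K0>_F)
    return num / den
```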


Figure 5.7: This plot shows the kernel alignment between the learned kernel matrix and
the ideal one over the generations.

5.5.2 Classification algorithms

We employed the derived approach to learn the kernel mapping of two popular

algorithms KDA and SVM. We present comparative evaluations using a variety of

data-sets.

In KDA, the results are compared to kernel selection with CV, the Fisher criterion

of [108] and KBA [114]. These are denoted KDACV , KDAF , KDAK , respectively. The

nearest mean classifier is used in each of the corresponding subspaces. We choose this

classifier because it is Bayes optimal if the data in the kernel space is linearly sepa-

rable. We provide comparative results using transductive learning, denoted KDAT .

We also compare the proposed optimization approach to the traditional GA with

crossover and mutation operators only, denoted KDAT R .

In SVM, our results are compared to those obtained with CV, transductive learn-

ing, and traditional GA, SVMCV , SVMT and SVMT R . We also provide a comparison

with the multiple kernel learning algorithm of [4]. This algorithm applies sequen-

tial minimal optimization techniques (required in large-scale implementations) to a

smoothed version of a convex Moreau-Yosida optimization problem. We denote this

algorithm Support Kernel Machine (SKM). We also use this learned kernel matrix in

KDA and denote it KDAS . As a baseline, we also compare to the algorithm where

the kernel matrix is constructed from a uniform combination of different kernels. We

denote these algorithms KDAU and SVMU . For all the algorithms using a single

kernel function, the RBF kernel is used. For those algorithms where the parameters

are selected by CV, a 5-fold CV is conducted.

In order to demonstrate the effectiveness of the proposed regression-based gen-

eralization approach of Section 5.3, we compare it with a recently proposed semi-

supervised kernel matrix learning approach called kernel propagation (KP) [45]. In

this approach, the full kernel matrix is constructed from a seed-kernel matrix by max-

imizing the smoothness of the mapping over the data graph. The parameter of the

heat kernel used in calculating the affinity matrix is set as the averaged Euclidean dis-

tance from each data point to its ten nearest neighbors [45]. We denote this method

as KDAKP and SVMKP , respectively.

The initial population includes 30 individuals. We use random initialization and

a variety of commonly used kernels: RBF, polynomial, sigmoidal and Laplacian. A

typical range is given for the parameters of each kernel. The parameter for the RBF kernel is in $[m_1 - 2t_1, m_1 + 2t_1]$, where $m_1$ and $t_1$ are the mean and standard deviation of the pairwise sample squared distances; the parameter for the Laplacian kernel is in

Table 5.1: The parameters used in the experiments
Parameter          Value   Description
p_c                0.8     crossover rate
p_m                0.05    mutation rate
p_f                0.05    percentage of the best fitted individuals kept
p_n                0.03    percentage of the least fitted individuals kept
L_c                4       length of each gene
L_nc               10      length of each non-coding sequence
L_t                3       length of the transposon
p_t                0.02    transposition rate
s                  0.01    scale of the Gaussian noise in transposition
p_v                0.01    perturbation rate for each entry in the transposon
p_d                0.01    deletion rate
mu_{L_d}           3       mean of the deletion length
sigma^2_{L_d}      4       variance of the deletion length
p_i                0.1     insertion rate
L_v                5       length of each virus
r                  6       size of the virus population

$[m_2 - 2t_2, m_2 + 2t_2]$, where $m_2$ and $t_2$ are the mean and standard deviation of the pairwise sample distances; in the polynomial kernel, the degree is in [1, 5]; in the sigmoidal kernel, $k(x, y) = \tanh(a x^T y + r)$, $a = 1/p$, where p is the dimensionality of the data, and r is in [0, 1]. All kernels are aligned, i.e.,

$$\langle x_i, x_j \rangle' = \frac{\langle x_i, x_j \rangle}{\sqrt{\langle x_i, x_i \rangle \langle x_j, x_j \rangle}}. \qquad (5.15)$$
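As an illustration, a sketch of the normalization in (5.15) applied to a Gram matrix, together with the RBF parameter range used to seed the population (assumed helper names, plain NumPy):

```python
import numpy as np

def normalize_kernel(K):
    """Normalize a Gram matrix as in Eq. (5.15)."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def rbf_param_range(X):
    """Range [m1 - 2 t1, m1 + 2 t1] over the pairwise squared distances of the samples in X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    vals = d2[np.triu_indices_from(d2, k=1)]
    m1, t1 = vals.mean(), vals.std()
    return max(m1 - 2 * t1, 1e-12), m1 + 2 * t1      # keep the lower bound positive
```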

The parameter setup in our experiments is shown in Table 5.1. KRR is used to

train the embedding function, and the RBF kernel is used in KRR.

Table 5.2: KDA Recognition rates (in percentages) in the UCI data-sets.

Data-set KDAGA KDAT R KDAKP KDAT KDAU KDAS KDAK KDAF KDACV
Breast C. 76.4(2.9) 72.3(2.4) 70.4(3.9) 69.1(4.1) 65.8(3.7) 62.6(4.6) 68.4(3.3) 64.4(2.1) 66.6(5.2)
Ionosphere 93.4(1.3) 92.3(2.4) 87.2(2.3) 85.1(2.4) 94.6(2.0) 74.6(5.9) 80.6(4.1) 80.6(4.1) 86.6(1.2)
Liver 80.6(3.8) 76.8(3.1) 66.0(6.1) 66.4(5.8) 66.4(5.7) 69.9(7.1) 65.5(1.6) 65.8(4.4) 73.3(5.4)
Monk 1 94.0(3.7) 90.0(4.1) 86.2(4.0) 77.3(8.0) 84.7(4.5) 82.0(5.1) 85.3(5.1) 86.7(4.7) 84.0(6.0)
Monk 2 96.0(5.5) 95.3(5.1) 90.6(2.4) 92.7(4.4) 94.0(4.4) 93.3(5.3) 93.3(5.3) 90.7(4.4) 90.3(5.3)
Pima 78.4(1.6) 76.4(1.4) 70.6(4.9) 69.5(5.0) 71.6(3.5) 70.7(4.3) 71.2(3.5) 70.4(3.5) 72.5(2.5)

Table 5.3: SVM Recognition rates (%) in the UCI data-sets.

Data-set SVMGA SVMT R SVMKP SVMT SVMU SKM SVMK SVMCV


Breast C. 78.2(2.9) 74.2(3.7) 70.2(3.1) 66.5(8.8) 67.3(4.3) 62.2(5.1) 69.5(2.7) 70.2(2.1)
Ionosphere 96.9(1.2) 95.7(0.1) 92.6(3.0) 94.6(2.7) 94.3(2.0) 94.0(1.2) 93.4(2.1) 92.8(2.5)
Liver 80.6(1.9) 76.5(3.5) 68.9(5.6) 60.3(7.0) 71.3(6.8) 72.8(5.6) 71.6(5.5) 74.5(4.9)
Monk 1 95.3(3.8) 89.3(4.3) 80.5(6.8) 81.3(5.1) 86.0(2.8) 88.7(5.6) 89.3(4.3) 84.7(6.5)
Monk 2 97.3(2.8) 95.3(5.1) 93.0(4.0) 93.3(4.1) 93.3(5.8) 88.7(6.9) 94.0(4.9) 93.3(6.2)
Pima 79.2(1.5) 77.0(1.6) 69.2(4.0) 69.2(4.3) 73.2(2.3) 73.2(2.3) 72.6(2.4) 74.9(1.9)

5.5.3 UCI Repository

We apply the kernel learning approaches defined in this section to six data-sets

from the UCI repository [7]. In the Breast Cancer data-set, the task is to discriminate

two classes: no-recurrence and recurrence. The Ionosphere set is for the satellite

imaging detection of two classes (the presence or absence of structure) in the ground.

In the BUPA liver disorders set, blood test measurements are used to detect liver dysfunction. The goal of the Monk problem is to distinguish two distinct postures

of a robot. Monk 1 and 2 denote two alternative scenarios. Finally, the NIH Pima

data-set is used to detect diabetes from eight measurements.

For each data-set, we created five random partitions of the data, each with 80%

of the samples for training and the remaining 20% for testing. The successful classification rates on the above data-sets are shown in Tables 5.2 and 5.3. Both the mean and standard deviation (in parentheses) are reported. A paired t-test is used to check statistical

significance. The classification rate in bold is significantly higher than the others at

significance level 0.05. The proposed approach outperforms the other kernel learning

algorithms. The comparison of the proposed regression-based inductive learning ver-

sus the typical transductive alternative is also favorable to the proposed approach.

In addition, our approach does not need to re-estimate the kernel matrix every time

a previously unseen test sample is to be classified. Additionally, the approach de-

scribed in the present paper defines a smaller kernel matrix, with smaller memory

requirements.

We also report the training time in kernel matrix learning for each algorithm in

Table 5.4. Since no training is needed in the algorithm of uniform kernel combination,

we do not include this algorithm in the comparison. From Table 5.4, we first observe

Table 5.4: Average training time (in seconds) of each algorithm in the UCI data-sets.

Data-set GAOurs GAT SKM CV KBA


Breast C. 330.6 373.8 78.3 6.6 0.5
Ionosphere 339.1 737.2 39.2 13.5 2.0
Liver 409.7 1071.7 126.2 50.3 5.7
Monk 1 275.5 311.9 41.1 1.3 0.3
Monk 2 47.7 75.9 20.2 1.5 1.9
Pima 3095.8 4762.9 2681.2 66.5 10.9

that all the algorithms with multiple kernels need more training time than those with

a single kernel. As we discussed before, transductive learning is computationally

expensive and slower than our algorithm. Yet, our algorithm is slower than SKM in

these binary classification data-sets. However, we will see later that SKM becomes

much more time consuming when multi-class classification is performed.

A general question in GA-based approaches is to know how fast the algorithm

converges. This is, of course, problem specific. Figure 5.8(a) and (b) plot the classi-

fication accuracy as a function of iterations for two of the databases used above. To

obtain these plots, we executed our approach 50 times. The figures show the mean

and standard deviation. We observed a rapid convergence on the data-sets used.

Another interesting question is how well the proposed optimization approach com-

pares to the traditional GA algorithm with crossover and mutation only. Moreover,

how do the proposed advanced GA operators help to improve the kernel matrix? To

see this, we present additional plots with the traditional GA algorithm and each of

the proposed operators only. First, in Figure 5.8(b) and (g), we see that the tra-

ditional GA algorithm can improve the classification accuracy as the kernel matrix


Figure 5.8: Plots of the classification accuracy (y-axis) versus the number of generations (x-axis). The plots from (a) to (e) were obtained with different optimization approaches applied to KDA using the Monk 1 database, and the plots from (f) to (j) were obtained with different optimization approaches applied to SVM using the Breast Cancer database. (a) and (f) show the proposed genetic-based optimization approach. (b) and (g) show the traditional GA algorithm with crossover and mutation only. (c) and (h) show the GA algorithm with the transposition operator only. (d) and (i) show the GA algorithm with the deletion operator only. (e) and (j) show the GA algorithm with the insertion operator only.
Table 5.5: KDA Recognition rates (%) for large data-sets.

KDAGA KDAT R KDAKP KDAT KDAU KDAS KDAK KDAF KDACV


PIE10 78.3(1.4) 73.2(2.2) 70.6(2.0) 72.8(1.7) 74.4(1.7) 75.6(1.0) 61.0(2.0) 59.7(1.9) 64.5(1.6)
PIE20 90.8(1.0) 87.3(2.0) 86.5(0.9) 87.5(0.8) 88.4(0.7) 88.9(0.9) 86.9(0.9) 86.3(0.9) 86.8(1.6)
PIE30 93.9(0.6) 93.6(0.7) 89.6(1.1) 90.8(0.7) 93.4(1.0) 92.1(0.8) 93.3(0.8) 92.9(0.8) 92.5(0.7)
SPDM 85.7(0.8) 84.0(0.8) 82.6(0.9) 83.5(1.0) 83.6(0.9) 83.6(1.0) 83.8(0.9) 83.9(1.0) 84.0(1.2)

evolves. However, the final accuracies it obtains are lower than those obtained by the

proposed optimization approach. This means that the proposed additional operators

could further facilitate the optimization process and improve the classification per-

formance. From Figure 5.8(c)-(e) and (h)-(j), we see that each of the proposed new

operators can help optimize the kernel matrix and improve the classification accuracy to some extent. For the same data-set, one operator may perform better than another, e.g., Figure 5.8(g) and (d). Some operators work better on one data-set than on another, e.g., Figure 5.8(e) and (j). It is the combination of all of these

operators that makes our approach more effective in classification.

5.5.4 Large databases

Our next experiment is on the PIE data-set of face images [86]. Here, the task

is to classify faces according to the identity of the individual shown in the image.

All face images were aligned with regard to the main facial features and resized to a

standard size of 32 × 32 pixels, as in [60]. The results are in Tables 5.5 and 5.6. In these

tables, N specifies the number of images per class used to train the kernel matrix.

Table 5.6: SVM Recognition rates (%) for large data-sets.

Data-set SVMGA SVMT R SVMKP SVMT SVMU SKM SVMK SVMCV


PIE10 75.6(1.0) 71.3(2.7) 71.5(1.6) 72.3(1.2) 62.5(2.0) 73.5(1.0) 53.7(2.3) 67.3(2.3)
PIE20 87.3(0.4) 82.7(0.8) 83.4(0.6) 86.1(0.4) 83.0(0.4) 85.1(0.7) 80.8(0.6) 86.0(0.4)
PIE30 92.0(0.8) 91.5(0.8) 89.6(1.0) 90.7(0.6) 90.5(1.4) 90.9(1.0) 90.8(1.0) 92.0(0.8)
SPDM 85.9(0.7) 85.0(1.7) 80.5(0.4) 82.3(1.0) 84.6(1.2) 85.2(0.9) 85.5(0.9) 85.1(1.0)

Table 5.7: Average training time (in seconds) of each algorithm in large data-sets.

Data-set  GAOurs      GAT         SKM         CV          KBA

PIE10     4.7 × 10^4  1.0 × 10^5  3.6 × 10^5  1.1 × 10^3  2.2 × 10^2
PIE20     6.5 × 10^4  2.2 × 10^5  6.2 × 10^5  4.0 × 10^3  6.4 × 10^2
PIE30     1.2 × 10^5  5.4 × 10^5  8.7 × 10^5  7.5 × 10^3  1.2 × 10^3
SPDM      9.1 × 10^4  2.3 × 10^5  5.6 × 10^4  2.5 × 10^3  2.7 × 10^2

The results are averaged over five random trials. As above, the proposed approach

outperforms the others.

We also used the Sitting Posture Distribution Maps (SPDM) data-set of [117]. In

this data-set, samples were collected using a chair equipped with a pressure sensor

sheet located on the sit-pan and back-rest of a chair. A total of 1,280 pressure values

from 50 individuals are provided from the pressure maps. There are five samples of

each of the ten different postures per individual. The goal is to classify each of the

samples into one of the ten sitting postures. We randomly selected 3 samples of each

posture and each individual for training, and used the rest for testing. The results

are then averaged over five trials. The results are shown in Tables 5.5 and 5.6. The proposed approach performs better than the others.

We report the average training time of each algorithm in Table 5.7. We again

see that our algorithm is faster than transductive learning. Moreover, in this case,

our algorithm is also faster than SKM. This is because SKM can only learn a kernel matrix for two classes. When there are multiple classes, we have to use a one-versus-one mechanism to extend it to the multi-class case. Thus, the training time greatly depends on the total number of classes. The more classes there are, the more training time it takes. In contrast, our algorithm can directly deal with the multi-class case and is thus more efficient.

5.5.5 Discussions of the genetic operators

In this section, we give a detailed discussion of how the genetic operators help

to optimize the kernel matrix. First, note that each genome u on which the genetic

operators are directly applied is formed by concatenating all the entries of a matrix L,

where $K = L^T L$, and K is the kernel matrix to be learned. Denote $L = (l_1, l_2, \ldots, l_n)$, where $l_i$ is an $n \times 1$ vector and n is the number of training samples. Then

$$K = \begin{pmatrix} l_1^T l_1 & \cdots & l_1^T l_n \\ \vdots & \ddots & \vdots \\ l_n^T l_1 & \cdots & l_n^T l_n \end{pmatrix}.$$

This means that $K(x_i, x_j) = l_i^T l_j$. Thus, the changes in genome u will result in the

corresponding changes in the entries of the kernel matrix K. Now that we have this

interpretation, we can discuss how each genetic operator works to improve the kernel

matrix.
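A simplified illustration of this encoding (it ignores the coding/non-coding structure and assumes the genome u simply stacks the columns of L):

```python
import numpy as np

def genome_to_kernel(u, n):
    """Reshape a genome (the stacked columns of L) into L and return K = L^T L."""
    L = np.asarray(u, dtype=float).reshape(n, n, order='F')   # columns l_1, ..., l_n
    return L.T @ L                                            # K(x_i, x_j) = l_i^T l_j

n = 4
K = genome_to_kernel(np.random.randn(n * n), n)
# K is symmetric and positive semi-definite by construction, i.e., a valid kernel matrix.
print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) >= -1e-10))
```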

In crossover, two offspring are obtained by combining two existing solutions as in (5.1). For ease of discussion, suppose the crossover position r is a multiple of n, i.e., r = mn, where m is an integer. After crossover, one of the offspring, $u_i^{[t+1]}$, is reshaped to form a new matrix, $L_i^{[t+1]}$, with $L_i^{[t+1]} = (l_1^i, \ldots, l_m^i, l_{m+1}^j, \ldots, l_n^j)$, where $l_k^i$ is the kth column of $L_i^{[t]}$. Then the corresponding kernel matrix is reconstructed as $K_i^{[t+1]} = L_i^{[t+1]T} L_i^{[t+1]}$, which is

$$K_i^{[t+1]} = \begin{pmatrix}
l_1^{iT} l_1^i & \cdots & l_1^{iT} l_m^i & l_1^{iT} l_{m+1}^j & \cdots & l_1^{iT} l_n^j \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
l_m^{iT} l_1^i & \cdots & l_m^{iT} l_m^i & l_m^{iT} l_{m+1}^j & \cdots & l_m^{iT} l_n^j \\
l_{m+1}^{jT} l_1^i & \cdots & l_{m+1}^{jT} l_m^i & l_{m+1}^{jT} l_{m+1}^j & \cdots & l_{m+1}^{jT} l_n^j \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
l_n^{jT} l_1^i & \cdots & l_n^{jT} l_m^i & l_n^{jT} l_{m+1}^j & \cdots & l_n^{jT} l_n^j
\end{pmatrix}.$$
Comparing $K_i^{[t+1]}$ with $K_i^{[t]}$, we see that a submatrix of $K_i^{[t]}$, namely its entries in rows $m+1, \ldots, n$ and columns $1, \ldots, n$, has been replaced (note that since the kernel matrix is symmetric, only the lower off-diagonal elements are considered). This submatrix corresponds to the classification of samples in one particular region of the feature space. Given $K_i^{[t]}$ and $K_j^{[t]}$, if the corresponding submatrix in $K_j^{[t]}$ can do better classification than that in $K_i^{[t]}$, then after crossover, the offspring $K_i^{[t+1]}$ can improve the classification in the region represented by this submatrix in $K_j^{[t]}$.

In the insertion operator, a random sequence is inserted to a randomly selected

position of a genome. Note that our feature representation incorporates the non-

coding sequences into the genome, allowing a flexible length of the genome. Thus,

the insertion of a sequence corresponds to a local change of the kernel matrix. More

formally, suppose that the insertion of the sequence causes a change of a vector $l_q$ in $L_i^{[t]}$, then this will result in a change of the corresponding row and column in the kernel matrix $K_i^{[t+1]}$, i.e.,

$$K_i^{[t+1]} = \begin{pmatrix}
 &  & l_1^T l_q &  &  \\
 &  & \vdots &  &  \\
l_q^T l_1 & \cdots & l_q^T l_q & \cdots & l_q^T l_n \\
 &  & \vdots &  &  \\
 &  & l_n^T l_q &  &
\end{pmatrix}. \qquad (5.16)$$

This change will affect the similarity between the q th sample and all the other samples

in the data. As a result, the local classification function is changed by insertion, which

could help to resolve the misclassification in a local region of the feature space.

In the deletion operator, a random sequence is deleted from a random position

of the genome, which leads to a corresponding local change of the kernel matrix. If

we again suppose the deletion will cause a change of a vector $l_q$, similarly to the insertion

operator, this will result in a change of the corresponding row and column in the

kernel matrix, as in (5.16). By deletion, the local classification function is rearranged

such that the classification in a section of the feature space could be improved.

In the copy-and-paste transposition, a sequence of the genome is copied and inserted at a new position in the same genome. Suppose the transposon comes from $l_p$ and is copied to $l_q$. This will cause a change of the qth column and row of the kernel matrix. This

implies that a local classification function with good performance is now applied to

a new region in the feature space. If this improves classification, then the new kernel

matrix will be selected.

In the cut-and-paste transposition, a sequence of the genome is removed and inserted at a new position in the same genome. Again, suppose the transposon comes from $l_p$ and is moved to $l_q$. This will cause a change of both the pth and qth columns and rows of the kernel matrix. This implies that a local classification function that does not work well in one section of the feature space will now be applied to a new section of the feature space. If this improves classification, then the new kernel matrix will be selected over

the old one.

5.5.6 Application to regression

We select 7 data-sets from the UCI machine learning [7] and the DELVE collections

[29].

In the Boston housing data-set, the task is to predict the median value of a home

price. The auto mpg set details fuel consumption predicted in terms of 3 discrete

and 4 continuous attributes. In the Normtemp set, the goal is to predict the heart

rate based on gender and body temperature of 130 people. The Airport set requires

prediction of the enplaned revenue in tons of mail. The task in the Puma-8nm is

to predict angular accreditation from a simulation of the dynamics of a robot arm.

And, the Kin problem requires us to predict the distance of the end-effector from a

target in a simulation of the forward dynamics of an 8 link all-revolute robot arm.

Two cases with moderate and high amount of noise are considered, denoted Kin-8nm

and Kin-8nh.

For the first four data-sets, we randomly select 90% of the samples for training,

and use the rest for testing. This is repeated 10 times and the mean and standard

deviation of the errors are reported. The remaining databases have a larger number

of samples, allowing a random split into disjoint subsets. The first 1,024 samples

in each subset are used for training, while the others form the testing set. Again,

we report the mean and standard deviation of the errors of four splits. We use the

root mean squared error (RMSE) as our measure of the deviation between the true
Pn 1/2
response yi and the predicted response yi , i.e., RMSE = [n1 i=1 (yi yi )2 ] .
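For completeness, the error measure as a small illustrative helper:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between the true and predicted responses."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```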

We compare the proposed approach with two state-of-the-art regression methods,

KRR and Support Vector Regression (SVR). In KRR, the parameters are selected

by CV and GCV, denoted by KRRCV and KRRGCV , respectively. The parameters

in SVR are selected by CV. A recent work [87] introduces the use of multiple kernels

into SVR, allowing a multiple kernel learning approach for regression by using semi-

infinite linear programming. Later, [76] shows how regression with multiple kernel

learning is performed by quadratically constrained quadratic programming. Here

we compare to the approach in [76], and denote it MKL-SVR. Another work [13]

performs multiple kernel learning in the context of KRR. We denote it MKL-KRR.

We also provide comparative results with transductive learning, the traditional GA,

and uniform kernel combination, denoted KRRT , KRRT R and KRRU . The results of

KP for generalizing to new data are also reported, denoted KRRKP .

The regression performances of all algorithms are shown in Table 5.8. The pro-

posed kernel approach is generally superior to the other state-of-the-art algorithms.

We also show the training time in Table 5.9. We see that our algorithm takes com-

parable training time to the other two multiple kernel learning algorithms, i.e.,

Table 5.8: Mean and standard deviation of the RMSE.

Data-set KRRGA KRRT R KRRKP KRRT KRRCV KRRGCV SVRCV MKL-SVR KRRU MKL-KRR
Housing 2.75(0.77) 2.66(0.46) 5.73(2.54) 2.93(0.90) 3.27(0.79) 3.35(1.08) 3.35(1.30) 3.11(1.09) 2.52(0.77) 2.53(0.84)
Mpg 2.24(0.26) 2.73(0.60) 2.96(0.50) 2.50(0.45) 2.62(0.28) 2.96(0.60) 3.01(0.66) 2.82(0.73) 2.76(0.35) 2.70(0.35)
Normtemp 5.56(1.15) 6.79(1.08) 7.24(1.40) 6.45(1.15) 7.00(0.79) 7.44(0.85) 7.35(1.30) 7.58(1.60) 7.85(0.80) 8.32(1.42)
Puma-8nm 1.40(0.02) 1.44(0.02) 3.11(0.02) 1.51(0.03) 1.62(0.02) 1.60(0.03) 1.44(0.03) 2.27(0.42) 1.70(0.04) 1.77(0.04)
Puma-8nh 3.52(0.06) 3.52(0.11) 4.18(0.11) 3.61(0.07) 3.56(0.08) 3.54(0.09) 3.46(0.13) 3.68(0.08) 3.72(0.07) 3.66(0.07)
Kin-8nm 0.10(0.003) 0.11(0.003) 0.16(0.002) 0.12(0.001) 0.14(0.002) 0.13(0.004) 0.11(0.002) 0.12(0.01) 0.10(0.003) 0.11(0.002)
Kin-8nh 0.18(0.003) 0.18(0.003) 0.21(0.004) 0.20(0.003) 0.19(0.004) 0.19(0.003) 0.19(0.005) 0.19(0.009) 0.18(0.004) 0.18(0.004)

Table 5.9: Average training time (in seconds) of each algorithm.

Data-set   GAOurs      GAT         CV          GCV         MKL-SVR     MKL-KRR
Housing    6.6 × 10^3  9.7 × 10^4  1.4 × 10^3  3.6 × 10^2  5.5 × 10^3  5.8 × 10^3
Mpg        1.8 × 10^3  4.5 × 10^3  7.7 × 10^2  71.0        2.0 × 10^3  1.8 × 10^3
Normtemp   150.0       550.0       19.6        3.7         80.1        46.9
Puma-8nm   4.0 × 10^4  2.0 × 10^5  1.7 × 10^4  9.9 × 10^2  2.4 × 10^4  1.7 × 10^4
Puma-8nh   4.0 × 10^4  1.9 × 10^5  1.3 × 10^4  8.3 × 10^2  2.0 × 10^4  1.3 × 10^4
Kin-8nm    3.7 × 10^4  1.7 × 10^5  9.0 × 10^3  8.9 × 10^2  2.4 × 10^4  1.7 × 10^4
Kin-8nh    3.6 × 10^4  1.7 × 10^5  2.0 × 10^4  1.1 × 10^3  1.7 × 10^4  1.7 × 10^4

MKL-SVR and MKL-KRR, but has an advantage that better prediction is achieved.

To conclude, we apply our approach to age estimation from face images. The aging process can induce significant changes in human facial appearance, which are generally

detectable in images. We used the FG-NET aging database of [2] to model these

changes. This database includes 1,002 face images of 82 subjects at different ages.

The ages range from 0 to 69. Face images include changes in illumination, pose,

expression and occlusion (e.g., glasses and beards). All images are warped to a

Table 5.10: MAE of the proposed approach and the state-of-the-art in age estimation.

Data-set KRRGA KRRT R KRRKP KRRT KRRCV KRRGCV SVRCV MKL-SVR KRRU MKL-KRR
MAE 5.87(0.22) 5.95(0.31) 12.89(0.65) 6.31(0.30) 6.59(0.31) 13.83(0.79) 6.46(0.35) 7.18(0.46) 27.2(19.7) 8.05(0.40)

standard size of 60 × 60 pixels with all major facial features properly aligned, as in

[60]. We represent each image as a vector concatenating all the pixels of the image,

i.e., the appearance-based feature representation.

We generate five random partitions of the data, each with 800 images for training

and 202 for testing. The mean absolute errors (MAE) are in Table 5.10. Again, we

can see that the proposed approach outperforms the other algorithms in predicting

the age of individuals.

5.6 Conclusions

We have proposed a genetic-based optimization mechanism to find the kernel map

minimizing the classification error of complex, non-linearly separable problems. In

particular, we introduced a coding-non-coding representation and defined three novel

operators transposition, insertion and deletion. These include viral infections that

result in DNA changes and yields an efficient search strategy within the vast space

of all possible kernel matrices. Regression is then used to estimate the underlying

mapping function given by the resulting kernel matrix, resolving the complexity is-

sues of transductive learning. We also extend the proposed kernel matrix learning

framework to work in regression. Comparative results against classical kernel meth-

ods demonstrate the superiority of the proposed approach. We have also shown fast

convergence on the databases used.

CHAPTER 6

CONCLUSIONS AND FUTURE WORK

6.1 Conclusions

Kernel methods have been extensively used in machine learning and shown to have

good generalization ability in many applications. A key problem in kernel methods

is how to determine the mapping model that leads to better learning and improved

generalization performance. This dissertation gives a comprehensive study of the

model selection problems for kernel methods in pattern recognition and machine

learning. We focus on two typical scenarios in supervised learning: classification and

regression. In each scenario, we have proposed several novel approaches to learning.

This involves learning both the kernel mappings and parameters.

In Chapter 2, we derived two criteria to optimize the kernel parameters given a

parametrized kernel function in classification. Many approaches have been proposed

for kernel optimization in the literature, but these are not directly related to the

idea of the Bayes optimal classifier in the kernel space, which is the classifier with the

smallest possible classification error. Our approaches are inspired by Bayes optimality

and we fully exploit this idea. In the first approach, we want to achieve the original

goal of the kernel mapping: the class distributions in the kernel space can be linearly

separated. To do this, we first derive a homoscedastic criterion which measures the

degree of homoscedasticity of the class distributions. Then, the kernel parameters

can be optimized by simultaneously maximizing the homoscedasticity and separabil-

ity between the pairwise class distributions. This optimization enforces the linear

separability of the classes to the largest extent. To relax the single Gaussian distri-

bution assumption for each class, we use a mixture of Gaussians to define each class

and show that our criterion can be easily modified to adapt to this new modeling. We

also show how our approach can be efficiently employed using a quasi-Newton based

optimization technique.

In the second criterion, instead of exploring a linear classifier, we directly mini-

mize the Bayes classification error in the kernel space over all the kernel mappings

to optimize the kernel parameters. This is plausible because different kernel presen-

tations result in different Bayes error. We first derive an effective measure which

approximates the Bayes accuracy (defined by one minus Bayes error) in the kernel

space, and then maximize this measure to find the optimal kernel parameters. We

further show how to employ our criterion to discover the underlying subclass divi-

sions in each class. Extensive experiments using a number of well-known databases in

object categorization, face recognition, and handwritten digit classification demonstrate

both the effectiveness and efficiency of our methods over the state of the art.

In Chapter 3, we propose a framework to do model selection in kernel-based re-

gression approaches. Model selection in linear regression has been largely studied.

However, it is not adequately explored in nonlinear regression. The goal is to achieve

a good balance between the model fit and model complexity in a regression model.

From the well-known bias-variance trade-off, we know we cannot simultaneously re-

duce both of them. If one is reduced, the other increases, and vice versa. We

first derive measures for model fit and model complexity from a decomposition of the

generalization error of the learned function and show that balancing the two measures

is equivalent to minimizing the generalization error. Then, we adopt a multiobjec-

tive optimization approach to balance the two measures by exploring Pareto-optimal

solutions. A modified $\epsilon$-constraint method is presented to guarantee the solutions

to be Pareto-optimal. The proposed model selection approach is applied to kernel

ridge regression and kernel principal component regression, which are two popularly

used kernel-based regression methods. Experiments using many benchmark data-sets

show that the proposed approach performs generally better than other model selection

methods and state of the art regression approaches.

In kernel methods literature, the Gaussian RBF kernel is one of the most popularly

and successfully used kernels. In this kernel, the sample similarity is evaluated using

a fixed local window size. Thus, the estimation with over-fitting or under-fitting

problems may arise if the local data density changes. We introduce a new family of

kernels called Local Density Adaptive Kernels in Chapter 4. The window size of our

kernels can vary to adaptively fit the local data density, thus giving a better likelihood

evaluation. Although our kernels implicitly change their shape, we show that they are Mercer kernels, and hence can be directly used in any kernel method such as

Kernel Discriminant Analysis and Support Vector Machine. We then show that our

kernels outperform the fixed-shape kernels such as the RBF kernel and Mahalanobis

kernel in many applications.

Thus far we have only considered a single kernel function in kernel methods. In many

applications, the use of multiple kernel functions would be more appropriate since it

combines the characteristics of all kernels, leading to better learning. In the literature,

many approaches have been proposed to construct a linear or nonlinear combination

of multiple kernels, which requires a pre-specified formulation for the combination. Un-

fortunately, no prior knowledge is available to indicate which combination is better.

To resolve this, we introduced a new multiple kernel learning approach in Chap-

ter 5 by employing genetic algorithms. The main advantage of our method is that

there is no need to specify an explicit combination of multiple kernels, and the ker-

nel matrix can evolve during the generations using the genetic operators until the

classification/prediction error falls below a given threshold. We also introduce a new

genetic representation for each kernel matrix and present more advanced operators to

facilitate the optimization process. We then show how to learn a mapping function

represented by the learned kernel matrix to generalize to the test data. We applied

our kernel matrix learning algorithm to both classification and regression.

6.2 Future work

In this dissertation we have addressed one important problem in kernel meth-

ods, i.e., model selection. This problem directly determines the performance of ker-

nel methods. Another important problem is the computational cost of these kernel

methods. This involves both computational time and memory. For a data-set with n

samples, the complexity of a kernel algorithm is typically $O(n^3)$. If n is very large,

then it is computationally expensive. Also, a kernel algorithm usually requires at

least several $n \times n$ matrices to be stored in memory, which needs a large amount

of memory space when n is large. Since the size of the real world data is commonly

huge, if we want to apply the kernel methods to such data, we need to find some

way to reduce the computational cost in order to make them work efficiently in practice.

One possible solution is to define some sparse learning techniques. For example, the

learning model could be represented by a smaller portion of the data, i.e., by obtaining a rep-

resentative subset of the data during learning. This could be extremely useful when

high redundancy exists in the data. We can also explore how our model selection

approaches can be adapted to sparse learning techniques.

Another problem is model selection in other machine learning applications. Thus

far, we only consider classification and regression. There are many other useful ap-

plications such as data clustering, manifold learning, ranking, etc. Since the goals in these applications are generally different from those in classification or regression,

different model selection methods are needed for each specific application domain.

BIBLIOGRAPHY

[1] S. Abe. Training of support vector machines with mahalanobis kernels. In Proc.
International Conference on Artificial Neural Networks, pages 571576, 2005.

[2] FG-NET aging database. http://www.fgnet.rsunit.com/.

[3] E. E. Andersen and A. D. Andersen. The mosek interior point optimizer for
linear programming: An implementation of the homogeneous algorithm. High
Performance Optimization, pages 197232, 2002.

[4] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning,


conic duality, and the SMO algorithm. In Proc. International Conference on
Machine Learning, pages 4148, 2004.

[5] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel


approach. Neural Computation, 12(10):2385-2404, 2000.

[6] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University


Press, 1995.

[7] C. L. Blake and C. J. Merz. UCI repository of machine


learning databases. University of California, Irvine, http
://www.ics.uci.edu/mlearn/MLRepository.html, 1998.

[8] L. Bregman. The relaxation method of finding the common point of convex sets
and its application to the solution of problems in convex programming. USSR
Comp. Mathematics and Mathematical Physics, 7:200217, 1967.

[9] L. Breiman, W. Meisel, and E. Purcell. Variable kernel estimate of multivariate


densities. Technometrics, 19:135144, 1977.

[10] A. B. Chan and N. Vasconcelos. Probabilistic kernels for the classification of


auto-regressive visual processes. In Proc. IEEE Conference on Computer Vision
and Pattern Recognition, pages 846851, 2005.

[11] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple
parameters for support vector machines. Machine Learning, 46(1-3):131159,
2002.

[12] B. Chen, L. Yuan, H. Liu, and Z. Bao. Kernel subclass discriminant analysis.
Neurocomputing, 2007.

[13] C. Cortes, M. Mohri, and A. Rostamizadeh. L2-regularization for learning ker-


nels. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial
Intelligence, 2009.

[14] C. Cortes, M. Mohri, and A. Rostamizadeh. Learning non-linear combinations


of kernels. In Advances in Neural Information Processing Systems, 2009.

[15] K. Crammer, J. Keshet, and Y. Singer. Kernel design using boosting. In


Advances in Neural Information Processing Systems, pages 537-544, 2003.

[16] N. Cristianini, C. Campbell, and J. Shawe-Taylor. Dynamically adapting ker-


nels in support vector machines. In Advances in neural information processing
systems II, pages 204 210, 1998.

[17] N. Cristianini, A. Elisseeff, and J. Shawe-Taylor. On optimizing kernel align-


ment. In Advances in Neural Information Processing Systems 14, 2002.

[18] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On kernel target


alignment. In Proc. Advances in Neural Information Processing Systems, pages 367-373, 2001.

[19] F. De la Torre and O. Vinyals. Learning kernel expansions for image classifica-
tion. In Proc. IEEE Conference on Computer Vision and Pattern Recognition,
pages 17, 2007.

[20] J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal
Machine Learning Research, 7:130, 2006.

[21] J. Dennis and R. Schnabel. Numerical Methods for Unconstrained Optimization


and Nonlinear Equations. Englewood Cliffs, NJ: Prentice-Hall, 1983.

[22] A. Desai, H. singh, and V. Pudi. Gear: Generic, efficient, accurate knn-based
regression. In Intl Conf on Knowledge Discovery and Information Retrieval,
2010.

[23] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods


for the classification of tumors in gene expression data. Technical Report 576,
University of California Berkeley, Dept. of Statistics, 2000.

[24] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression.
Annals of Statistics, 32(2):407499, 2004.

[25] A. Elgammal, R. Duraiswami, and L. S. Davis. Efficient kernel density esti-


mation using the fast gauss transform with applications to color modeling and
tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence,
25(11):14991504, 2003.

[26] G. Fan and J. Gray. Regression tree analysis using target. Journal of Compu-
tational and Graphical Statistics, 14(1):113, 2005.

[27] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals


of Eugenics, 7, 1936.

[28] R. A. Fisher. The statistical utilization of multiple measurements. Annals of


Eugenics, 8:376386, 1938.

[29] Data for Evaluating Learning in Valid Experiments (DELVE).


http://www.cs.toronto.edn/ delve/. university of toronto, toronto, ontario,
canada.

[30] J. H. Friedman. Regularized discriminant analysis. Journal of the American


Statistical Association, 84:165175, 1989.

[31] K. Fukunaga and J.M. Mantock. Nonparametric discriminant analysis. IEEE


Trans. Pattern Analysis and Machine Intelligence, 5:671678, 1983.

[32] Keinosuke Fukunaga. Introduction to statistical pattern recognition (2nd ed.).


Academic Press, San Diego, CA, 1990.

[33] A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In Un-


certainty in Artificial Intelligence, pages 148155, 1998.

[34] T. Glasmachers and C. Igel. Maximum likelihood model selection for 1-norm
soft margin SVMs with multiple parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1522-1528, 2010.

[35] C. Gold and P. Sollich. Model selection for support vector machine classification.
Neurocomputing, 55:221249, 2003.

[36] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine


Learning. Kluwer Academic Publishers, Boston, MA, 1989.

[37] G. H. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a


method for choosing a good ridge parameter. Technometrics, 21(2):215223,
1979.

[38] M. Gonen and E. Alpaydin. Localized multiple kernel learning. In Proc. Inter-
national Conference on Machine Learning, 2008.
[39] Y. Y. Haimes, L. S. Lasdon, and D. A. Wismer. On a bicriterion formulation
of the problems of integrated system identification and system optimization.
IEEE Transactions on Systems, Man, and Cybernetics, pages 296297, 1971.
[40] O. C. Hamsici and A. M. Martinez. Bayes optimality in linear discriminant
analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence,
30:647657, 2008.
[41] O. C. Hamsici and A. M. Martinez. Rotation invariant kernels and their appli-
cation to shape analysis. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2009.
[42] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.
Springer-Verlag (2nd Edition), New York, NY, 2001.
[43] X. He and P. Niyogi. Locality preserving projections. In Proc. Advances in
Neural Information Processing Systems 16, 2004.
[44] L. Holmstrom and P. Koistinen. Using additive noise in back-propagation train-
ing. IEEE Transactions on Neural Networks, 3(1):2438, 1992.
[45] E. Hu, S. Chen, D. Zhang, and X. Yin. Semisupervised kernel matrix learning by kernel propagation. IEEE Transactions on Neural Networks, 21(11):1831-1841, 2010.
[46] T. Jaakkola, M. Diekhans, and D. Haussler. Using the fisher kernel method to
detect remote protein homologies. In Proc. Internation Conference on Intelli-
gent Systems for Molecular Biology, pages 149158, 1999.
[47] N. Karmarkar. A new polynomial time algorithm for linear programming. Com-
binatorica, 4(4):373395, 1984.
[48] V. Katkovnik and I. Shmulevich. Kernel density estimation with varying win-
dow size. Pattern Recognition Letters, 23:16411648, 2002.
[49] S.J. Kim, A. Magnani, and S. Boyd. Optimal kernel selection in kernel fisher
discriminant analysis. In Int. Conf. Machine Learning, pages 465472, 2006.
[50] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. lp norm multiple kernel
learning. Journal of Machine Learning Research, 12:953997, 2011.
[51] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan.
Learning the kernel matrix with semidefinite programming. Journal of Machine
Learning Research, 5:2772, 2004.

[52] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning
applied to document recognition. Proceedings of IEEE, 92(11):22782324, 1998.

[53] B. Leibe and B. Schiele. Analyzing appearance and contour based methods
for object categorization. In Proc. IEEE Conference on Computer Vision and
Pattern Recognition, 2003.

[54] J. Liu, J. Chen, S. Chen, and J. Ye. Learning the optimal neighborhood kernel
for classification. In International Joint Conference on Artificial Intelligence,
Pasadena, California, 2009.

[55] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text


classification using string kernels. Journal of Machine Learning Research, 2:419
444, 2002.

[56] D. Loftsgaarden and C. Quesenberry. A nonparametric estimate of a multi-


variate density function. Annals of Mathimatical Statistics, 36(3):10491051,
1965.

[57] M. Loog and R. P. W. Duin. Linear dimensionality reduction via a heteroscedas-


tic extension of lda: The chernoff criterion. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 26(6):732739, 2007.

[58] M. Loog, R. P. W. Duin, and R. Haeb-Umbach. Multiclass linear dimension


reduction by weighted pairwise fisher criteria. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 23(7):762766, 2001.

[59] J. R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications


in Statistics and Econometrics, 2nd Edition. John Wiley and Sons, 1999.

[60] A. M. Martinez. Recognizing imprecisely localized, partially occluded and ex-


pression variant faces from a single sample per class. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 24(6):748763, 2002.

[61] A. M. Martinez and R. Benavente. The AR Face Database. CVC Technical


Report No. 24, June, 1998.

[62] A. M. Martinez and O. C. Hamsici. Who is LB1? discriminant analysis for the
classification of specimens. Pattern Rec., 41:34363441, 2008.

[63] A. M. Martinez and M. Zhu. Where are linear feature extraction methods
applicable? IEEE Transactions on Pattern Analysis and Machine Intelligence,
27(12):19341944, 2005.

[64] B. McClintock. The origin and behavior of mutable loci in maize. In Proceedings
of the National Academy of Sciences of the USA, volume 36, pages 344355,
1950.
[65] G. McLachlan and K. Basford. Mixture Models: Inference and applications to
clustering. Marcel Dekker, 1988.
[66] K. Miettinen. Nonlinear Multiobjective Optimization, volume 12 of Interna-
tional Series in Operations Research and Management Science. Kluwer Aca-
demic Publishers, Dordrecht, 1999.
[67] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. Muller. Fisher discriminant
analysis with kernels. In Proc. IEEE Neural Networks for Signal Processing
Workshop, pages 4148, 1999.
[68] M. Mitchell. An Introduction to Genetic Algorithms. MIT Press, 1996.
[69] P. J. Moreno, P. P. Ho, and N. Vasconcelos. A kullback-leibler divergence based
kernel for svm classification in multimedia applications. In Advances in Neural
Information Processing Systems, 2003.
[70] E. A. Nadaraya. On estimating regression. Theory of Probability and its Appli-
cations, 9:141142, 1964.
[71] M. H. Nguyen and F. De la Torre. Robust kernel principal component analysis.
In Advances in Neural Information Processing Systems, 2008.
[72] F. Odone, A. Barla, and A. Verri. Building kernels from binary strings for image
matching. IEEE Transactions on Image Processing, 14(2):169180, 2005.
[73] E. Parzen. On estimation of a probability density function and mode. Annals
of Mathematical Statistics, 33:10651076, 1962.
[74] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for
predictivity in learning theory. Nature, 428:419422, 2004.
[75] O. Pujol and D. Masip. Geometry-based ensembles: Towards a structural char-
acterization of the classification boundary. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 31(6):11401146, 2009.
[76] S. Qiu and T. Lane. A framework for multiple kernel support vector regression
and its applications to sirna efficacy prediction. IEEE/ACM Transactions on
Computational Biology and Bioinformatics, 6(2):190199, 2009.
[77] Y. Radhika and M. Shashi. Atmospheric temperature prediction using support
vector machines. International Journal of Computer Theory and Engineering,
1(1):5558, 2009.

[78] C. R. Rao. The utilization of multiple measurements in problems of biological
classification. J. Royal Statistical Soc., B, 10:159203, 1948.

[79] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. the
MIT Press, 2006.

[80] C. E. Rasmussen and Z. Ghahramani. Occam's razor. In Advances in Neural


Information Processing Systems 13, 2001.

[81] P. Russell. Genetics. Addison- Wesley, 1998.

[82] B. Scholkopf. The kernel trick for distances. In Advances in Neural Information
Processing Systems, pages 301307, 2000.

[83] Bernhard Scholkopf, Alexander Smola, and Klaus-Robert Muller. Nonlinear


component analysis as a kernel eigenvalue problem. Neural Compututation,
10(5):12991319, 1998.

[84] Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support
Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[85] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cam-
bridge University Press, 2004.

[86] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression
(PIE) database. In Proceedings of the 5th IEEE International Conference on
Face and Gesture Recognition, 2002.

[87] S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf. Large scale multiple


kernel learning. Journal of Machine Learning Research, 7:15311565, 2006.

[88] M. Stone. Cross-validatory choice and assessment of statistical predictions (with


discussion). Journal of the Royal Statistical Society, Series B, 36:111147, 1974.

[89] G. Terrell and D. Scott. Variable kernel density estimation. The Annals of
Statistics, 20(3):12361265, 1992.

[90] C. M. Theobald. An inequality for the trace of the product of two symmetric
matrices. Proceedings of the Cambridge Philosophical Society, 77:256267, 1975.

[91] M. E. Tipping. Sparse bayesian learning and the relevance vector machine.
Journal of Machine Learning Research, (1):211244, 2001.

[92] V. Vapnik. The Nature of Statistical Learning Theory. New York: Springer,
1995.

[93] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.

[94] V. Vapnik, S. Golowich, and A. Smola. Support vector method for function
approximation, regression estimation, and signal processing. In M. Mozer, M.
Jordan, and T. Petsche, editors, Advances in Neural Information Processing
Systems 9, The MIT Press, Cambridge, MA, 1996.

[95] M. Varma and B. R. Babu. More generality in efficient multiple kernel learning.
In Proc. International Conference on Machine Learning, pages 465–472, 2009.

[96] G. Wahba. Spline Models for Observational Data. Society for Industrial and
Applied Mathematics, 1990.

[97] J. Wang, H. P. Lu, K. N. Plataniotis, and J. W. Lu. Gaussian kernel optimization
for pattern classification. Pattern Recognition, 42(7):1237–1247, 2009.

[98] L. Wang, K. L. Chan, P. Xue, and L. P. Zhou. A kernel-induced space selection
approach to model selection in KLDA. IEEE Transactions on Neural Networks,
19:2116–2131, 2008.

[99] S. Wang, W. Zhu, and Z. Liang. Shape deformation: SVM regression and
application to medical image segmentation. In Proceedings of the International
Conference on Computer Vision, 2001.

[100] Y. Wang. A New Approach to Fitting Linear Models in High Dimensional
Spaces. PhD dissertation, University of Waikato, 2000.

[101] Z. Wang, S. C. Chen, and T. K. Sun. MultiK-MHKS: A novel multiple kernel
learning algorithm. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 30(2):348–353, 2008.

[102] Cambridge weather database. http://www.cl.cam.ac.uk/research/dtg/weather/.
University of Cambridge.

[103] K. Q. Weinberger and G. Tesauro. Metric learning for kernel regression. In
Eleventh International Conference on Artificial Intelligence and Statistics,
Omnipress, 2007.

[104] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression.
In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in
Neural Information Processing Systems 8, pages 514–520, The MIT Press,
Cambridge, MA, 1996.

[105] L. Wolf and A. Shashua. Learning over sets using kernel principal angles.
Journal of Machine Learning Research, 4:913–931, 2003.

[106] G. Wu and E. Chang. Adaptive feature-space conformal transformation for
imbalanced-data learning. In Proc. International Conference on Machine
Learning, pages 816–823, 2003.

[107] S. Wu and S. Amari. Conformal transformation of kernel functions: A
data-dependent way to improve support vector machine classifiers. Neural
Processing Letters, 15:59–67, 2002.

[108] H. Xiong, M. N. S. Swamy, and M. O. Ahmad. Optimizing the kernel in the
empirical feature space. IEEE Transactions on Neural Networks, 16(2):460–474,
2005.

[109] J. Yang, A. F. Frangi, J.-Y. Yang, D. Zhang, and Z. Jin. KPCA plus LDA:
A complete kernel Fisher discriminant framework for feature extraction and
recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
27(2):230–244, 2005.

[110] M.-H. Yang. Kernel eigenfaces vs. kernel Fisherfaces: Face recognition using
kernel methods. In Proc. IEEE International Conference on Automatic Face
and Gesture Recognition, 2002.

[111] J. Ye, S. Ji, and J. Chen. Multi-class discriminant kernel learning via convex
programming. Journal of Machine Learning Research, 9:719–758, 2008.

[112] D. Yeung, H. Chang, and G. Dai. Learning the kernel matrix by maximizing a
KFD-based class separability criterion. Pattern Recognition, 40:2021–2028, 2007.

[113] D. You, O. C. Hamsici, and A. M. Martinez. Kernel optimization in discriminant
analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence,
33(3):631–638, 2011.

[114] D. You and A. M. Martinez. Bayes optimal kernel discriminant analysis. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 3533–3538, 2010.

[115] S. Zhou, B. Georgescu, X. Zhou, and D. Comaniciu. Image based regression
using boosting method. In Proceedings of the Tenth IEEE International
Conference on Computer Vision, 2005.

[116] M. Zhu and A. M. Martinez. Subclass discriminant analysis. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 28(8):1274–1286, 2006.

[117] M. Zhu and A. M. Martinez. Pruning noisy bases in discriminant analysis.
IEEE Transactions on Neural Networks, 19(1):148–157, 2008.

