
Model Selection in Kernel Methods

Dissertation

Presented in Partial Fulfillment of the Requirements for


the Degree Doctor of Philosophy in the
Graduate School of The Ohio State University

By

Di You, M.S.

Graduate Program in Electrical and Computer Engineering

The Ohio State University

2011

Dissertation Committee:
Aleix M. Martinez, Adviser
Yuan F. Zheng
Yoonkyung Lee
© Copyright by

Di You

2011
ABSTRACT

Kernel methods have been extensively studied in pattern recognition and machine

learning over the last decade, and they have been successfully used in a variety of

applications. A main advantage of kernel methods is that nonlinear problems such as

classification and regression can be efficiently solved using classical linear approaches.

The performance of kernel methods greatly depends on the selected kernel model. The

model is defined by the kernel mapping and its parameters. Different models result in

different generalization performance. Hence, model selection in kernel methods is an

important problem and remains a challenge in the literature. In this dissertation, we

propose several approaches to address this problem. Our approaches can determine

good learning models by optimizing both the kernels and all other parameters in the

kernel-based algorithms.

In classification, we develop an algorithm yielding class distributions that are

linearly separable in the kernel space. The idea is to enforce the homoscedasticity

and separability of the pairwise class distributions simultaneously in the kernel space.

We show how this approach can be employed to optimize kernels in discriminant

analysis. We then derive a criterion to search for a good kernel representation by

directly minimizing the Bayes classification error over different kernel mappings.

In regression, we derive a model selection approach to directly balance the model

fit and model complexity using the framework of multiobjective optimization. We

develop an algorithm to obtain the Pareto-optimal solutions which balance the trade-

off between the model fit and model complexity. We show how the proposed method

is related to minimizing the predicted generalization error of the learning function.

In our final algorithm, the kernel matrix is recursively learned with genetic algo-

rithms until the classification/prediction error falls below a threshold. We derive a

family of adaptive kernels to better fit the data with various densities and show their

superiority over the commonly used fixed-shape kernels.

Extensive experimental results demonstrate that the proposed approaches are su-

perior to the state of the art.

To my parents and my wife

ACKNOWLEDGMENTS

First of all, I greatly thank my advisor, Dr. Aleix M. Martinez, for his guidance, support, and patience throughout my PhD work. I have learned a lot from him, including a rigorous scientific attitude, methods for doing good research, and the spirit of a researcher. He has guided me towards the completion of this dissertation and my PhD study.

I would also like to thank all my friends and labmates: Onur Hamsici, Hongjun Jia, Liya Ding, Paulo Gotardo, Samuel Riveras, Fabian Benitez-Quiroz, Shichuan Du, Yong Tao, and Felipe Giraldo. I have benefited a lot from the many discussions with them and have really had a good time in the lab.

Finally, but most importantly, I want to thank my parents and my wife. It is my parents who have given me endless love, care, and support so that I could finish this long and difficult process. I am also grateful to my wife for her love and encouragement throughout this process.

This research was partially supported by the US National Institutes of Health

under grant R21 DC 011081 and R01 EY 020834.

VITA

July 27, 1984 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Born - Jiamusi, Heilongjiang, China

2007 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.S. Electrical Engineering,
Harbin Institute of Technology, China

2010 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.S. Department of Electrical and Computer Engineering,
The Ohio State University, USA

2007-2011 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Graduate Research Associate,
Department of Electrical and Computer Engineering,
The Ohio State University, USA

PUBLICATIONS

Research Publications

D. You, O. C. Hamsici and A. M. Martinez. Kernel Optimization in Discriminant


Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33,
no. 3, pp. 631-638, 2011.

D. You and A. M. Martinez. Bayes Optimal Kernel Discriminant Analysis. In


Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 3533-3538,
2010.

D. You and A. M. Martinez. Multiobjective Optimization for Model Selection in


Kernel Methods in Regression. Submitted to IEEE Transactions on Pattern Analysis
and Machine Intelligence.

D. You and A. M. Martinez. Kernel Matrix Learning with Genetic Algorithm.
Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence.

D. You and A. M. Martinez. Local Density Adaptive Kernels. Submitted to IEEE


Transactions on Neural Networks.

FIELDS OF STUDY

Major Field: Electrical and Computer Engineering

Studies in Pattern Recognition and Computer Vision: Prof. Aleix M. Martinez

TABLE OF CONTENTS

Page

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

Chapters:

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Kernel methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5


1.2 Literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Kernel parameter selection . . . . . . . . . . . . . . . . . . 10
1.2.2 Kernel matrix learning . . . . . . . . . . . . . . . . . . . . . 14
1.2.3 New kernel development . . . . . . . . . . . . . . . . . . . . 16
1.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . 17

2. Kernel Learning in Discriminant Analysis . . . . . . . . . . . . . . . . . 21

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 The metrics of discriminant analysis . . . . . . . . . . . . . . . . . 24
2.3 Homoscedastic criterion . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.1 Maximizing homoscedasticity . . . . . . . . . . . . . . . . . 30
2.3.2 Derivation of the Gradient . . . . . . . . . . . . . . . . . . . 38
2.3.3 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . 40

2.4 Kernel Bayes accuracy criterion . . . . . . . . . . . . . . . . . . . . 42
2.4.1 Bayes accuracy in the kernel space . . . . . . . . . . . . . . 43
2.4.2 Kernel parameters with gradient ascent . . . . . . . . . . . 45
2.4.3 Subclass extension . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.4 Optimal subclass discovery . . . . . . . . . . . . . . . . . . 47
2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.5.1 Homoscedastic criterion . . . . . . . . . . . . . . . . . . . . 50
2.5.2 KBA criterion . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3. Model Selection in Kernel Methods in Regression . . . . . . . . . . . . . 62

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2.1 Generalization error . . . . . . . . . . . . . . . . . . . . . . 66
3.2.2 Model fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2.3 Roughness penalty in RBF . . . . . . . . . . . . . . . . . . 70
3.2.4 Polynomial kernel . . . . . . . . . . . . . . . . . . . . . . . 72
3.2.5 Comparison with other complexity measure . . . . . . . . . 73
3.3 Multiobjective Optimization . . . . . . . . . . . . . . . . . . . . . . 76
3.3.1 Pareto-Optimality . . . . . . . . . . . . . . . . . . . . . . . 76
3.3.2 The ε-constraint approach . . . . . . . . . . . . . . . . . . . 77
3.3.3 The modified ε-constraint . . . . . . . . . . . . . . . . . . . 79
3.3.4 Alternative Optimization Approaches . . . . . . . . . . . . . 83
3.4 Applications to Regression . . . . . . . . . . . . . . . . . . . . . . . 84
3.4.1 Kernel Ridge Regression . . . . . . . . . . . . . . . . . . . . 84
3.4.2 Kernel Principal Component Regression . . . . . . . . . . . 86
3.5 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.5.1 Standard data-sets . . . . . . . . . . . . . . . . . . . . . . . 90
3.5.2 Comparison with the state of the art . . . . . . . . . . . . . 92
3.5.3 Alternative Optimizations . . . . . . . . . . . . . . . . . . . 95
3.5.4 Comparison with the L2 norm . . . . . . . . . . . . . . . . . 95
3.5.5 Age estimation . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.5.6 Weather prediction . . . . . . . . . . . . . . . . . . . . . . . 97
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4. Local Density Adaptive Kernels . . . . . . . . . . . . . . . . . . . . . . . 101

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101


4.2 Local Density Adaptive Kernels . . . . . . . . . . . . . . . . . . . . 104
4.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.2.2 Defining Mercer kernels . . . . . . . . . . . . . . . . . . . . 105

4.2.3 Window size . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.2.4 A case study . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.3 Kernel Parameter Selection . . . . . . . . . . . . . . . . . . . . . . 112
4.3.1 k-fold cross-validation . . . . . . . . . . . . . . . . . . . . . 112
4.3.2 Kernel Bayes accuracy criterion . . . . . . . . . . . . . . . . 112
4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.4.1 UCI benchmark data-sets . . . . . . . . . . . . . . . . . . . 113
4.4.2 Image databases . . . . . . . . . . . . . . . . . . . . . . . . 114
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

5. Kernel Matrix Learning with Genetic Algorithms . . . . . . . . . . . . . 119

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119


5.2 Learning with Genetic Algorithms . . . . . . . . . . . . . . . . . . 123
5.2.1 Feature representation . . . . . . . . . . . . . . . . . . . . . 123
5.2.2 Basic operators . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.2.3 Transposition . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.2.4 Deletion and insertion . . . . . . . . . . . . . . . . . . . . . 128
5.2.5 Selection criterion . . . . . . . . . . . . . . . . . . . . . . . 130
5.3 Generalizing to Test Samples in Classification . . . . . . . . . . . . 132
5.4 Kernel Matrix Learning in Regression . . . . . . . . . . . . . . . . 134
5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.5.1 A toy example . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.5.2 Classification algorithms . . . . . . . . . . . . . . . . . . . . 139
5.5.3 UCI Repository . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.5.4 Large databases . . . . . . . . . . . . . . . . . . . . . . . . 146
5.5.5 Discussions of the genetic operators . . . . . . . . . . . . . . 148
5.5.6 Application to regression . . . . . . . . . . . . . . . . . . . 151
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

6. Conclusions and future work . . . . . . . . . . . . . . . . . . . . . . . . . 156

6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156


6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

LIST OF TABLES

Table Page

2.1 Recognition rates (in percentages) with nearest mean . . . . . . . . . 54

2.2 Recognition rates (%) with nearest neighbor . . . . . . . . . . . . . . 55

2.3 Recognition rates (%) with the smooth nearest-neighbor classifier . . 55

2.4 Recognition rates (%) with linear SVM . . . . . . . . . . . . . . . . . 56

2.5 Training time (in seconds) . . . . . . . . . . . . . . . . . . . . . . . . 56

2.6 Recognition rates (%) with nearest neighbor. Bold numbers specify the
top recognition obtained with the three criteria in KSDA and KDA.
An asterisk specifies a statistical significance on the highest recognition
rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

2.7 Recognition rates (%) with the classification method of [75]. . . . . . 58

2.8 Recognition rates (%) with linear SVM. . . . . . . . . . . . . . . . . . 59

3.1 Results for KRR. Mean RMSE and standard deviation (in parentheses). . 89

3.2 Results for KPCR. Mean RMSE and standard deviation (in parentheses). . 89

3.3 Mean and standard deviation of RMSE of different methods. . . . . . . . 93

3.4 Comparison of our results with the state of the art. . . . . . . . . . . . . 93

3.5 Regression performance with alternative optimization criteria. . . . . . . . 94

3.6 Comparison with L2 norm. . . . . . . . . . . . . . . . . . . . . . . . . . 95

3.7 MAE of the proposed approach and the state of the art in age estimation. 97

3.8 RMSE of several approaches applied to weather prediction. . . . . . . . . 98

4.1 Recognition rates (%) with CV in UCI data-sets. . . . . . . . . . . . 114

4.2 Recognition rates (%) with KBA criterion in UCI data-sets. . . . . . 115

4.3 Recognition rates (%) with CV in ETH-80. . . . . . . . . . . . . . . . 116

4.4 Recognition rates (%) with KBA criterion in ETH-80. . . . . . . . . . 116

4.5 Recognition rates (%) with CV in PIE database. . . . . . . . . . . . . 117

4.6 Recognition rates (%) with KBA criterion in PIE database. . . . . . . 118

5.1 The parameters used in the experiments . . . . . . . . . . . . . . . . 141

5.2 KDA Recognition rates (in percentages) in the UCI data-sets. . . . . 142

5.3 SVM Recognition rates (%) in the UCI data-sets. . . . . . . . . . . . 142

5.4 Average training time (in seconds) of each algorithm in the UCI data-sets.144

5.5 KDA Recognition rates (%) for large data-sets. . . . . . . . . . . . . 146

5.6 SVM Recognition rates (%) for large data-sets. . . . . . . . . . . . . . 147

5.7 Average training time (in seconds) of each algorithm in large data-sets. 147

5.8 Mean and standard deviation of the RMSE. . . . . . . . . . . . . . . 153

5.9 Average training time (in seconds) of each algorithm. . . . . . . . . . 153

5.10 MAE of the proposed approach and the state-of-the-art in age estimation. 154

LIST OF FIGURES

Figure Page

1.1 This figure illustrates the idea of kernel methods. The data in the original
space is nonlinearly separable. Using a mapping function φ(·), the data can
be mapped to a higher dimensional space where the data becomes linearly
separable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.2 Here we show an example of two non-linearly separable class distributions,


each consisting of 3 subclasses. (a) Classification boundary of LDA. (b)
SDA's solution. (c) KDA's solution. . . . . . . . . . . . . . . . . . . . . 8

1.3 Here we show an example of two kernel mappings. (a) The Gaussian RBF
kernel. σ is the kernel parameter. The kernel value measuring the sample
similarity on x is determined by the nearby samples of x. (b) The polynomial
kernel. d is the degree of the kernel. The kernel value measuring the sample
similarity on x is determined by all the samples. . . . . . . . . . . . . . 9

2.1 Three examples of the use of the homoscedastic criterion, Q1. The examples
are for two Normal distributions with equal covariance matrix up to scale
and rotation. (a) The value of Q1 decreases as the rotation angle increases. The
2D rotation angle between the two distributions is on the x axis. The value of Q1
is on the y axis. (b) When the angle is 0°, the two distributions are homoscedastic,
and Q1 takes its maximum value of .5. Note how for distributions that are
close to homoscedastic (i.e., angles near 0°), the value of the criterion remains high.
(c) When the angle is 45°, the value has decreased to about .4. (d) By 90°, Q1 is about .3. 33

2.2 Here we show a two-class classification problem with multi-modal class dis-
tributions. When the kernel parameter equals 1, both KDA (a) and KSDA (b) generate
solutions that have small training error. (c) However, when the model complexity is
small (kernel parameter equal to 3), KDA fails. (d) KSDA's solution resolves this problem with
piecewise smooth, nonlinear classifiers. . . . . . . . . . . . . . . . . . . . 41

2.3 The original data distributions are mapped to different kernel spaces via
different mapping functions φ(·). The second mapping is better than the first in terms of the
Bayes error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

2.4 Comparative results between the (a-d) KBA and (e-h) Fisher criteria. The
true underlying number of subclasses per class are (a,e) 2, (b,f) 3, (c,g) 4,
and (d,h) 5. The x-axis specifies the number of subclasses Hi . The y-axis
shows the value of the criterion given in (2.12) in (a-d) and of the Fisher
criterion in (e-h). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.5 (a) The classical XOR classification problem. (b) Plot of the KBA criterion
versus Hi . (c) Plot of the Fisher criterion. . . . . . . . . . . . . . . . . . 49

2.6 Shown here are (a) 8 categories in ETH-80 database and (b) 10 different
objects for the cow category. . . . . . . . . . . . . . . . . . . . . . . . 51

2.7 Plots of the value of the derived criterion as a function of the kernel param-
eter and the number of subclasses. From left to right and top to bottom:
AR, ETH-80, Monk 1, and Ionosphere databases. . . . . . . . . . . . . . 60

3.1 The two plots in this figure show the contradiction between the RSS and
the curvature measure with respect to: (a) the kernel parameter, and (b)
the regularization parameter in Kernel Ridge Regression. The Boston
Housing data-set [7] is used in this example. Note that in both cases, while
one criterion increases, the other decreases. Thus, a compromise between
the two criteria ought to be determined. . . . . . . . . . . . . . . . . . . 72

3.2 Here we show a case of two objective functions. u(S) represents the set
of all the objective vectors with the Pareto frontier colored in red. The
Pareto-optimal solution can be determined by minimizing u1 given that
u2 is upper-bounded by ε. . . . . . . . . . . . . . . . . . . . . . . . . . 78

3.3 Comparison between the proposed modified and the original ε-constraint
methods. We have used * to indicate the objective vector and o to spec-
ify the solution vector. Solutions given by (a) the ε-constraint method
and (b) the proposed modified ε-constraint approach on the first exam-
ple, and (c) the ε-constraint method and (d) the modified ε-constraint ap-
proach on the second example. Note that the proposed approach identifies
the Pareto-frontier, while the original algorithm identifies weakly Pareto-
solutions, since the solution vectors go beyond the Pareto-frontier. . . . . 82

3.4 Sample images showing the same person at different ages. . . . . . . . . . 97

3.5 This figure plots the estimated (lighter dashed curve) and actual (darker
dashed curve) maximum daily temperature for a period of more than 200
days. The estimated results are given by the algorithm proposed in this
chapter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

4.1 A two class example. Each class is represented by a mixture of two Gaussians
with different covariance matrices. The RBF and the proposed Local-density
Adaptive (LA) kernels are evaluated at the four marked points. (a)
Density estimation in the RBF kernel uses a fixed window, illustrated by
black circles. Note that this fixed window cannot capture different local
densities. (b) Density estimation with the proposed LA kernel. . . . . . . 102

4.2 This figure illustrates how the local variance measurement given by (4.7)
is used. The axis represents the magnitude of the variance around each
sample. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

4.3 (a) A case study with synthetic data simulating the classical XOR problem.
(b) classification accuracies of the proposed LA and RBF kernels under dif-
ferent covariance factors c. The proposed kernel obtains higher classification
accuracies than the RBF as c increases. . . . . . . . . . . . . . . . . . . 111

4.4 Shown here are sample images from PIE data-set. . . . . . . . . . . . . 116

5.1 (a) The classical feature representation. Each entry in the feature vector
codes for a relevant variable in the optimization problem. (b) The proposed
feature representation. Each individual in the population is represented as a
feature vector with coding and non-coding segments. The lower case letters
represent the coding (or gene) sequence used for the calculation of the fitness
function. Consecutive N labels indicate non-coding DNA. . . . . . . . . 124

5.2 This figure illustrates the copy-and-paste transposition. . . . . . . . . . 126

5.3 This figure illustrates the cut-and-paste transposition. . . . . . . . . . . 128

5.4 This figure illustrates gene deletion operation for two cases. (a) Only a non-
coding sequence is deleted. (b) A part of gene is deleted and a new gene is
formed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

5.5 (a) A XOR data classification problem. Samples shown as red triangles form one
class and samples shown as blue circles form the other. (b) This plot shows the
classification accuracy over the number of generations. . . . . . . . . . . 136

5.6 In this figure we show how the kernel matrix evolves. (a)-(f) illustrate the
kernel matrix in different generations. . . . . . . . . . . . . . . . . . . . 137

5.7 This plot shows the kernel alignment between the learned kernel matrix and
the ideal one over the generations. . . . . . . . . . . . . . . . . . . . . 139

5.8 Plots of the classification accuracy (y-axis) versus number of generations (x-
axis). The plots from (a) to (e) were obtained with different optimization
approaches applied to KDA using monk1 database, and the plots from (f)
to (j) were obtained with different optimization approaches applied to SVM
using breast cancer database. (a) and (f) show the proposed genetic-based
optimization approach. (b) and (g) show the traditional GA algorithm with
crossover and mutation only. (c) and (h) show GA algorithm with transition
operator only. (d) and (i) show GA algorithm with deletion operator only.
(e) and (j) show GA algorithm with insertion operator only. . . . . . . . 145

CHAPTER 1

INTRODUCTION

The goal of pattern recognition is to describe, recognize, classify, and group pat-

terns of interest. While it seems an easy task for humans, such as identifying a

person and recognizing different objects, it is very challenging to teach computers to

recognize patterns.

Over the decades, extensive research has been conducted in this field and a number

of approaches for pattern recognition have been developed. These pattern recogni-

tion techniques have been widely used in a variety of fields such as computer vi-

sion, artificial intelligence, bioinformatics, psychology and paleontology. Well-known

applications are in face recognition and verification, automated speech recognition,

fingerprint identification, DNA sequence analysis to name but a few.

Since many common pattern recognition algorithms are probabilistic in nature,

statistical pattern recognition approaches have been most intensively studied and em-

ployed in practice [32]. In such approaches, each pattern is represented in terms of

d features and is viewed as a point in a d-dimensional vector space. Patterns are

assumed to be generated by a probabilistic model, and statistical concepts and ap-

proaches are employed to build the decision boundary or model the distribution of the

data. Depending on whether the training samples are labeled or unlabeled, statistical

pattern recognition can be divided into two categories: supervised and unsupervised

methods.

In supervised learning, the goal is to predict a functional relationship between

the objects and their associated labels. If the labels are discrete, the correspond-

ing problem is a classification problem. Well-known approaches for classification are

Linear Discriminant Analysis (LDA) [32] and Support Vector Machines (SVMs) [93].

If the labels are continuous, we talk about regression. The least-squares solutions

and their variants [42] (e.g. ridge regression) are popular approaches for regression.

Unsupervised learning seeks to determine how the data are organized. Data represen-

tation (e.g., principal component analysis) and clustering (e.g., k-means) are typical

examples in this class of approaches.

In this dissertation, we focus on the supervised learning approaches.

Among the many approaches to supervised learning that have been developed thus

far, Discriminant Analysis (DA) is one of the earliest and most used techniques in

pattern recognition. It has been used for feature extraction and classification with

broad applications in, for instance, computer vision [32], gene expression analysis [23]

and paleontology [62]. In his ground-breaking work, Fisher [27, 28] derived a DA

approach for the two Normally distributed class problem, $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$,
under the assumption of equal covariance matrices, $\Sigma_1 = \Sigma_2$. Here, $\mu_i$ and $\Sigma_i$ are the
mean feature vector and the covariance matrix of the $i$th class, and $N(\cdot)$ represents the

Normal distribution. The assumption of identical covariances (i.e., homoscedasticity)

implies that the Bayes (optimal) classifier is linear, which is the reason why we refer

to this algorithm as Linear Discriminant Analysis (LDA). LDA thus provides the

one-dimensional subspace where the Bayes classification error is the smallest in the

2-class homoscedastic problem.
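
To make the link between homoscedasticity and linearity explicit, the short derivation below (a standard result, stated here only for completeness) shows that when the two classes share a covariance matrix $\Sigma$, the quadratic terms of the Bayes log-likelihood ratio cancel; $\pi_1$ and $\pi_2$ denote the class priors and $g(x)$ the resulting decision function.

```latex
\begin{aligned}
g(x) &= \log\frac{\pi_1 N(x;\mu_1,\Sigma)}{\pi_2 N(x;\mu_2,\Sigma)}\\
     &= -\tfrac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)
        +\tfrac{1}{2}(x-\mu_2)^T\Sigma^{-1}(x-\mu_2)
        +\log\frac{\pi_1}{\pi_2}\\
     &= (\mu_1-\mu_2)^T\Sigma^{-1}x
        -\tfrac{1}{2}\bigl(\mu_1^T\Sigma^{-1}\mu_1-\mu_2^T\Sigma^{-1}\mu_2\bigr)
        +\log\frac{\pi_1}{\pi_2}.
\end{aligned}
```

Since $g(x)$ has the form $w^T x + b$ with $w = \Sigma^{-1}(\mu_1 - \mu_2)$, the Bayes (optimal) decision boundary $g(x) = 0$ is a hyperplane, which is exactly the linear classifier referred to above.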

Fisher's work was later extended to solve the multi-class classification problem in

a least-squares framework [78]. In this solution, LDA employs two symmetric, posi-

tive semi-definite matrices, each defining a metric [63]. One of these metrics should

measure within-class differences and, as such, should be minimized. The other metric

should account for between-class dissimilarity and should thus be maximized. Classi-

cal choices for the first metric are the within-class scatter matrix $S_W$ and the sample
covariance matrix $\Sigma_X$, while the second metric is usually given by the between-class
scatter matrix $S_B$. The sample covariance matrix is defined as

$$\Sigma_X = n^{-1} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T, \qquad (1.1)$$

where $X = \{x_1, \dots, x_n\}$ are the $n$ training samples, $x_i \in \mathbb{R}^p$, and $\mu = n^{-1} \sum_{i=1}^{n} x_i$ is
the sample mean. The between-class scatter matrix is given by

$$S_B = \sum_{i=1}^{C} p_i (\mu_i - \mu)(\mu_i - \mu)^T, \qquad (1.2)$$

where $\mu_i = n_i^{-1} \sum_{j=1}^{n_i} x_{ij}$ is the sample mean of class $i$, $x_{ij}$ is the $j$th sample of class $i$,
$n_i$ is the number of samples in that class, $C$ is the number of classes, and $p_i = n_i/n$
is the prior of class $i$. LDA's solution is then given by the generalized eigenvalue
decomposition equation $\Sigma_X^{-1} S_B V = V \Lambda$, where the columns of $V$ are the eigenvectors
and $\Lambda$ is a diagonal matrix of corresponding eigenvalues. Thus, the solution of LDA
indicates a $(C-1)$-dimensional subspace where the between-class scatters are maximized
and the within-class scatters are minimized.
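
For concreteness, the following is a minimal numerical sketch of the LDA solution just described, assuming the training samples are stored in a NumPy array X (n rows, p columns) with integer class labels y. The function and variable names, and the small ridge term added to keep the sample covariance invertible, are illustrative choices rather than part of the original formulation.

```python
import numpy as np

def lda_subspace(X, y, reg=1e-6):
    """Solve the generalized eigenproblem Sigma_X^{-1} S_B V = V Lambda."""
    n, p = X.shape
    mu = X.mean(axis=0)                          # overall sample mean
    Sigma_X = (X - mu).T @ (X - mu) / n          # sample covariance, Eq. (1.1)

    S_B = np.zeros((p, p))                       # between-class scatter, Eq. (1.2)
    for c in np.unique(y):
        Xc = X[y == c]
        d = (Xc.mean(axis=0) - mu)[:, None]
        S_B += (Xc.shape[0] / n) * (d @ d.T)     # weighted by the class prior

    # Eigenvectors of Sigma_X^{-1} S_B; a small ridge keeps Sigma_X invertible.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sigma_X + reg * np.eye(p), S_B))
    order = np.argsort(-evals.real)
    C = len(np.unique(y))
    return evecs[:, order[:C - 1]].real          # the (C-1)-dimensional subspace

# Toy usage: three 2-D Gaussian classes projected onto the two LDA directions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, (50, 2)) for m in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 50)
Z = X @ lda_subspace(X, y)
```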

The idea of LDA is attractive, because we could obtain a linear classifier with the

smallest classification error (also known as the Bayes error) provided that the class distributions are single Gaussians and the covariance matrices are identical. However, in practice, the class distributions can be highly non-Gaussian and distinct from each other, which makes the assumptions of LDA very restrictive. In other words, if the real data distributions deviate from this underlying assumption, then LDA would not work well. This is the major drawback of LDA.

To relax this assumption, numerous approaches have been proposed in the litera-

ture. Loog and Duin [57] define a within-class similarity metric using the Chernoff

distance which incorporates the differences of both means and covariance matrices,

yielding an algorithm which can handle heteroscedastic (i.e., non-homoscedastic) dis-

tributions. Another way is to allow each class to be divided into several subclasses

by imposing a mixture of Gaussians for each class distribution. This is the underly-

ing idea of subclass DA (SDA) [116]. Since a mixture of Gaussians is more flexible

to model the underlying class distributions than a single Gaussian, this approach is

shown to perform well for a variety of applications. To loosen the parametric restric-

tion of the above assumption, Fukunaga and Mantock [31] redefine the between-class

scatter matrix in a non-parametric fashion, and the decision boundary is constructed

locally. Specifically, a local classifier for each sample is first built based upon the

sample and its local k nearest neighbors, and then the final decision boundary is

constructed by combining all the local classifiers.

The classifiers obtained from the above approaches are linear or piecewise linear.

However, such classifiers may not be adequate for a classification problem with a

highly nonlinear decision boundary. This is because the features in such ap-

proaches are extracted from a linear combination of the features in the original space.

To derive a nonlinear classifier, a nonlinear combination of the original features would

be more appropriate. Recently, kernel methods have been developed to tackle the

nonlinear problem.

1.1 Kernel methods

Kernel methods have attracted great interest over the past decade and have been

shown to perform well in nonlinear feature extraction and classification [84,

93]. The idea is to use a kernel function which maps the original nonlinearly separable

data to a very high or even infinite dimensional space where the data is linearly

separable, see Figure 1.1. Then, any efficient linear classification approach can be

employed in this so-called kernel space. Since the mapping is intrinsic, one does not

need to work with an explicit mapping function. Instead, one can employ the kernel

trick [84], allowing nonlinear formulations to be cast in terms of inner products. This

will result in a space of the same dimensionality as that of the input representation

while still eliminating the nonlinearity of the data.

Formally, suppose a training data-set $\{(x_i, y_i)\}_{i=1}^{n}$ is given, where $x_i \in \mathbb{R}^p$ is the
$i$th observation and $y_i$ is the corresponding label of $x_i$, with $y_i \in \mathbb{R}$ in regression and
$y_i \in \{-1, 1\}$ in classification. In general, a function $f(x)$ is built to model the
functional relationship between $x$ and $y$ (note that in classification, the class label is
obtained by $\mathrm{sgn}(f(x))$, where $\mathrm{sgn}(\cdot)$ is the sign function). $f(x)$ can be modeled as
a linear function of $x$, i.e.,

$$f(x) = w^T x + b, \qquad (1.3)$$

where $w \in \mathbb{R}^p$ is a weight vector and $b \in \mathbb{R}$ is an offset. However, this linear model
fails to capture the nonlinearity that usually exists in the data. In this case, kernel
methods can be used to model a nonlinear function.

Figure 1.1: This figure illustrates the idea of kernel methods. The data in the original
space is nonlinearly separable. Using a mapping function φ(·), the data can be mapped to
a higher dimensional space where the data becomes linearly separable.

Let $\phi(\cdot): \mathbb{R}^p \rightarrow F$ be a function defining a kernel mapping which maps the data
in the original space to the kernel space defined by $F$. Then (1.3) can be rewritten
as

$$f(x) = w^T \phi(x) + b, \qquad (1.4)$$

where $w$ is the weight vector in the kernel space. Unfortunately, the dimensionality
of $F$ may be too large, which makes it difficult to work with the explicit features
in the kernel space. To bypass this problem, the kernel trick [84] is generally used.
Specifically, from the Representer's Theorem [96], the weight vector $w$ can be defined
as a linear combination of the samples in the kernel space $\phi(X)$ with the coefficient
vector $\alpha$, i.e.,

$$w = \phi(X)\alpha, \qquad (1.5)$$

where $\phi(X) = (\phi(x_1), \dots, \phi(x_n))$ and $\alpha \in \mathbb{R}^n$. Substituting (1.5) into (1.4), we get

$$f(x) = \alpha^T \phi(X)^T \phi(x) + b = \alpha^T k(x) + b = \sum_{i=1}^{n} \alpha_i \langle x_i, x \rangle + b, \qquad (1.6)$$

where $\langle x_i, x \rangle$ is the inner product of $\phi(x_i)$ and $\phi(x)$, i.e., $\langle x_i, x \rangle = \phi(x_i)^T \phi(x)$. We
thus see that the model $f(x)$ just derived is linear in the kernel space but nonlinear
in the original one. Therefore, by specifying an appropriate kernel mapping function,
the nonlinearity of the original data is eliminated and a linear approach can be used
in the kernel space to efficiently solve the problem.
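
As an illustration of how (1.6) is used in practice, the sketch below fits a kernel ridge regressor in its dual form: the coefficient vector α is computed from the kernel (Gram) matrix alone, and predictions only require kernel evaluations between training and test samples. The choice of the RBF kernel, the regularization constant lam, and the omission of the offset b are assumptions made for this example, not prescriptions taken from the text.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_dual(X, y, sigma=1.0, lam=1e-2):
    """Dual (kernel) ridge regression: alpha = (K + lam I)^{-1} y."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_test, sigma=1.0):
    """f(x) = sum_i alpha_i k(x_i, x), the finite-sample form of Eq. (1.6)."""
    return rbf_kernel(X_test, X_train, sigma) @ alpha

# A nonlinear 1-D example: the model is linear in the kernel space only.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)
alpha = fit_dual(X, y, sigma=0.8)
y_hat = predict(X, alpha, X, sigma=0.8)
```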

There are a variety of algorithms in kernel methods. Kernel Discriminant Analysis

(KDA) [5, 67] is one of the most used kernel methods. KDA is a kernel extension

of LDA. It aims to simultaneously maximize the between-class scatter and minimize

the within-class scatter of the data in the kernel space. Ideally, if the kernel function

and associated parameters are set appropriately, the class distributions will become

homoscedastic in the kernel space and the smallest classification error (i.e., Bayes

error) can be obtained from the resultant linear Bayes classifier. The performance of

a KDA classifier is illustrated in Fig. 1.2.
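
A minimal two-class sketch of this idea is given below, using the common dual formulation of the kernel Fisher discriminant (the projection direction is expressed through a coefficient vector α over the training samples). The RBF kernel, the regularization constant, and the midpoint threshold on the projected class means are assumptions of this illustration, not details taken from the chapter.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kfd_fit(X, y, sigma=1.0, reg=1e-3):
    """Two-class kernel Fisher discriminant in dual form (labels 0/1)."""
    K, n = rbf(X, X, sigma), len(y)
    M, N = [], np.zeros((n, n))
    for c in (0, 1):
        idx = np.where(y == c)[0]
        Kc = K[:, idx]                                    # n x n_c block of K
        M.append(Kc.mean(axis=1))                         # kernelized class mean
        H = np.eye(len(idx)) - np.full((len(idx), len(idx)), 1.0 / len(idx))
        N += Kc @ H @ Kc.T                                # kernelized within-class scatter
    alpha = np.linalg.solve(N + reg * np.eye(n), M[0] - M[1])
    proj = K @ alpha
    t = 0.5 * (proj[y == 0].mean() + proj[y == 1].mean())  # midpoint threshold
    return alpha, t

def kfd_predict(X_train, alpha, t, X_test, sigma=1.0):
    # Class 0 projects above the threshold because alpha points along M[0] - M[1].
    return (rbf(X_test, X_train, sigma) @ alpha < t).astype(int)
```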

Kernel Support Vector Machine (KSVM) [92] is another kernel approach popularly

used in pattern recognition. Unlike the DA-based approach, KSVM does not make

any assumptions on the underlying class distributions. Instead, it is a discriminative

approach which directly maximizes the margin between the samples defining the two

classes. In general, the larger the margin, the better the generalization performance

is. This is supported by the principle of structure risk minimization [92].


Figure 1.2: Here we show an example of two non-linearly separable class distributions,
each consisting of 3 subclasses. (a) Classification boundary of LDA. (b) SDA's solution. (c)
KDA's solution.

The kernel mapping is a key process in kernel-based approaches. Different kernel

mappings characterize different representations of the data distributions in the kernel

space, thus requiring different learning models. A kernel mapping can be specified

by a parameterized kernel function, and different kernel functions specify distinct

mappings to the kernel space. For instance, a Gaussian RBF kernel characterizes a

local mapping, whereas a polynomial kernel characterizes a global mapping, Figure

1.3. An appropriately selected kernel function may greatly improve the algorithm

performance. However, one usually does not have any prior knowledge of which

kernel should be selected given a problem at hand.
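
The two kernels contrasted above are commonly written as follows; the exact parameterization (e.g., whether the RBF width appears as σ or as γ = 1/(2σ²), and whether the polynomial kernel includes an additive constant c) differs between references, so the forms below are one common convention rather than the definitions adopted later in this dissertation.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # Local mapping: the value decays with distance, so only nearby samples matter.
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, d=2, c=1.0):
    # Global mapping: every sample contributes through the inner product.
    return (x @ z + c) ** d

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(rbf_kernel(x, z, sigma=1.0), polynomial_kernel(x, z, d=2))
```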

Even when a kernel function is determined, the process of selecting the parameters

of the kernel which map the original nonlinear problem to a linear one still remains

a big challenge. Kernel parameters play a significant role in the kernel mapping

process. Each kernel parameter specifies a model for the problem to be solved. Thus,

kernel parameter selection is also equivalent to a model selection problem. It is always

desirable that a model could achieve a good bias and variance trade-off, according to


Figure 1.3: Here we show an example of two kernel mappings. (a) The Gaussian RBF
kernel. σ is the kernel parameter. The kernel value measuring the sample similarity on x
is determined by the nearby samples of x. (b) The polynomial kernel. d is the degree of
the kernel. The kernel value measuring the sample similarity on x is determined by all the
samples.

the bias and variance decomposition [42]. If the model is made too complex, an

over-fitting to the training data may occur, i.e., low bias and high variance. Whereas if

the model is made too simple, it may under-fit the data and will thus not effectively

capture the underlying structure of the data [42], i.e., high bias and low variance.

Unfortunately, without prior knowledge on the data, it is not easy to select good

kernel parameters. Therefore, model selection becomes a fundamental problem to be

solved in kernel methods.

In this dissertation, we give a comprehensive study of the model selection problem

in kernel methods and propose several novel approaches to address this problem. We

cast the problem into two typical scenarios: classification and regression. In the

section to follow, we give a literature review of the model selection approaches in

kernel methods.

1.2 Literature review

Model selection in kernel methods has been a very active and popular research

area. Kernel-based approaches are very powerful due to their high generalization per-

formance and efficiency using the kernel trick. Although promising, a main problem

cannot be circumvented, that is, how to learn a good kernel mapping to adapt to the

data at hand. In general, different kernel mappings lead to different generalization

performances.

In the literature, various approaches for kernel learning have been proposed. Gen-

erally, they can be divided into three classes. The first class of approaches is to learn

the kernel parameters given a parameterized kernel function. In the second class of ap-

proaches, a kernel matrix is directly learned without pre-specifying a kernel function,

and a positive semi-definiteness constraint has to be imposed. One typical approach

in this class is multiple kernel learning, where some basis kernels are first built and

then the final kernel is constructed as a linear or nonlinear combination of these basis

kernels. In the third class of approaches, instead of using some traditional kernel

function, some new kernel functions are proposed to specifically tackle the problem

at hand and are expected to perform better. In the following, we will give a review of

each class of approaches in detail.

1.2.1 Kernel parameter selection

One of the most commonly used kernel parameter selection methods is the cross-validation (CV) technique [88, 42]. In this approach, the training data is divided into k parts: (k − 1) of these are used for training the algorithm with distinct values of the parameters of the kernel, and the remaining one for validating which of these values results in higher classification or prediction accuracy. This method has four major drawbacks. First, it is computationally expensive. The training stage has to be repeated

k times, and the parameter selection is based on an exhaustive search. Second, only

part of the training data is used in each fold. When doing model selection, one wants

to employ the largest possible number of training samples, since this is known to

yield better generalizations [63]. Third, it only selects the parameters from a set of

discrete values and a careful range of the parameters should be pre-specified. Finally,

the selection of k can be an issue, since it affects the trade-off between bias and

variance of the corresponding estimator [42]. In particular, if k is small, the model

may not capture the underlying structure of the data; if k is large, the model would

have a good chance to overfit the training data and result in a poor generalization

performance.
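
The CV procedure just criticized takes only a few lines to run; the sketch below performs a grid search over a discrete set of RBF widths and regularization values for an SVM classifier with scikit-learn, which makes the exhaustive-search and discrete-grid drawbacks easy to see. The particular estimator, data-set, grid values, and k = 5 are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The candidate values must be fixed in advance (third drawback), and every
# candidate is retrained k times (first drawback).
param_grid = {"gamma": np.logspace(-4, 1, 6), "C": np.logspace(-1, 3, 5)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # k = 5 folds
search.fit(X, y)
print(search.best_params_, search.best_score_)
```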

An alternative to CV is Generalized CV (GCV) [37, 96], an approach originally

defined to select the ridge parameter in ridge regression. GCV can be directly ex-

tended to do model selection with kernel approaches, as long as the hat matrix [37],

which projects the original response vector to the estimated one, can be obtained.

However, since this criterion is an approximation of the leave-one-out CV (i.e., n-fold

CV, where n is the number of training samples), the estimated result generally has

a large variance, i.e., the learned function is highly variable and dependent on the

training data, since in each fold almost the same data is used to train the model.
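
For reference, for any linear smoother with hat matrix H (so that ŷ = Hy), the GCV score is n‖(I − H)y‖² / (tr(I − H))². The sketch below evaluates it for kernel ridge regression, where H = K(K + λI)⁻¹; the RBF kernel and the grid of candidate λ values are assumptions of this illustration.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def gcv_score(K, y, lam):
    """GCV = n ||(I - H) y||^2 / tr(I - H)^2 with H = K (K + lam I)^{-1}."""
    n = len(y)
    H = K @ np.linalg.inv(K + lam * np.eye(n))
    resid = y - H @ y
    return n * float(resid @ resid) / (n - np.trace(H)) ** 2

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
K = rbf_gram(X, sigma=0.8)
best_lam = min(np.logspace(-4, 1, 12), key=lambda lam: gcv_score(K, y, lam))
```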

For classification, a popular group of methods for kernel parameter selection is

based on the idea of the between-within class ratio as Fisher had originally proposed

for LDA [28]. Here, we will refer to this as the Fisher criterion. Wang et al. [98] and

Xiong et al.[108] define such a criterion, which maximizes the between-class scatter

and minimizes the within-class scatter in the kernel space, to optimize the kernel

parameter. This criterion maximizes the class separability in the kernel space, and

it is shown generally to obtain better classification performance than CV. Similarly,

Wang et al. [97] develop another version of the Fisher criterion, defined as the trace

of the ratio between the kernel versions of the between-class scatter matrix and the

within-class scatter matrix (a.k.a. discriminant power). Due to the difficulty of direct

calculation of the discriminant power, they employ an approximated measure based

on a decomposition of the discriminant power [63]. In [49], the Fisher criterion is

reformulated as a convex optimization problem and then used to find a solution over

a convex set of kernels. Alternatively, Cristianini et al. [17] define the concept of

kernel alignment to capture the agreement between a kernel and the target data. It

is shown how this measure can be used to optimize the kernel. However, Xiong et

al. [108] show that this kernel-target alignment criterion is equivalent to maximizing

the between-class scatter, provided that the kernel matrix has been centralized and

normalized by its Frobenius norm. The major drawback with these criteria is that

they are only based on the measures of class separability. Note that the measure

for class separability is not always related to the classification error. For example,

since the Fisher criterion is based on a least-squares formulation [40], this can easily

over-weight the influence of the classes that are farthest apart [58], i.e., the classifier

will be biased to those classes which are already well separated.
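
As a concrete example of these separability-based criteria, the kernel-target alignment of [17] scores a kernel matrix K against the ideal target yyᵀ (labels in {−1, +1}) as A(K, yyᵀ) = ⟨K, yyᵀ⟩_F / (n √⟨K, K⟩_F), and the kernel parameter is chosen to maximize this score. The sketch below selects an RBF width this way; the grid of candidate widths and the toy data are illustrative choices.

```python
import numpy as np

def rbf_gram(X, sigma):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def alignment(K, y):
    """Kernel-target alignment A(K, y y^T) for labels y in {-1, +1}."""
    return float(y @ K @ y) / (len(y) * np.linalg.norm(K, "fro"))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 0.7, (40, 2)), rng.normal(+1, 0.7, (40, 2))])
y = np.repeat([-1.0, 1.0], 40)
best_sigma = max(np.logspace(-1, 1, 9), key=lambda s: alignment(rbf_gram(X, s), y))
```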

Another solution is to come up with an approximation, usually an upper bound,

for the expected generalization error. Then optimization schemes are used to minimize

such approximations to select the kernel parameters. Cristianini et al. [16] optimize

the kernel parameters by minimizing an upper bound on the generalization error as

provided by the Vapnik-Chervonenkis (VC) theory. This upper bound depends on

the radius of the smallest ball containing the training set in the feature space and

the margin between the two classes. They propose a method to dynamically adjust

the kernel parameter during the SVM learning process to find the optimal kernel

parameter which provides the best possible upper bound on the generalization error.

Chapelle et al. [11] optimize the kernel parameters by minimizing different upper

bounds on the error in the leave-one-out procedure which is proved to provide an

almost unbiased estimate of the expected generalization error. The kernel parameters

are optimized by gradient descent methods. However, these approaches have some

limitations. Usually, it is not clear whether these upper bounds are tight enough to

give a good estimate. Moreover, the estimate of the leave-one-out error based on

which bounds are derived may have high variance [42], which may deteriorate the

selection of the kernel parameters.

In another group of methods, the kernel parameters are selected by maximizing

the marginal data likelihood after reformulating the learning problem as probabilis-

tic models. Well-known approaches in this group are the Relevance Vector Machine

(RVM) [91] and the Gaussian processes [79]. RVM uses Bayesian inference to ob-

tain parsimonious solutions for regression and classification. The learning is based

on a type of Expectation-Maximization (EM) method and only local minima could

be found. Gaussian processes provide probabilistic predictions to the test samples

using the Bayesian inference framework. The hyperparameters used in the mean and

covariance functions can be directly estimated by maximizing the marginal data like-

lihood. Gold and Sollich [35] give a probabilistic interpretation of SVM classification

by introducing the application of Bayesian methods to SVM. The SVM classifier is

then viewed as the maximum a posteriori (MAP) solution of the corresponding prob-

abilistic inference problem. Then, the kernel parameters in SVM are optimized by

maximizing the data likelihood. Glasmachers and Igel [34] propose a likelihood func-

tion of the kernel parameters to robustly estimate the class conditional probabilities

based on logistic regression, and kernel parameters are optimized by the maximiza-

tion of this likelihood function using gradient ascent. A major drawback of these

approaches is that since Bayesian learning generally leads to analytically intractable

posteriors, some approximation of the posteriors has to be made. This turns out

to be computationally very expensive. In addition, estimating the priors of the hyperparameters does not have a clear solution.

1.2.2 Kernel matrix learning

The approaches for kernel parameter learning need to specify a known parame-

terized kernel function. However, given the data at hand, one usually does not have

prior knowledge of which kernel function should be used. Different kernel functions

characterize different functional mappings, thus resulting in different performances.

Rather than learning the kernel parameters of a given kernel function, one could try

to directly learn the kernel matrix, which encodes the similarity of all the training

samples.

Liu et al. [54] propose to learn a (so-called) optimal neighborhood kernel matrix by

assuming that the pre-specified kernel matrix generated from the specific application

is a noisy observation of the ideal one. Kernel learning is then based on minimizing

the difference of the pre-specified kernel matrix and the learned one. Yeung et al.

[112] propose a method for learning the kernel matrix based on maximizing a class

separability measure. Since a single kernel is known to be insufficient to describe

the data, multiple kernel learning (MKL) has attracted much attention recently [51,

87]. In [51], the kernel matrix is obtained as a linear combination of pre-specified

base kernels and the optimal coefficients can be determined by using semidefinite

programming, a branch of convex optimization that deals with the optimization of

convex functions over the convex combination of positive semidefinite matrices. Wang

et al. [101] present an alternative approach to MKL. The input data is first mapped

into m different kernel spaces by m different kernel functions and each generated

kernel space is taken as one view of the input space. Then, by using Canonical

Correlation Analysis (CCA), a technique that maximally correlates the m views, a

regularization framework is proposed to guarantee the agreement of the multiview

outputs. Yet, the selection of the base kernel functions and associated parameters is

still an important issue and remains an open problem.
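
In its simplest linear form, the MKL construction described above builds K = Σ_m β_m K_m from a set of pre-specified base kernels with nonnegative weights. The sketch below only assembles such a combination; the particular base kernels, the hand-picked weights, and the positive-semidefiniteness check are illustrative, and the learning of the weights (e.g., by semidefinite or quadratic programming as in [51, 76]) is not shown.

```python
import numpy as np

def rbf_gram(X, sigma):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def poly_gram(X, d, c=1.0):
    return (X @ X.T + c) ** d

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))

# Pre-specified base kernels (an assumption of this example).
bases = [rbf_gram(X, 0.5), rbf_gram(X, 2.0), poly_gram(X, 2)]
beta = np.array([0.5, 0.3, 0.2])        # nonnegative weights, fixed by hand here

K = sum(b * Km for b, Km in zip(beta, bases))

# A nonnegative combination of positive semi-definite matrices remains PSD.
assert np.linalg.eigvalsh(K).min() > -1e-8
```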

MKL is also applied to regression problems. In [76], MKL is applied to Support

Vector Regression (SVR). The coefficients that determine the combination of kernels

are learned using a constrained quadratic programming problem. This method was

shown to outperform CV in some applications. In another approach, the kernel pa-

rameters are selected by maximizing the marginal data likelihood after reformulating

the regression problem as probabilistic models using Bayesian inference. This ap-

proach has been used to define the well-known Relevance Vector Machine (RVM) [91]

and Gaussian processes for regression [104].

One of the disadvantages of the aforementioned approaches is that algorithms for

learning a kernel matrix often scale poorly, with running times that are cubic in the

number of the training samples; thus the application of these algorithms to large-scale

data-sets is limited. Moreover, the multiple kernel learning approach suffers from two

fundamental limitations. First, an explicit formulation to combine different kernels

has to be pre-specified. As it is common, some methods work best in one application

while others outperform it in different settings. Second, the kernel matrix can only

be searched within the space defined by these pre-specified functions. If the kernels

and their parameters are not appropriately specified, the learned kernel matrix will

not perform well in classification and regression.

1.2.3 New kernel development

A main issue of kernel methods is the selection of the kernel functions. Each kernel

characterizes a particular mapping, thus can be used in particular applications. An

appropriately selected kernel function for a given problem could result in a substantial

improvement of the generalization performance.

Although the popularly used kernels, such as the Gaussian RBF kernel and the

polynomial kernel, have shown successful performance in some applications, they have

some known limitations. For instance, the input sample should be in a vector form.

However, in many applications, the input samples could be an ensemble of vectors

and each vector could have a different length. A good example of this type of data

is the protein sequence data. Jaakkola et al. [46] propose a Fisher-based kernel to

detect remote protein homologies. A probabilistic model for each protein sequence is

first built, then the Fisher score, which measures the gradient of the log-likelihood of

the model, is used to represent the sequence sample. Then the similarity between the

two sequences is measured by the inner product of the corresponding Fisher scores. A

good feature of the Fisher kernel is that it combines an underlying generative model

and discriminant classifiers (SVM) in the feature space. Similarly, Moreno et al. [69]

develop a Kullback-Leibler (KL) divergence-based kernel for the use of multimedia

applications. Each multimedia object (a sequence of vectors) is modeled as a Gaussian

distribution, and an intermediate space mapping the object to its probability density

function (pdf) is constructed. The new kernel is evaluated based on the KL divergence

of the two pdfs. Wolf and Shashua [105] derive a more generic kernel for the instances

defined over a space of sets of vectors. Each sample object (a set of vectors) is viewed

as a linear subspace and the kernel is evaluated by measuring the principal angles

between two linear subspaces. This kernel is successfully applied to face recognition

from video.

Some kernels have been developed to be used in some particular applications. For

instance, Odone et al. [72] propose two kernels which are used for images. The images

are first represented as binary strings and then a kernel, as a similarity measure, is

used to operate on them. They further show that the image similarity measures given

by a histogram intersection and the Hausdorff distance can be modified to serve as

kernels. For text classification, Lodhi et al. [55] propose a string kernel to encode the

similarity between the strings. The kernel is generated by using all the subsequences

of length k. Each subsequence forms a dimension in the feature space and weighted

by an exponentially decaying factor of their full length in the text, thus emphasizing

those occurrences that are close to contiguous.

1.3 Research Contributions

From the literature review of model selection in kernel methods, several important

questions are raised. First, in classification, the original goal of a kernel method is to

find a mapping such that the samples in the kernel space could be linearly classified.

To our surprise, no approach thus far has explicitly solved this problem. In other

words, the classifier in the kernel space is not ensured to be linear. Thus, our goal is

to define a first criterion for kernel optimization such that the linear classifier in the

kernel space can be obtained.

Second, in a kernel-based regression problem, model selection plays a significant

role in the regression performance. How to achieve a good balance between the

model fit and model complexity remains a big challenge. We propose an approach

for model selection by adopting multiobjective optimization. By doing so, the model

fit is reduced while the model complexity is kept in check. Finally, in the multiple

kernel learning approaches, an explicit combination of different kernels should be pre-

specified. Is there a way to learn a kernel matrix without specifying an explicit kernel

combination? We explore this idea by using Genetic Algorithms.

In this dissertation, we propose approaches for model selection in kernel methods

in supervised learning. Our approaches are theoretically justified and have been suc-

cessfully used in several applications. In particular, contributions of this dissertation

are as follows:

We develop two criteria to optimize the kernel parameters given a kernel func-

tion based on the idea of Bayes optimality. In the first criterion, kernel pa-

rameters are optimized such that the classification in the kernel space is Bayes

optimal. Thus, this solves the original goal of the kernel mapping: the class

distributions in the kernel space are linearly separable. We achieve this by max-

imizing the homoscedasticity and separability of the pairwise class distributions

simultaneously in the kernel space. We further relax the single Gaussian as-

sumption for class distributions by using a mixture of Gaussians, thus allowing

more flexibility in modeling the distributions. In the second criterion, instead

of searching for a linear classifier, we directly minimize the Bayes error over all

the kernel mappings. Specifically, we present an effective measure to approxi-

mate the Bayes accuracy (defined as one minus Bayes error) in the kernel space.

The optimal kernel is then learned by maximizing this Bayes accuracy over all

kernel representations. Both criteria are shown to outperform the state of the

art kernel optimization approaches.

We propose a model selection framework in kernel-based regression methods.

In this framework, model fit and model complexity in the kernel space are first

directly derived from a decomposition of the generalization error of the learned

function. Then multiobjective optimization is employed to learn a good balance

between model fit and model complexity. A modified ε-constraint approach is

designed such that the Pareto-optimal solution can be achieved. We further

show that our approach can learn not only the kernel parameters, but also those of

a kernel-based regression method.

Since a pre-specified kernel function is not appropriate for a general problem,

we propose to directly learn a kernel matrix using Genetic Algorithm (GA). By

doing so, we eliminate the need for defining a unique way of combining different

kernel matrices, thus allowing more flexibility in modeling a general problem. To

achieve our goal, we define a novel representation used in genetic algorithm.

The kernel matrices are then iteratively modified until the matrix providing the

smallest classification error is obtained. To map test feature vectors, we define

a regression-based approach to determine the underlying function represented

by the selected kernel matrix. We provide comparative results against the state

of the art methods including multiple kernel learning and transductive learning.

The results show the superiority of the proposed approach. We further extend

our method to work with regression and demonstrate its effectiveness.

We propose a family of kernels called Local-density Adaptive kernels. Such

kernels measure the sample similarities by taking into account local density

information. The shape of likelihood evaluation in the proposed kernels can

adaptively vary for different local regions based on a measure of the weighted

local variance. Also, the shape varies in an implicit way such that they are

ensured to be Mercer kernels (i.e., positive semi-definite kernels). The proposed

kernels are shown to perform better than the traditional fixed-shape kernels like

Gaussian RBF kernel and Mahalanobis kernel in several applications.

The rest of this dissertation is organized as follows. The first two criteria are

presented in Chapter 2. Chapter 3 derives a model selection framework based on

multiobjective optimization in regression. In Chapter 4, we derive the Local-density

Adaptive kernels. In Chapter 5, we propose a genetic-based approach to learn a kernel

matrix for both classification and regression. Conclusions and future work are given

in Chapter 6.

CHAPTER 2

KERNEL LEARNING IN DISCRIMINANT ANALYSIS

2.1 Introduction

Discriminant Analysis (DA) is one of the most popular approaches for feature ex-

traction with broad applications in, for example, computer vision and pattern recog-

nition [32], gene expression analysis [63] and paleontology [62]. The problem with

DA algorithms is that each of them makes assumptions on the underlying class dis-

tributions. That is, they assume the class distributions are homoscedastic, $\Sigma_i = \Sigma_j$, $\forall i, j$. This is rarely the case in practice. To resolve this problem, one can first map

the original data distributions (with unequal covariances) into a space where these

become homoscedastic. This mapping may however result in a space of very large

dimensionality. To prevent this, one usually employs the kernel trick [84, 96]. In the

kernel trick, the mapping is only intrinsic, yielding a space of the same dimensionality

as that of the original representation while still eliminating the nonlinearity of the

data by making the class distributions homoscedastic. This is the underlying idea in

Kernel DA (KDA) [67, 5] and variants [110, 109, 40].

The approach described in the preceding paragraph resolves the problem of nonlin-

early separable Normal distributions, but still assumes each class can be represented

by a single Normal distribution. In theory, this can also be learned by the kernel,

since multimodality introduces nonlinearities in the classifier. In practice, however, it

makes the problem of finding the appropriate kernel much more challenging. One way

to add flexibility to the kernel is to allow for each class to be subdivided into several

subclasses. This is the underlying idea behind Subclass DA (SDA) [116]. However,

while SDA resolves the problem of multimodally distributed classes, it assumes that

these subclass divisions are linearly separable. Note that SDA can actually resolve

the problem of nonlinearly separable classes as long as there is a subclass division

that results in linearly separable subclasses yielding a non-linear classifier. The ap-

proach will fail when there is no such division. To resolve this problem, we need to

derive a subclass-based approach that can deal with nonlinearly separable subclasses

[12]. This can be done with the help of a kernel map. In this approach, we need

to find a kernel which maps the subclass division into a linearly separable set. We

refer to this approach as Kernel SDA (KSDA). Note that KSDA has two unknowns:

the number of subclasses and the parameter(s) of the kernel. Hence, finding the

appropriate kernel parameters will generally be easier, a point we will formally show

in the present chapter.

The kernel parameters are the ones that allow us to map a nonlinearly separable

problem into a linear one [84]. Surprisingly, to the best of our knowledge, there is

not a single method in kernel DA designed to find the kernel parameters which map

the problem to a space where the class distributions are linearly separable. To date,

the most employed technique is k-fold cross-validation (CV). In CV, one uses a large

percentage of the data to train the kernel algorithm. Then, we use the remaining

(smaller) percentage of the training samples to test how the classification varies when

we use different values in the parameters of the kernel. The parameters yielding the

highest recognition rates are kept. More recently, [98, 49] showed how one can employ

the Fisher criterion (i.e., the maximization of the ratio between the kernel between-

class scatter matrix and the kernel within-class scatter matrix) to select the kernel

parameters. These approaches aim to maximize classification accuracy within the

training set. However, neither of them aims to solve the original goal of the kernel

map to find a space where the class distributions (or the samples of different classes)

can be separated linearly. Moreover, the Fisher criterion is based on the measures

of class separability. Note that the measure for the class separability is not always

related to the classification error.

In this chapter, we propose two approaches to learn the kernel parameters given

a kernel function. First, we derive an approach whose goal is to specifically map

the original class (or subclass) distributions into a kernel space where these are best

separated by a hyperplane (w.r.t. Bayes). The proposed approach also aims to

maximize the distance between the distributions of different classes, thus maximizing

generalization. We apply the derived approach to three kernel versions of DA, namely

LDA, Nonparametric DA (NDA) and SDA. We show that the proposed techniques

generally achieve higher classification accuracies than the CV and Fisher criteria

defined in the preceding paragraph. In the second approach, we derive a criterion for

selecting the parameters by minimizing the Bayes classification error. To achieve this,

we define a function measuring the Bayes accuracy (i.e., one minus the Bayes error)

in the kernel space. We then show how this function can be efficiently maximized

using gradient ascent. It should be emphasized that this objective function directly

minimizes the classification error, which makes the proposed criterion very powerful.

We will also illustrate how we can employ the same criterion for the selection of other

parameters in discriminant analysis. In particular, we demonstrate the uses of the

derived criterion in the selection of the kernel parameters and the number of subclasses

in KSDA. Before we present the derivations of our approaches, we introduce a general

formulation of DA common to most variants. We also derive kernel versions for NDA

and SDA.

2.2 The metrics of discriminant analysis

DA is a supervised technique for feature extraction and classification. Theoreti-

cally, its advantage over unsupervised techniques is that it provides the repre-

sentation where the underlying class distributions are best separated. Unfortunately,

due to the number of possible solutions, this goal is not always fulfilled in practice

[63]. With infinite time or computational power, one could always find the optimal

representation. With finite time and resources, it is generally impossible to account

for all the possible linear combinations of features, let alone a set of nonlinear com-

binations. This means that one needs to define criteria that can find an appropriate

solution under some general, realistic assumptions.

The least-squares extension of Fisher's criterion [28, 32] is arguably the most

known. In this solution, LDA employs two symmetric, positive semi-definite matri-

ces, each defining a metric [63]. One of these metrics should measure within-class

differences and, as such, should be minimized. The other metric should account for

between-class dissimilarity and should thus be maximized. Classical choices for the

first metric are the within-class scatter matrix $S_W$ and the sample covariance matrix $\Sigma_X$, while the second metric is usually given by the between-class scatter matrix $S_B$.

The sample covariance matrix is defined as $\Sigma_X = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^T$, where $X = \{x_1, \dots, x_n\}$ are the $n$ training samples, $x_i \in \mathbb{R}^p$, and $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean. The between-class scatter matrix is given by $S_B = \sum_{i=1}^{C} p_i\,(\mu_i - \mu)(\mu_i - \mu)^T$, where $\mu_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$ is the sample mean of class $i$, $x_{ij}$ is the $j$th sample of class $i$, $n_i$ is the number of samples in that class, $C$ is the number of classes, and $p_i = n_i/n$ is the prior of class $i$. LDA's solution is then given by the generalized eigenvalue decomposition equation $\Sigma_X^{-1} S_B \mathbf{V} = \mathbf{V}\Lambda$, where the columns of $\mathbf{V}$ are the eigenvectors, and $\Lambda$ is a diagonal matrix of corresponding eigenvalues.
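As a concrete illustration of this decomposition, the following minimal sketch (in Python with NumPy/SciPy; the function and variable names are ours and are not taken from any cited implementation) builds $S_B$ and $\Sigma_X$ from labeled data and solves the generalized eigenvalue problem above.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, d):
    """Compute an LDA projection matrix V from samples X (n x p) and labels y.

    Solves Sigma_X^{-1} S_B V = V Lambda via the symmetric generalized
    eigenproblem S_B v = lambda Sigma_X v (scipy.linalg.eigh).
    """
    n, p = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma_X = Xc.T @ Xc / n                      # sample covariance matrix
    S_B = np.zeros((p, p))                       # between-class scatter
    for c in np.unique(y):
        Xi = X[y == c]
        p_i = Xi.shape[0] / n
        diff = (Xi.mean(axis=0) - mu).reshape(-1, 1)
        S_B += p_i * (diff @ diff.T)
    # small ridge added only for numerical stability of the generalized problem
    evals, evecs = eigh(S_B, Sigma_X + 1e-8 * np.eye(p))
    order = np.argsort(evals)[::-1]              # largest eigenvalues first
    return evecs[:, order[:d]]

# Example: project 3-class toy data onto 2 discriminant directions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(30, 4)) for m in (0, 3, 6)])
y = np.repeat([0, 1, 2], 30)
V = lda_projection(X, y, d=2)
Z = X @ V
```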

To loosen the parametric restriction on the above defined metrics, Fukunaga and

Mantock defined NDA [31], where the between-class scatter matrix is changed to a
non-parametric version, $S_b = \sum_{i=1}^{C}\sum_{\substack{j=1 \\ j\neq i}}^{C}\sum_{l=1}^{n_i} \alpha_{ijl}\,(x_{il} - \mu_{jil})(x_{il} - \mu_{jil})^T$, where $\mu_{jil}$ is the sample mean of the $k$-nearest samples to the sample $x_{il}$ that do not belong to class $i$, and $\alpha_{ijl}$ is a scale factor that deemphasizes large values (i.e., outliers). Alternatively,

Friedman [30] proposed to add a regularizing parameter to the within-class measure,

allowing for the minimization of the generalization error. This regularizing parame-

ter can be learned using CV, yielding the method Regularized DA (RDA). Another

variant of LDA is given by Loog et al. [58], who introduced a weighted version of

the metrics in an attempt to downplay the roles of the class distributions that are

farthest apart. More formally, they noted that the above introduced Fisher criterion
for LDA can be written as $\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\, \omega_{ij}\, \mathrm{tr}\!\left(\left(V^T S_W V\right)^{-1} V^T S_{ij} V\right)$, where $S_{ij} = (\mu_i - \mu_j)(\mu_i - \mu_j)^T$, and $\omega_{ij}$ are the weights. In Fisher's LDA, all $\omega_{ij} = 1$. Loog et al. suggest making these weights inversely proportional to their pairwise accuracy (defined as one minus the Bayes error). Similarly, we can define a weighted version of the within-class scatter matrix $S_W = \sum_{c=1}^{C}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c} \omega_{ckl}\,(x_{ck} - x_{cl})(x_{ck} - x_{cl})^T$.

In LDA, $\omega_{ckl}$ are all equal to one. In its weighted version, $\omega_{ckl}$ are defined according to the importance of each sample in classification. Using the same notation, we can also define a nonparametric between-class scatter matrix as $S_B = \sum_{i=1}^{C-1}\sum_{j=1}^{n_i}\sum_{k=i+1}^{C}\sum_{l=1}^{n_k} \omega_{ijkl}\,(x_{ij} - x_{kl})(x_{ij} - x_{kl})^T$, where $\omega_{ijkl}$ are the weights. Note that in these two definitions, the priors have been combined with the weights to provide a more compact formulation.

All the methods introduced in the preceding paragraphs assume the class distribu-

tions are unimodal Gaussians. To address this limitation, Subclass DA (SDA) [116]

defines a multimodal between-subclass scatter matrix,

$$\Sigma_B = \sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\, (\mu_{ij} - \mu_{kl})(\mu_{ij} - \mu_{kl})^T, \qquad (2.1)$$

where $p_{ij} = n_{ij}/n$ is the prior of the $j$th subclass of class $i$, $n_{ij}$ is the number of samples in the $j$th subclass of class $i$, $H_i$ is the number of subclasses in class $i$, $\mu_{ij} = \frac{1}{n_{ij}}\sum_{k=1}^{n_{ij}} x_{ijk}$ is the sample mean of the $j$th subclass in class $i$, and $x_{ijk}$ denotes the $k$th sample in the $j$th subclass in class $i$.

The algorithms summarized thus far assume the class (or subclass) distributions

are homoscedastic. To deal with heteroscedastic (i.e., non-homoscedastic) distribu-

tions, [57] defines a within-class similarity metric using the Chernoff distance, yielding

an algorithm we will refer to as Heteroscedastic LDA (HLDA). Alternatively, one can

use an embedding approach such as Locality Preserving Projection (LPP) [43]. LPP

finds that subspace where the structure of the data is locally preserved, allowing for

nonlinear classifications. An alternative to these algorithms is to employ a kernel

function which intrinsically maps the original data distributions to a space where

these adapt to the assumptions of the approach in use. KDA [67, 5] redefines the

within- and between-class scatter matrices in the kernel space to derive feature ex-

traction algorithms that are nonlinear in the original space but linear in the kernel

one. This is achieved by means of a mapping function $\phi(.): \mathbb{R}^p \rightarrow \mathcal{F}$. The sample covariance and between-class scatter matrices in the kernel space are given by $\Sigma_X^{\phi} = \frac{1}{n}\sum_{i=1}^{n}(\phi(x_i) - \mu^{\phi})(\phi(x_i) - \mu^{\phi})^T$ and $S_B^{\phi} = \sum_{i=1}^{C} p_i\,(\mu_i^{\phi} - \mu^{\phi})(\mu_i^{\phi} - \mu^{\phi})^T$, where $\mu^{\phi} = \frac{1}{n}\sum_{i=1}^{n}\phi(x_i)$ is the kernel sample mean, and $\mu_i^{\phi} = \frac{1}{n_i}\sum_{j=1}^{n_i}\phi(x_{ij})$ is the kernel sample mean of class $i$.

Unfortunately, the dimensionality of $\mathcal{F}$ may be too large. To bypass this problem, one generally uses the kernel trick, which works as follows. Let $A^{\phi}$ and $B^{\phi}$ be two metrics in the kernel space and $V$ the projection matrix obtained by $A^{\phi} V = B^{\phi} V \Lambda$. We know from the Representer's Theorem [96] that the resulting projection matrix can be defined as a linear combination of the samples in the kernel space $\phi(X)$ with the coefficient matrix $\Gamma$, i.e., $V = \phi(X)\Gamma$. Hence, to calculate the projection matrix, we need to obtain the coefficient matrix $\Gamma$ by solving $A\,\Gamma = B\,\Gamma\,\Lambda$, where $A = \phi(X)^T A^{\phi}\,\phi(X)$ and $B = \phi(X)^T B^{\phi}\,\phi(X)$ are the two metrics that need to be maximized and minimized. Using this trick, the metric for $\Sigma_X^{\phi}$ is given by $B_{\Sigma_X} = \phi(X)^T \Sigma_X^{\phi}\,\phi(X) = \frac{1}{n}\sum_{i=1}^{n}\phi(X)^T(\phi(x_i) - \mu^{\phi})(\phi(x_i) - \mu^{\phi})^T\phi(X) = \frac{1}{n} K (I - P_n) K$, where $K = \phi(X)^T\phi(X)$ is the kernel (Gram) matrix and $P_n$ is the $n \times n$ matrix with each of its elements equal to $1/n$.

Similarly, $B_{S_W} = \frac{1}{C}\sum_{i=1}^{C}\phi(X)^T \Sigma_i^{\phi}\,\phi(X) = \frac{1}{C}\sum_{i=1}^{C}\frac{1}{n_i} K_i (I - P_{n_i}) K_i^T$, where $\Sigma_i^{\phi} = \frac{1}{n_i}\sum_{j=1}^{n_i}(\phi(x_{ij}) - \mu_i^{\phi})(\phi(x_{ij}) - \mu_i^{\phi})^T$ is the kernel within-class covariance matrix of class $i$, and $K_i = \phi(X)^T\phi(X_i)$ is the subset of the kernel matrix for the samples in class $i$. The metric for $S_B^{\phi}$ can be obtained as $A_{S_B} = \sum_{i=1}^{C} p_i\,(K_i\mathbf{1}_{n_i} - K\mathbf{1}_n)(K_i\mathbf{1}_{n_i} - K\mathbf{1}_n)^T$, where $\mathbf{1}_{n_i}$ is a vector with all elements equal to $1/n_i$. The coefficient matrix for KDA is given by $B_{KDA}^{-1} A_{KDA}\,\Gamma_{KDA} = \Gamma_{KDA}\Lambda_{KDA}$, where $B_{KDA}$ can be either $B_{\Sigma_X}$ or $B_{S_W}$, and $A_{KDA} = A_{S_B}$.
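To make these kernel trick computations concrete, the following is a small illustrative sketch (our own code and naming, not the dissertation's implementation) that forms the RBF Gram matrix and the metrics $B_{\Sigma_X}$ and $A_{S_B}$ defined above.

```python
import numpy as np

def rbf_gram(X, sigma):
    """Gram matrix K with k(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / sigma)

def kda_metrics(K, y):
    """Return (B, A): B = (1/n) K (I - P_n) K and A = A_{S_B} from the text."""
    n = K.shape[0]
    P_n = np.full((n, n), 1.0 / n)
    B = K @ (np.eye(n) - P_n) @ K / n
    one_n = np.full(n, 1.0 / n)
    A = np.zeros((n, n))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        p_i = len(idx) / n
        one_ni = np.full(len(idx), 1.0 / len(idx))
        diff = K[:, idx] @ one_ni - K @ one_n     # K_i 1_{n_i} - K 1_n
        A += p_i * np.outer(diff, diff)
    return B, A

# The KDA coefficients Gamma then solve A Gamma = B Gamma Lambda,
# e.g., with scipy.linalg.eig(A, B); a small regularizer on B helps in practice.
```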

We can similarly derive kernel approaches for the other methods introduced above.

For example, in Kernel NDA (KNDA), the metric $A_{KNDA}$ is obtained by defining its corresponding scatter matrix in the kernel space as

$$A_{KNDA} = \phi(X)^T S_b^{\phi}\,\phi(X) = \sum_{i=1}^{C}\sum_{\substack{j=1 \\ j\neq i}}^{C}\sum_{l=1}^{n_i} \alpha_{ijl}\,(k_{il} - M_{jil}\mathbf{1}_k)(k_{il} - M_{jil}\mathbf{1}_k)^T,$$

where $k_{il} = \phi(X)^T\phi(x_{il})$ is the kernel space representation of the sample $x_{il}$, $M_{jil} = \phi(X)^T\phi(X_{jil})$ is the kernel matrix of the $k$-nearest neighbors of $x_{il}$, $X_{jil}$ is a matrix whose columns are the $k$-nearest neighbors of $x_{il}$, and $\alpha_{ijl}$ is the normalizing factor computed in the kernel space.

Kernel SDA (KSDA) maximizes the kernel between-subclass scatter matrix $\Sigma_B^{\phi}$ [12]. This matrix is given by replacing the subclass means of (2.1) with the kernel subclass means $\mu_{ij}^{\phi} = \frac{1}{n_{ij}}\sum_{k=1}^{n_{ij}}\phi(x_{ijk})$. Now, we can use the kernel trick to obtain the matrix to be maximized,

$$A_{KSDA} = \sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\, (K_{ij}\mathbf{1}_{ij} - K_{kl}\mathbf{1}_{kl})(K_{ij}\mathbf{1}_{ij} - K_{kl}\mathbf{1}_{kl})^T,$$

where $K_{ij} = \phi(X)^T\phi(X_{ij})$ is the kernel matrix of the samples in the $j$th subclass of class $i$, and $\mathbf{1}_{ij}$ is a $n_{ij} \times 1$ vector with all elements equal to $1/n_{ij}$.

If we are to successfully employ the above derived approaches in practical settings,

it is imperative that we define criteria to optimize these parameters. The classical

approach to determine the parameters of the kernel is CV, where we divide the train-

ing data into $k$ parts: $(k-1)$ of them for training the algorithm with distinct values
for the parameters of the kernel, and the remaining one for validating which of these

values results in higher (average) classification rates. This solution has three major

drawbacks. First, the kernel parameters are only optimized for the training data, not

the distributions [117]. Second, CV is computationally expensive and may become

very demanding for large data-sets. Third, not all the training data can be used to op-

timize the parameters of the kernel. To avoid these problems, [98] defines a criterion

to maximize the kernel between-class difference and minimize the kernel within-class

scatter as Fisher had originally proposed but now applied to the selection of the

kernel parameters. This method was shown to yield higher classification accuracies

than CV in a variety of problems. A related approach [49] is to redefine the kernelized

Fisher criterion as a convex optimization problem. Alternatively, Ye et al. [111] have

proposed a kernel version of RDA where the kernel is learned as a linear combination

of a set of pre-specified kernels. However, these approaches do not guarantee that

the kernel or kernel parameters we choose will result in homoscedastic distributions

in the kernel space. This would be ideal, because it would guarantee that the Bayes

classifier (which is the one with the smallest error in that space) is linear.
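For reference, a minimal sketch of the cross-validation baseline discussed above is given below (our own illustration; `train_and_score` is an assumed user-supplied routine that trains a kernel DA model with a given sigma and returns the validation accuracy).

```python
import numpy as np

def cv_select_sigma(X, y, sigmas, train_and_score, k=5, seed=0):
    """k-fold CV: return the sigma with the highest average validation accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    best_sigma, best_acc = None, -np.inf
    for sigma in sigmas:
        accs = []
        for f in range(k):
            val = folds[f]
            trn = np.concatenate([folds[g] for g in range(k) if g != f])
            accs.append(train_and_score(X[trn], y[trn], X[val], y[val], sigma))
        mean_acc = float(np.mean(accs))
        if mean_acc > best_acc:
            best_sigma, best_acc = sigma, mean_acc
    return best_sigma, best_acc
```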

In the sections to follow, we will present our approaches in kernel optimization.

2.3 Homoscedastic criterion

The goal of the first criterion is to find a kernel which maps the original class

distributions to homoscedastic ones while keeping them as far apart from each other

as possible. This criterion is related to the approach presented in [41] where the goal

was to optimize a distinct version of homoscedasticity defined in the complex sphere.

The criterion we derive here could be extended to work in the complex sphere and is

thus a more general approach.

2.3.1 Maximizing homoscedasticity

To derive our homoscedastic criterion, we need to answer the following question.

What is a good measure of homoscedasticity? That is, we need to define a criterion

which is maximized when all class covariances are identical. The value of the criterion

should also decrease as the distributions become more different. We now present a

key result applicable to this end.

Theorem 1. Let $\Sigma_i^{\phi}$ and $\Sigma_j^{\phi}$ be the kernel covariance matrices of two Normal distributions in the kernel space defined by the function $\phi(.)$. Then, $Q_1 = \frac{\mathrm{tr}(\Sigma_i^{\phi}\Sigma_j^{\phi})}{\mathrm{tr}(\Sigma_i^{\phi 2}) + \mathrm{tr}(\Sigma_j^{\phi 2})}$ takes the maximum value of $.5$ when $\Sigma_i^{\phi} = \Sigma_j^{\phi}$, i.e., when the two Normal distributions are homoscedastic in the kernel space.

Proof. $\Sigma_i^{\phi}$ and $\Sigma_j^{\phi}$ are two $p \times p$ positive semi-definite matrices with spectral decompositions $\Sigma_i^{\phi} = V_i \Lambda_i V_i^T$, where $V_i = \left(v_{i1}, \dots, v_{ip}\right)$ and $\Lambda_i = \mathrm{diag}\left(\lambda_{i1}, \dots, \lambda_{ip}\right)$ are the eigenvector and eigenvalue matrices.

The denominator of $Q_1$, $\mathrm{tr}(\Sigma_i^{\phi 2}) + \mathrm{tr}(\Sigma_j^{\phi 2})$, only depends on the selection of the kernel. For a fixed kernel (and fixed kernel parameters), its value is constant regardless of any divergence between $\Sigma_i^{\phi}$ and $\Sigma_j^{\phi}$. Hence, $\mathrm{tr}(\Sigma_i^{\phi 2}) + \mathrm{tr}(\Sigma_j^{\phi 2}) = \mathrm{tr}(\Lambda_i^2) + \mathrm{tr}(\Lambda_j^2)$. We also know that $\mathrm{tr}(\Sigma_i^{\phi}\Sigma_j^{\phi}) \leq \mathrm{tr}(\Lambda_i\Lambda_j)$, with the equality holding when $V_i^T V_j = I$ [90], i.e., the eigenvectors of $\Sigma_i^{\phi}$ and $\Sigma_j^{\phi}$ are not only the same but are in the same order, $v_{ik} = v_{jk}$. Using these two results, we can write

$$Q_1 \leq \frac{\sum_{m=1}^{p} \lambda_{im}\lambda_{jm}}{\sum_{m=1}^{p} \lambda_{im}^2 + \sum_{m=1}^{p} \lambda_{jm}^2}.$$

Now, let us define every eigenvalue of $\Sigma_i^{\phi}$ as a multiple of those of $\Sigma_j^{\phi}$, i.e., $\lambda_{im} = k_m \lambda_{jm}$, $k_m \geq 0$, $m = 1, \dots, p$. This allows us to rewrite our criterion as

$$Q_1 \leq \frac{\sum_{m=1}^{p} k_m \lambda_{jm}^2}{\sum_{m=1}^{p} \lambda_{jm}^2 (k_m^2 + 1)}.$$

From the above equation, we see that $Q_1 \geq 0$, since all its variables are positive. The maximum value of $Q_1$ will be attained when all $k_m = 1$, which yields $Q_1 = .5$. We now note that having all $k_m = 1$ implies that the eigenvalues of the two covariance matrices are the same. We also know that the maximum of $Q_1$ can only be reached when the eigenvectors are the same and in the same order, as stated above. This means that the two Normal distributions are homoscedastic in the kernel space defined by $\phi(.)$ when $Q_1 = .5$.
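A short numerical sketch of $Q_1$ (our own code) makes the result easy to verify: the criterion reaches $.5$ only when the two covariance matrices coincide.

```python
import numpy as np

def Q1(S1, S2):
    """Homoscedasticity measure tr(S1 S2) / (tr(S1^2) + tr(S2^2)); max value .5."""
    return np.trace(S1 @ S2) / (np.trace(S1 @ S1) + np.trace(S2 @ S2))

A = np.diag([3.0, 1.0, 0.5])
print(Q1(A, A))            # 0.5, the two distributions are homoscedastic
print(Q1(A, 2.0 * A))      # 0.4 < 0.5, eigenvalues differ by a factor of two
```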

From the above result, we see that we can already detect when two distributions

are homoscedastic in a kernel space. This means that for a given kernel function,

we can find those kernel parameters which give us Q1 = .5. Note that the closer we

get to this maximum value, the more similar the two distributions ought to be, since

their eigenvalues will become closer to each other. To show this, we would now like

to prove that when the value of Q1 increases, then the divergence between the two

distributions decreases.

Divergence is a classical mechanism used to measure the similarity between two

distributions. A general type of divergence employed to calculate the similarity be-

tween samples from convex sets is the Bregman divergence [8]. Formally, for a given

continuously-differentiable strictly convex function $G: \mathbb{R}^{p\times p} \rightarrow \mathbb{R}$, the Bregman divergence over real symmetric matrices is defined as

$$B_G(\mathbf{X}, \mathbf{Y}) = G(\mathbf{X}) - G(\mathbf{Y}) - \mathrm{tr}\!\left(\nabla G(\mathbf{Y})^T (\mathbf{X} - \mathbf{Y})\right), \qquad (2.2)$$

where $\mathbf{X}, \mathbf{Y} \in \{\mathbf{Z} \mid \mathbf{Z} \in \mathbb{R}^{p\times p},\ \mathbf{Z} = \mathbf{Z}^T\}$, and $\nabla$ is the gradient.

Note that the definition given above for the Bregman divergence is very general. In

fact, many other divergence measures (such as the Kullback-Leibler) as well as several

commonly employed distances (e.g. Mahalanobis and Frobenius) are a particular case

of Bregman's. Consider the case where $G(\mathbf{X}) = \mathrm{tr}(\mathbf{X}^T\mathbf{X})$, which computes the squared Frobenius norm. In this case, the Bregman divergence is $B_G(\Sigma_1, \Sigma_2) = \mathrm{tr}(\Sigma_1^2) + \mathrm{tr}(\Sigma_2^2) - 2\,\mathrm{tr}(\Sigma_1\Sigma_2)$, where, as above, $\Sigma_i$ are the covariance matrices of the two distributions that we wish to compare. We can also rewrite this result using the covariances in the kernel space as,

$$B_G(\Sigma_1^{\phi}, \Sigma_2^{\phi}) = \mathrm{tr}(\Sigma_1^{\phi 2}) + \mathrm{tr}(\Sigma_2^{\phi 2}) - 2\,\mathrm{tr}(\Sigma_1^{\phi}\Sigma_2^{\phi}),$$

where now $G(\mathbf{X}) = \mathrm{tr}(\phi(\mathbf{X})^T\phi(\mathbf{X}))$.
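A minimal numerical check of this particular Bregman divergence (our own sketch, using $G(\mathbf{X}) = \mathrm{tr}(\mathbf{X}^T\mathbf{X})$) is given below; for symmetric matrices it coincides with the squared Frobenius norm of the difference.

```python
import numpy as np

def bregman_frobenius(S1, S2):
    """B_G(S1, S2) with G(X) = tr(X^T X): tr(S1^2) + tr(S2^2) - 2 tr(S1 S2).

    For symmetric S1, S2 this equals the squared Frobenius norm ||S1 - S2||_F^2.
    """
    return np.trace(S1 @ S1) + np.trace(S2 @ S2) - 2.0 * np.trace(S1 @ S2)

S1 = np.diag([2.0, 1.0])
S2 = np.diag([1.0, 1.0])
print(bregman_frobenius(S1, S2))               # 1.0
print(np.linalg.norm(S1 - S2, 'fro') ** 2)     # 1.0, same value
```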

Note that to decrease the divergence (i.e., the value of $B_G$), we need to minimize $\mathrm{tr}(\Sigma_1^{\phi 2}) + \mathrm{tr}(\Sigma_2^{\phi 2})$ and/or maximize $\mathrm{tr}(\Sigma_1^{\phi}\Sigma_2^{\phi})$. The more we lower the former and increase the latter, the smaller the Bregman divergence will be. Similarly, when we decrease the value of $\mathrm{tr}(\Sigma_1^{\phi 2}) + \mathrm{tr}(\Sigma_2^{\phi 2})$ and/or increase that of $\mathrm{tr}(\Sigma_1^{\phi}\Sigma_2^{\phi})$, we make the value of $Q_1$ larger. Hence, as the value of our criterion $Q_1$ increases, the Bregman divergence between the two distributions decreases, i.e., the two distributions become more alike. This result is illustrated in Fig. 2.1. We can formally summarize this result as follows.

Theorem 2. Maximizing $Q_1$ is equivalent to minimizing the Bregman divergence $B_G(\Sigma_1^{\phi}, \Sigma_2^{\phi})$ between the two kernel covariance matrices $\Sigma_1^{\phi}$ and $\Sigma_2^{\phi}$, where $G(\mathbf{X}) = \mathrm{tr}(\phi(\mathbf{X})^T\phi(\mathbf{X}))$.

Figure 2.1: Three examples of the use of the homoscedastic criterion, $Q_1$. The examples are for two Normal distributions with equal covariance matrix up to scale and rotation. (a) The value of $Q_1$ decreases as the angle $\theta$ increases. The 2D rotation $\theta$ between the two distributions is in the x axis. The value of $Q_1$ is in the y axis. (b) When $\theta = 0^{\circ}$, the two distributions are homoscedastic, and $Q_1$ takes its maximum value of $.5$. Note how for distributions that are close to homoscedastic (i.e., $\theta \approx 0^{\circ}$), the value of the criterion remains high. (c) When $\theta = 45^{\circ}$, the value has decreased to about $.4$. (d) By $\theta = 90^{\circ}$, $Q_1 \approx .3$.

We have now shown that the criterion Q1 increases as any two distributions be-

come more similar to one another. We can readily extend this result to the multiple

distribution case,

$$Q_1(\phi) = \frac{2}{C(C-1)}\sum_{i=1}^{C-1}\sum_{k=i+1}^{C} \frac{\mathrm{tr}(\Sigma_i^{\phi}\Sigma_k^{\phi})}{\mathrm{tr}(\Sigma_i^{\phi 2}) + \mathrm{tr}(\Sigma_k^{\phi 2})}, \qquad (2.3)$$

where $\Sigma_i^{\phi}$ is the sample covariance matrix of the $i$th class. This criterion measures the average homoscedasticity of all pairwise class distributions.

This criterion can be directly used in KDA, KNDA and others. Moreover, the same criterion can be readily extended to work in KSDA,

$$Q_1(\phi, H_1, \dots, H_C) = \frac{1}{h}\sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} \frac{\mathrm{tr}(\Sigma_{ij}^{\phi}\Sigma_{kl}^{\phi})}{\mathrm{tr}(\Sigma_{ij}^{\phi 2}) + \mathrm{tr}(\Sigma_{kl}^{\phi 2})},$$

where $\Sigma_{ij}^{\phi}$ is the sample covariance matrix of the $j$th subclass of class $i$, and $h$ is the number of summing terms.

The reason we needed to derive the above criterion is because, in the multi-class

case, the addition of the Bregman divergences would cancel each other out. Moreover,

the derived criterion is scale invariant, while Bregman is not.

It may now seem that the criterion Q1 is ideal for all kernel versions of DA. To

study this further, let us define a particular kernel function. An appropriate kernel

is the RBF (Radial Basis Function), because it is specifically tailored for Normal

distributions. We will now show that, although homoscedasticity guarantees that

the Bayes classifier is linear in this RBF kernel space, it does not guarantee that

the class distributions will be separable. In fact, it can be shown that Q1 may

favor a kernel map where all (sub)class distributions become the same, i.e., identical

covariance matrix and mean. This is indeed a particular but useless case of homoscedasticity

in classification problems.
Theorem 3. The RBF kernel is $k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{\sigma}\right)$, with scale parameter $\sigma$. In the two class problem, $C = 2$, let the pairwise between class distances be $\{D_{11}, D_{12}, \dots, D_{n_1 n_2}\}$, where $D_{ij} = \|x_i - x_j\|_2^2$ is the (squared) Euclidean distance calculated between two sample vectors, $x_i$ and $x_j$, of different classes, and $n_1$ and $n_2$ are the number of elements in each class. Similarly, let the pairwise within class distances be $\{d_{11}^1, d_{12}^1, \dots, d_{n_1 n_1}^1, d_{11}^2, d_{12}^2, \dots, d_{n_2 n_2}^2\}$, where $d_{kl}^c = \|x_{ck} - x_{cl}\|_2^2$ is the (squared) Euclidean distance between sample vectors of the same class $c$. And, use $S_W$ with the normalized weights

$$\bar{\omega}_{ckl} = \frac{\exp\!\left(-\frac{2 d_{kl}^c}{\sigma}\right)}{\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\exp\!\left(-\frac{2 d_{kl}^c}{\sigma}\right)}$$

and $S_B$ with the normalized weights

$$\bar{\omega}_{1i2j} = \frac{\exp\!\left(-\frac{2 D_{ij}}{\sigma}\right)}{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\exp\!\left(-\frac{2 D_{ij}}{\sigma}\right)}.$$

Then, if $\mathrm{tr}(S_B) > \mathrm{tr}(S_W)$, $Q_1(.)$ monotonically increases with $\sigma$, i.e., $\frac{\partial Q_1}{\partial \sigma} \geq 0$.

Proof. Note that both the numerator and the denominator of $Q_1$ can be written in the form of $\sum_i\sum_j \exp\left(-2\|x_i - x_j\|_2^2/\sigma\right)$. Its partial derivative with respect to $\sigma$ is $\sum_i\sum_j \frac{2\|x_i - x_j\|_2^2}{\sigma^2}\exp\left(-2\|x_i - x_j\|_2^2/\sigma\right)$. Substituting for $D_{ij}$ and $d_{kl}^c$, we have $\frac{\partial Q_1}{\partial \sigma}$ equal to

$$\frac{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\frac{2D_{ij}}{\sigma^2}\exp\left(-\frac{2D_{ij}}{\sigma}\right)\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\exp\left(-\frac{2d_{kl}^c}{\sigma}\right)}{\left[\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\exp\left(-\frac{2d_{kl}^c}{\sigma}\right)\right]^2} - \frac{\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\frac{2d_{kl}^c}{\sigma^2}\exp\left(-\frac{2d_{kl}^c}{\sigma}\right)\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\exp\left(-\frac{2D_{ij}}{\sigma}\right)}{\left[\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\exp\left(-\frac{2d_{kl}^c}{\sigma}\right)\right]^2}.$$

We want to know when $\partial Q_1/\partial \sigma \geq 0$, which is the same as

$$\frac{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\exp\left(-\frac{2D_{ij}}{\sigma}\right) D_{ij}}{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\exp\left(-\frac{2D_{ij}}{\sigma}\right)} > \frac{\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\exp\left(-\frac{2d_{kl}^c}{\sigma}\right) d_{kl}^c}{\sum_{c=1}^{2}\sum_{k=1}^{n_c}\sum_{l=1}^{n_c}\exp\left(-\frac{2d_{kl}^c}{\sigma}\right)}.$$

The left hand side of this inequality is the estimate of the between class variance, while the right hand side is the within class variance estimate, since $D_{ij}$ and $d_{ij}^c$ can be rewritten as the trace of the outer product $\mathrm{tr}\left((x_i - x_j)(x_i - x_j)^T\right)$. Substituting for the above defined $\bar{\omega}_{ckl}$ and $\bar{\omega}_{1i2j}$, we have $\partial Q_1/\partial \sigma \geq 0$ when $\mathrm{tr}(S_B) > \mathrm{tr}(S_W)$.

This latest theorem shows that when $\sigma$ approaches infinity, $\frac{\partial Q_1}{\partial \sigma}$ approaches zero and, hence, $Q_1$ tends to its maximum value of $.5$. Increasing $\sigma$ to infinity in the RBF

kernel will result in a space where the two class distributions become identical. This

will happen whenever tr(SB ) > tr(SW ). This is a fundamental theorem of DA because

it shows the relation between KDA, the weighted LDA version of [58] and the NDA

method of [31]. Theorem 3 shows that these variants of DA are related to the idea

of maximizing homoscedasticity as defined in this chapter. It also demonstrates the

importance of the metrics in weighted LDA and NDA. In particular, the above result

proves that if, after proper normalization, the between class differences are larger

than the within class differences, then classification in the kernel space optimized

with Q1 will be as bad as random selection. One indeed wants the class distributions

to become homoscedastic in the kernel space, but not at the cost of classification

accuracy, which is the underlying goal.

To address the problem outlined in Theorem 3, we need to consider a second

criterion which is directly related to class separability. Such a criterion is simply

given by the trace of the between-class (or -subclass) scatter matrix, since this is

proportional to class separability,



$$Q_2(\phi) = \mathrm{tr}\left(S_B^{\phi}\right) = \mathrm{tr}\!\left(\sum_{i=1}^{C-1}\sum_{k=i+1}^{C} p_i\, p_k\,(\mu_i^{\phi} - \mu_k^{\phi})(\mu_i^{\phi} - \mu_k^{\phi})^T\right) = \sum_{i=1}^{C-1}\sum_{k=i+1}^{C} p_i\, p_k\,\|\mu_i^{\phi} - \mu_k^{\phi}\|^2. \qquad (2.4)$$

Again, we can readily extend this result to work with subclasses,

$$Q_2(\phi, H_1, \dots, H_C) = \mathrm{tr}\left(\Sigma_B^{\phi}\right) = \sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\,\|\mu_{ij}^{\phi} - \mu_{kl}^{\phi}\|^2.$$

Since we want to maximize homoscedasticity and class separability, we need to combine the two criteria of (2.3) and (2.4),

$$Q(.) = Q_1(.)\, Q_2(.). \qquad (2.5)$$

The product given above is an appropriate way to combine independent measures of different magnitude, as is the case with $Q_1$ and $Q_2$.

Using the criterion given in (2.5), the optimal kernel function, $\phi^*$, is

$$\phi^* = \arg\max_{\phi} Q(\phi).$$

In KSDA, we optimize the number of subclasses and the kernel as

$$\phi^*, H_1^*, \dots, H_C^* = \arg\max_{\phi, H_1, \dots, H_C} Q(\phi, H_1, \dots, H_C).$$

Also, recall that in KSDA (as in SDA), we need to divide the data into subclasses. As

stated above we assume that the underlying class distribution can be approximated

by a mixture of Gaussians. This assumption suggests the following ordering of the samples: $\hat{X}_c = \{\hat{x}_1, \dots, \hat{x}_n\}$, where $\hat{x}_1$ and $\hat{x}_n$ are the two most dissimilar feature vectors and $\hat{x}_k$ is the $(k-1)$th feature vector closest to $\hat{x}_1$. This ordering allows us to divide the set of samples into $H$ subgroups, by simply dividing $\hat{X}_c$ into $H$ parts. This approach has been shown to be appropriate for finding subclass divisions [116].
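A small sketch of this ordering-and-splitting procedure (our own implementation of the description above; the names are illustrative) is:

```python
import numpy as np

def subclass_partition(Xc, H):
    """Split the samples of one class into H subclasses.

    The samples are first sorted so that the first and last elements are the two
    most dissimilar vectors and each subsequent sample is the next closest one to
    the first; the sorted list is then cut into H contiguous parts.
    """
    d2 = np.sum((Xc[:, None, :] - Xc[None, :, :]) ** 2, axis=-1)
    i, _ = np.unravel_index(np.argmax(d2), d2.shape)   # one end of the most distant pair
    order = np.argsort(d2[i])                          # ascending distance to x_hat_1
    return np.array_split(order, H)                    # H index groups (subclasses)

# Example: two well-separated blobs are recovered as two subclasses
rng = np.random.default_rng(1)
Xc = np.vstack([rng.normal(0, .5, (20, 2)), rng.normal(5, .5, (20, 2))])
groups = subclass_partition(Xc, H=2)
```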

As a final note, it is worth emphasizing that, as opposed to CV, the derived crite-

rion will use the whole data in the training set for estimating the data distributions

because there is no need for a verification set. With a limited number of training sam-

ples, this will generally yield better estimates of the unknown underlying distribution.

The other advantage of the derived approach is that it can be optimized using gra-

dient descent, by taking $\partial Q(k(x_i, x_j))/\partial \sigma$. In particular, we employ a quasi-Newton

approach with a Broyden-Fletcher-Goldfarb-Shanno Hessian update [21]. The main

advantage of this method is that it has fast convergence and does not require the

calculation of the Hessian matrix. Instead, the Hessian is updated by analyzing the

gradient vectors. The derivation of the gradient of our criterion is shown in the sec-

tion to follow. The initial value for the kernel parameter is set to be the mean of the

distances between all pairwise training samples.
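A minimal sketch of this optimization step is shown below (our own code; it relies on SciPy's quasi-Newton BFGS routine with numerically estimated gradients rather than the analytic gradient derived in the next section, and `Q` stands for any callable evaluating the criterion for a given sigma).

```python
import numpy as np
from scipy.optimize import minimize

def optimize_sigma(X, y, Q):
    """Maximize the criterion Q(sigma, X, y) with a quasi-Newton (BFGS) search.

    The starting value is the mean pairwise (Euclidean) distance between the
    training samples, as described in the text.
    """
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    sigma0 = float(np.sqrt(d2).mean())                  # initial kernel parameter
    res = minimize(lambda s: -Q(float(s[0]), X, y),     # maximize Q = minimize -Q
                   x0=np.array([sigma0]), method='BFGS')
    return float(res.x[0])
```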

2.3.2 Derivation of the Gradient
 
We take $\phi(.)$ to be the RBF function, $k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{\sigma}\right)$, with $\sigma$ the parameter to be optimized. And, we consider the case where each class distribution is modeled by a single Gaussian distribution. The derivations for the subclass case follow immediately from the ones given below.

The gradient of our criterion $Q(.)$, when considering the RBF kernel, is given by

$$\frac{\partial Q(\sigma)}{\partial \sigma} = \frac{\partial (Q_1(\sigma) Q_2(\sigma))}{\partial \sigma} = \frac{\partial Q_1(\sigma)}{\partial \sigma} Q_2(\sigma) + Q_1(\sigma)\frac{\partial Q_2(\sigma)}{\partial \sigma}.$$

The partial derivative of $Q_1(\sigma)$ with respect to the RBF parameter $\sigma$ is

$$\frac{\partial Q_1(\sigma)}{\partial \sigma} = \frac{2}{C(C-1)}\sum_{i=1}^{C-1}\sum_{k=i+1}^{C}\frac{\left(\mathrm{tr}(\Sigma_i^{\phi 2}) + \mathrm{tr}(\Sigma_k^{\phi 2})\right)\frac{\partial\,\mathrm{tr}(\Sigma_i^{\phi}\Sigma_k^{\phi})}{\partial \sigma} - \mathrm{tr}(\Sigma_i^{\phi}\Sigma_k^{\phi})\left(\frac{\partial\,\mathrm{tr}(\Sigma_i^{\phi 2})}{\partial \sigma} + \frac{\partial\,\mathrm{tr}(\Sigma_k^{\phi 2})}{\partial \sigma}\right)}{\left(\mathrm{tr}(\Sigma_i^{\phi 2}) + \mathrm{tr}(\Sigma_k^{\phi 2})\right)^2}.$$

Note that $\Sigma_i^{\phi} = \phi(X_i)(I - \mathbf{1}_{n_i})\phi(X_i)^T$, where $\phi(X_i) = (\phi(x_{i1}), \dots, \phi(x_{in_i}))$ and $\mathbf{1}_{n_i}$ is a $n_i \times n_i$ matrix with all elements equal to $1/n_i$. Then, $\mathrm{tr}(\Sigma_i^{\phi}\Sigma_k^{\phi}) = \mathrm{tr}(\phi(X_i)(I - \mathbf{1}_{n_i})\phi(X_i)^T\phi(X_k)(I - \mathbf{1}_{n_k})\phi(X_k)^T) = \mathrm{tr}(K_{ki}(I - \mathbf{1}_{n_i}) K_{ik}(I - \mathbf{1}_{n_k}))$, where $K_{ik} = \phi(X_i)^T\phi(X_k)$. Let $\tilde{K}_{ki} = K_{ki}(I - \mathbf{1}_{n_i})$ and $\tilde{K}_{ik} = K_{ik}(I - \mathbf{1}_{n_k})$. We can rewrite this result as,

$$\mathrm{tr}(\tilde{K}_{ki}\tilde{K}_{ik}) = \sum_p \sum_q \tilde{K}_{ki}^{pq}\,\tilde{K}_{ik}^{qp},$$

where $\tilde{K}_{ki}^{pq}$ is the $(p,q)$th entry of $\tilde{K}_{ki}$. Denote the partial derivative of an $m \times n$ matrix $K$ with respect to $\sigma$ as $\frac{\partial K}{\partial \sigma} = \left[\frac{\partial K^{pq}}{\partial \sigma}\right]_{p=1,\dots,m,\ q=1,\dots,n}$, with $\frac{\partial K^{pq}}{\partial \sigma} = \frac{\partial k(x_p, x_q)}{\partial \sigma} = \frac{\|x_p - x_q\|_2^2}{\sigma^2}\exp\!\left(-\frac{\|x_p - x_q\|_2^2}{\sigma}\right)$ when using the RBF function. Then,

$$\frac{\partial\,\mathrm{tr}(\Sigma_i^{\phi}\Sigma_k^{\phi})}{\partial \sigma} = \frac{\partial\,\mathrm{tr}(\tilde{K}_{ki}\tilde{K}_{ik})}{\partial \sigma} = \sum_p\sum_q\left(\frac{\partial \tilde{K}_{ki}^{pq}}{\partial \sigma}\,\tilde{K}_{ik}^{qp} + \tilde{K}_{ki}^{pq}\,\frac{\partial \tilde{K}_{ik}^{qp}}{\partial \sigma}\right) = \sum_p\sum_q\left[\left(\frac{\partial K_{ki}}{\partial \sigma}(I - \mathbf{1}_{n_i})\right)^{pq}\tilde{K}_{ik}^{qp} + \tilde{K}_{ki}^{pq}\left(\frac{\partial K_{ik}}{\partial \sigma}(I - \mathbf{1}_{n_k})\right)^{qp}\right].$$

Next, note that $Q_2(\sigma)$ can be written as

$$Q_2(\sigma) = \sum_{i=1}^{C-1}\sum_{k=i+1}^{C} p_i\, p_k\, d_{ik},$$

where

$$d_{ik} = (\mu_i^{\phi} - \mu_k^{\phi})^T(\mu_i^{\phi} - \mu_k^{\phi}) = (\phi(X_i)\mathbf{1}_i - \phi(X_k)\mathbf{1}_k)^T(\phi(X_i)\mathbf{1}_i - \phi(X_k)\mathbf{1}_k) = \mathbf{1}_i^T K_{ii}\mathbf{1}_i - 2\,\mathbf{1}_i^T K_{ik}\mathbf{1}_k + \mathbf{1}_k^T K_{kk}\mathbf{1}_k.$$

Using this notation, the gradient of $Q_2(\sigma)$ with respect to $\sigma$ is

$$\frac{\partial Q_2(\sigma)}{\partial \sigma} = \sum_{i=1}^{C-1}\sum_{k=i+1}^{C} p_i\, p_k\,\frac{\partial d_{ik}}{\partial \sigma} = \sum_{i=1}^{C-1}\sum_{k=i+1}^{C} p_i\, p_k\left(\mathbf{1}_i^T\frac{\partial K_{ii}}{\partial \sigma}\mathbf{1}_i - 2\,\mathbf{1}_i^T\frac{\partial K_{ik}}{\partial \sigma}\mathbf{1}_k + \mathbf{1}_k^T\frac{\partial K_{kk}}{\partial \sigma}\mathbf{1}_k\right).$$
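The common building block in these expressions is the element-wise derivative of the RBF kernel matrix with respect to $\sigma$; a short sketch (our own) is:

```python
import numpy as np

def rbf_kernel_and_grad(X, sigma):
    """Return K with K[p, q] = exp(-||x_p - x_q||^2 / sigma) and dK/dsigma.

    For this kernel, dK[p, q]/dsigma = (||x_p - x_q||^2 / sigma**2) * K[p, q].
    """
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    d2 = np.maximum(d2, 0.0)                 # guard against tiny negative values
    K = np.exp(-d2 / sigma)
    dK = (d2 / sigma**2) * K
    return K, dK
```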

This result allows us to iteratively determine an appropriate solution. To see

that the solution found with such a gradient descent technique is an appropriate

one, recall that Theorem 3 showed Q1 monotonically increases if tr(SB ) > tr(SW ).

In most practical problems this condition is satisfied, since otherwise the classes

mostly overlap and the classification problem is not solvable (i.e., there is a very large

classification error in the original feature space). This means there is an identifiable

global maximum. We now note that the same applies to Q2 . That is, as long as the

class distributions do not overlap significantly, Q2 has a unique maximum for a sigma

value in between the averaged within class sample distances and the averaged between

class sample distances. To see this, note that for every Q2 calculated for a pair of

classes (i.e., classes 1 and 2), there are three main components: the sum of the kernel

matrix elements in class 1, in class 2, and between classes 1 and 2. Each of these

components monotonically increases with respect to sigma (starting with 1/n1 , 1/n2 ,

0, and converging to 1). The fastest increases occur for sigma around the averaged

distance in that component; e.g., for within class 1, this will be around the averaged

distance of the samples in that class. This means that the within class components

will converge earlier than the between class distances. Hence, the sum of the within

class subtracted with two times the between class elements (in the kernel matrix)

will result in a maximum in between the averaged within class sample distances and

between class sample distances.

In some applications where our conditions may not hold, it would be appropriate

to test a few starting values to determine the best solution. We did not require this

procedure in our experiments.

2.3.3 Generalization

A major goal in pattern recognition is to find classification criteria that have a

small generalization error, i.e., small expected error on the unobserved data. This

mainly depends on the number of samples in our training set, training error and the

model (criterion) complexity [42]. Since the training set is usually fixed, we are left

to select a proper model. Smooth (close to linear) classifiers have a small model com-

plexity but large training error. On the other hand, wiggly classifiers may have a small

training error but large model complexity. To have a small generalization error, we

need to select a model that has moderate training error and model complexity. Thus,

in general, the simpler the classifier, the smaller the generalization error. However, if

the classifier is too simple, the training error may be very large.

KDA is limited in terms of model complexity. This is mainly because KDA as-

sumes each class is represented with unimodal distributions. If there is a multimodal

structure in each class, KDA would select wiggly functions in order to minimize the

Figure 2.2: Here we show a two class classification problem with multi-modal class distributions. When $\sigma = 1$ both KDA (a) and KSDA (b) generate solutions that have small training error. (c) However, when the model complexity is small, $\sigma = 3$, KDA fails. (d) KSDA's solution resolves this problem with piecewise smooth, nonlinear classifiers.

classification error. To avoid this, the model complexity may be limited to smooth

solutions, which would generally result in large training errors and, hence, large gen-

eralization errors.

This problem can be solved by using an algorithm that considers multimodal

class representations, e.g., KSDA. While KDA can find wiggly functions to separate

multimodal data, KSDA can find several functions which are smoother and carry

smaller training errors. We can illustrate this theoretical advantage of KSDA with a

simple 2-class classification example, Fig. 2.2. In this figure, each class consists of 2

nonlinearly separable subclasses. Fig. 2.2(a) shows the solution of KDA obtained with

the RBF kernel with $\sigma = 1$. Fig. 2.2(b) shows the KSDA solution. KSDA can obtain a

classification function that has the same training error with smaller model complexity,

i.e., smoother classification boundaries. When we reduce the model complexity by

increasing $\sigma$ to 3, KDA leads to a large training error, Fig. 2.2(c). This does not

occur in KSDA, Fig. 2.2(d). A similar argument can be used to explain the problems

faced with Maximum Likelihood (ML) classification when modeling the original data

as a Mixture of Gaussians (MoG) in the original space. Unless one has access to a

Figure 2.3: The original data distributions are mapped to different kernel spaces via different mapping functions $\phi(.)$. $\phi_2(.)$ is better than $\phi_1(.)$ in terms of the Bayes error.

sufficiently large set (i.e., proportional to the number of dimensions of this original

feature space), the results will not generalize well.

2.4 Kernel Bayes accuracy criterion

The second criterion we will define in this chapter is directly related to the concept

of Bayes classification error. The idea is to learn the kernel parameters by finding a

kernel representation where the Bayes classification error is minimized across all the

mappings. This is illustrated in Figure 2.3. We start with an analysis of LDA. One

of the drawbacks of LDA is that its solution is biased toward those classes that are

furthest apart. To see this, note that LDA is based on least-squares (i.e., an eigenvalue

decomposition defined to solve a system of homogeneous equations [40]). Thus, the

LDA solution tends to over-weight the classes that were already well-separated in

the original space. In order to downplay the roles of the class distributions that are

farthest apart, [58] introduces a weighted version of SB , defined as


$$\hat{S}_B = \sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\Delta_{ij})(\mu_i - \mu_j)(\mu_i - \mu_j)^T, \qquad (2.6)$$

where $\Delta_{ij}^2 = (\mu_i - \mu_j)^T\Sigma_X^{-1}(\mu_i - \mu_j)$ is the Mahalanobis distance between classes $i$ and $j$, $\omega: \mathbb{R}_0^+ \rightarrow \mathbb{R}_0^+$ is a weighting function, $\omega(\Delta_{ij}) = \frac{1}{2\Delta_{ij}^2}\,\mathrm{erf}\!\left(\frac{\Delta_{ij}}{2\sqrt{2}}\right)$, and $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$ is the error function.

One advantage of (2.6) is that it is related to the mean pairwise Bayes accuracy

[58] (i.e., one minus the Bayes error), since


$$J(L) = \sum_{m=1}^{d}\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\Delta_{ij})\,\mathrm{tr}(e_m^T S_{ij} e_m), \qquad (2.7)$$

where $L = (e_1, \dots, e_d)$ is the eigenvector matrix of $\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\Delta_{ij}) S_{ij}$, $S_{ij} = (\mu_i - \mu_j)(\mu_i - \mu_j)^T$ are the pairwise class distances, and, for simplicity, we have assumed $\Sigma_X = I_p$, $I_p$ an identity matrix with dimension $p \times p$.
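A short sketch of this weighting function (our own code, using SciPy's error function) is:

```python
import numpy as np
from scipy.special import erf

def apac_weight(delta):
    """w(delta) = 1/(2 delta^2) * erf(delta / (2 sqrt(2))), as in (2.6)."""
    delta = np.asarray(delta, dtype=float)
    return erf(delta / (2.0 * np.sqrt(2.0))) / (2.0 * delta**2)

# Far-apart class pairs receive a much smaller weight than nearby ones
print(apac_weight([0.5, 2.0, 8.0]))
```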

2.4.1 Bayes accuracy in the kernel space

As mentioned above, (2.7) is proportional to the Bayes accuracy and as such it

can be employed to improve LDA [58]. We want to derive a similar function for its

use in the kernel space.

Let $\phi(.): \mathbb{R}^p \rightarrow \mathcal{F}$ be a function defining the kernel map. We also assume the data

has already been whitened in the kernel space. Denote the data matrix in the kernel

space $\phi(X)$, where $\phi(X) = (\phi(x_{11}), \dots, \phi(x_{in_i}), \dots, \phi(x_{Cn_C}))$. The kernel matrix is given by $K = \phi(X)^T\phi(X)$.

Using this notation, the covariance matrix in the kernel space can be written as $\Sigma_X^{\phi} = \frac{1}{n}\phi(X)(I_n - P_n)\phi(X)^T$, where $I_n$ is the $n \times n$ identity matrix, and $P_n$ is a $n \times n$ matrix with all elements equal to $1/n$. The whitened data matrix $\tilde{\phi}(X)$ is now given by $\tilde{\phi}(X) = \Lambda^{-\frac{1}{2}} V^T\phi(X)$, where $\Lambda$ and $V$ are the eigenvalue and eigenvector matrices given by $\Sigma_X^{\phi} V = V\Lambda$. We know from the Representer's Theorem [96] that a projection vector lies in the span of the samples in the kernel space $\phi(X)$, i.e., $V = \phi(X)\Gamma$, where $\Gamma$ is a corresponding coefficient matrix. Thus, we have

$$\tilde{\phi}(X) = \Lambda^{-\frac{1}{2}} V^T\phi(X) = \Lambda^{-\frac{1}{2}}\Gamma^T\phi(X)^T\phi(X) = \Lambda^{-\frac{1}{2}}\Gamma^T K,$$

where $\Gamma$ and $\Lambda$ can be calculated from the generalized eigenvalue decomposition problem $N\Gamma = K\Gamma\Lambda$, with $N = \frac{1}{n} K(I_n - P_n)K$. With this trick, we transform the kernel covariance matrix $\Sigma_X^{\phi}$ into the identity matrix.

Next, define the mean of class i in the kernel space as

$$\tilde{\mu}_i = \tilde{\phi}(X_i)\mathbf{1}_i, \qquad (2.8)$$

where $\tilde{\phi}(X_i) = (\tilde{\phi}(x_{i1}), \dots, \tilde{\phi}(x_{in_i}))$, and $\mathbf{1}_i$ is a $n_i \times 1$ vector with all elements equal to $1/n_i$. Let $\tilde{K}_i = \tilde{\phi}(X)^T\tilde{\phi}(X_i)$ denote the subset of the whitened kernel matrix for the samples in class $i$.

Combining the above results, we can define the Bayes accuracy in the kernel space as

$$Q(\phi) = \sum_{m=1}^{d}\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\tilde{\Delta}_{ij})\, e_m^T S_{ij}^{\phi}\, e_m, \qquad (2.9)$$

where $e_1, \dots, e_d$ are the eigenvectors of the weighted kernel between-class scatter matrix

$$\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\tilde{\Delta}_{ij})\, S_{ij}^{\phi},$$

$S_{ij}^{\phi} = (\tilde{\mu}_i - \tilde{\mu}_j)(\tilde{\mu}_i - \tilde{\mu}_j)^T$, the Mahalanobis distance $\tilde{\Delta}_{ij}$ in the whitened kernel space becomes the Euclidean distance,

$$\tilde{\Delta}_{ij}^2 = (\tilde{\mu}_i - \tilde{\mu}_j)^T(\tilde{\mu}_i - \tilde{\mu}_j) = (\tilde{\phi}(X_i)\mathbf{1}_i - \tilde{\phi}(X_j)\mathbf{1}_j)^T(\tilde{\phi}(X_i)\mathbf{1}_i - \tilde{\phi}(X_j)\mathbf{1}_j) = \mathbf{1}_i^T\tilde{K}_{ii}\mathbf{1}_i - 2\,\mathbf{1}_i^T\tilde{K}_{ij}\mathbf{1}_j + \mathbf{1}_j^T\tilde{K}_{jj}\mathbf{1}_j, \qquad (2.10)$$

and $\tilde{K}_{ij} = \tilde{\phi}(X_i)^T\tilde{\phi}(X_j)$ is the subset of the kernel matrix for the samples in classes $i$ and $j$.

From the Representer's Theorem [96], we know that $e_i = \tilde{\phi}(X) u_i$, where $u_i$ is a coefficient vector. Then, using (2.8) we have $e_m^T S_{ij}^{\phi} e_m = u_m^T S_{ij} u_m$, where $S_{ij} = (\tilde{K}_i\mathbf{1}_i - \tilde{K}_j\mathbf{1}_j)(\tilde{K}_i\mathbf{1}_i - \tilde{K}_j\mathbf{1}_j)^T$, and $u_1, \dots, u_d$ are the eigenvectors of $\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\tilde{\Delta}_{ij}) S_{ij}$. Therefore, criterion (2.9) can be rewritten as

$$Q(\phi) = \sum_{m=1}^{d}\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\tilde{\Delta}_{ij})\, u_m^T S_{ij}\, u_m. \qquad (2.11)$$

By maximizing $Q(\phi)$, we favor a kernel representation where the sum of pairwise Bayes accuracies is maximized. The optimal kernel function, $\phi^*$, is given by

$$\phi^* = \arg\max_{\phi} Q(\phi).$$

We will refer to the derived criterion given in (2.11) as the Kernel Bayes Accuracy (KBA) criterion.
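As an illustration of how (2.10) and (2.11) can be evaluated, the following sketch (our own code; it assumes `K` already holds the whitened kernel matrix and uses the fact that the sum over the leading eigenvectors of the weighted scatter equals the sum of its largest eigenvalues) computes the KBA value.

```python
import numpy as np
from scipy.special import erf

def kba_criterion(K, y, d):
    """Evaluate the KBA criterion (2.11) from a whitened kernel matrix K (n x n)."""
    n = K.shape[0]
    classes = np.unique(y)
    M = np.zeros((n, n))                 # weighted kernel between-class scatter
    for a, ci in enumerate(classes):
        for cj in classes[a + 1:]:
            ii, jj = np.where(y == ci)[0], np.where(y == cj)[0]
            one_i = np.full(len(ii), 1.0 / len(ii))
            one_j = np.full(len(jj), 1.0 / len(jj))
            d2 = (one_i @ K[np.ix_(ii, ii)] @ one_i
                  - 2 * one_i @ K[np.ix_(ii, jj)] @ one_j
                  + one_j @ K[np.ix_(jj, jj)] @ one_j)        # eq. (2.10)
            delta = np.sqrt(max(d2, 1e-12))
            w = erf(delta / (2 * np.sqrt(2))) / (2 * delta**2)
            diff = K[:, ii] @ one_i - K[:, jj] @ one_j        # K_i 1_i - K_j 1_j
            M += (len(ii) / n) * (len(jj) / n) * w * np.outer(diff, diff)
    evals = np.linalg.eigvalsh(M)
    return float(np.sum(evals[-d:]))                          # top-d eigenvalues
```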

2.4.2 Kernel parameters with gradient ascent

The first application of the above derived criterion is in determining the value

of the parameters of a kernel function. For example, if we are given the Radial Basis Function (RBF) kernel, $k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, our goal is to determine an appropriate value of the variance $\sigma$.

To determine our solution, we employ a quasi-Newton method with a Broyden-

Fletcher-Goldfarb-Shanno Hessian update [21], yielding a fast convergence.

To compute the derivative of our criterion, note that (2.11) can be rewritten as

$$Q(\sigma) = \mathrm{tr}\!\left(\sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\tilde{\Delta}_{ij})\, S_{ij}\right) = \sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\,\omega(\tilde{\Delta}_{ij})\,\mathrm{tr}(S_{ij}).$$

Taking the partial derivative with respect to $\sigma$ in the RBF kernel, we have

$$\frac{\partial Q(\sigma)}{\partial \sigma} = \sum_{i=1}^{C-1}\sum_{j=i+1}^{C} p_i\, p_j\left(\frac{\partial\omega(\tilde{\Delta}_{ij})}{\partial \sigma}\,\mathrm{tr}(S_{ij}) + \omega(\tilde{\Delta}_{ij})\,\frac{\partial\,\mathrm{tr}(S_{ij})}{\partial \sigma}\right).$$

Denote the partial derivative of an $m \times n$ matrix $K$ with respect to $\sigma$ as $\frac{\partial K}{\partial \sigma} = \left[\frac{\partial K^{ij}}{\partial \sigma}\right]_{i=1,\dots,m,\ j=1,\dots,n}$, with $\frac{\partial K^{ij}}{\partial \sigma} = \frac{\partial k(x_i, x_j)}{\partial \sigma} = \frac{\|x_i - x_j\|^2}{\sigma^3}\exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$. Then

$$\frac{\partial\omega(\tilde{\Delta}_{ij})}{\partial \sigma} = \left(-\frac{\mathrm{erf}(\tilde{\Delta}_{ij}/2\sqrt{2})}{\tilde{\Delta}_{ij}^3} + \frac{\exp(-\tilde{\Delta}_{ij}^2/8)}{2\sqrt{2\pi}\,\tilde{\Delta}_{ij}^2}\right)\frac{\partial\tilde{\Delta}_{ij}}{\partial \sigma},$$

where $\frac{\partial\tilde{\Delta}_{ij}}{\partial \sigma} = \frac{1}{2\tilde{\Delta}_{ij}}\left(\mathbf{1}_i^T\frac{\partial\tilde{K}_{ii}}{\partial \sigma}\mathbf{1}_i - 2\,\mathbf{1}_i^T\frac{\partial\tilde{K}_{ij}}{\partial \sigma}\mathbf{1}_j + \mathbf{1}_j^T\frac{\partial\tilde{K}_{jj}}{\partial \sigma}\mathbf{1}_j\right)$. Finally,

$$\frac{\partial\,\mathrm{tr}(S_{ij})}{\partial \sigma} = \frac{\partial(\tilde{K}_i\mathbf{1}_i - \tilde{K}_j\mathbf{1}_j)^T(\tilde{K}_i\mathbf{1}_i - \tilde{K}_j\mathbf{1}_j)}{\partial \sigma} = \mathbf{1}_i^T\frac{\partial\tilde{K}_i^T}{\partial \sigma}\tilde{K}_i\mathbf{1}_i + \mathbf{1}_i^T\tilde{K}_i^T\frac{\partial\tilde{K}_i}{\partial \sigma}\mathbf{1}_i - 2\,\mathbf{1}_j^T\frac{\partial\tilde{K}_j^T}{\partial \sigma}\tilde{K}_i\mathbf{1}_i - 2\,\mathbf{1}_j^T\tilde{K}_j^T\frac{\partial\tilde{K}_i}{\partial \sigma}\mathbf{1}_i + \mathbf{1}_j^T\frac{\partial\tilde{K}_j^T}{\partial \sigma}\tilde{K}_j\mathbf{1}_j + \mathbf{1}_j^T\tilde{K}_j^T\frac{\partial\tilde{K}_j}{\partial \sigma}\mathbf{1}_j.$$

2.4.3 Subclass extension

Another application of the derived KBA criterion is in determining the number

of subclasses in Subclass Discriminant Analysis (SDA) [116] and its kernel extension.

KDA assumes that each class has a single Gaussian distribution in the kernel space.

However, this may be too restrictive since it is usually difficult to find a kernel rep-

resentation where the class distributions are single Gaussians. In order to relax this

assumption, we can describe each class using a mixture of Gaussians. Using this idea,

we can reformulate (2.11) as


$$Q^{sub}(\phi, H_1, \dots, H_C) = \sum_{m=1}^{d}\sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\,\omega(\tilde{\Delta}_{ij,kl})\, u_m^T S_{ij,kl}\, u_m, \qquad (2.12)$$

where $H_i$ is the number of subclasses in class $i$, $u_1, \dots, u_d$ are $d$ eigenvectors of the kernel version of the weighted between-subclass scatter matrix

$$\sum_{i=1}^{C-1}\sum_{j=1}^{H_i}\sum_{k=i+1}^{C}\sum_{l=1}^{H_k} p_{ij}\, p_{kl}\,\omega(\tilde{\Delta}_{ij,kl})\, S_{ij,kl},$$

$S_{ij,kl} = (M_{ij}\mathbf{1}_{ij} - M_{kl}\mathbf{1}_{kl})(M_{ij}\mathbf{1}_{ij} - M_{kl}\mathbf{1}_{kl})^T$, $M_{ij} = \tilde{\phi}(X)^T\tilde{\phi}(X_{ij})$, $\tilde{\phi}(X_{ij}) = (\tilde{\phi}(x_{ij1}), \dots, \tilde{\phi}(x_{ijn_{ij}}))$, $x_{ijk}$ is the $k$th sample of subclass $j$ in class $i$, $\mathbf{1}_{ij}$ is a $n_{ij} \times 1$ vector with all elements equal to $1/n_{ij}$, and $n_{ij}$ the number of samples in the $j$th subclass of class $i$. Note that in the above equation, the whitened Mahalanobis distance is given by

$$\tilde{\Delta}_{ij,kl}^2 = (\tilde{\mu}_{ij} - \tilde{\mu}_{kl})^T(\tilde{\mu}_{ij} - \tilde{\mu}_{kl}) = \mathbf{1}_{ij}^T\tilde{K}_{ij,ij}\mathbf{1}_{ij} - 2\,\mathbf{1}_{ij}^T\tilde{K}_{ij,kl}\mathbf{1}_{kl} + \mathbf{1}_{kl}^T\tilde{K}_{kl,kl}\mathbf{1}_{kl},$$

where $\tilde{K}_{ij,kl} = \tilde{\phi}(X_{ij})^T\tilde{\phi}(X_{kl})$. The optimal kernel function and subclass divisions are given by

$$\phi^*, H_1^*, \dots, H_C^* = \arg\max_{\phi, H_1, \dots, H_C} Q^{sub}(\phi, H_1, \dots, H_C).$$

2.4.4 Optimal subclass discovery

In KSDA we are simultaneously optimizing the kernel parameter and the number

of subclasses. It is in fact advantageous to do so, because it will allow us to find

the Bayes optimal solution when the classes need to be described with a mixture

of Gaussians in the kernel space. Furthermore, we can automatically determine the

underlying structure of the data. This last point is important in many applications.

We illustrate this with a set of examples.

In our case study, we generated a set of 120 samples for each of the two classes.

Each class was represented by a mixture of two Gaussians, with mean and diagonal

covariance randomly initialized. Then, (2.12) was employed to determine the ap-

propriate number of subclasses and parameter of the RBF kernel. This process was

repeated 100 times, each with a different random initialization of the means and co-

variances. The average of the maxima of (2.12) for each value of Hi (with H1 = H2 )

are shown in Fig. 2.4(a). We see that the derived criterion is on average higher for

the correct number of subclasses. We then repeated the process described in this

paragraph for the cases of 3, 4 and 5 subclasses per class. The results are in Fig.

2.4(b-d). Again, the maximum of (2.12) corresponds to the correct number of sub-

classes. Therefore, the proposed criterion can generally be efficiently employed to

discover the underlying structure of the data. For comparison, in Fig. 2.4(e-h) we

show the plots of the Fisher criterion described earlier. We see that this criterion does

not recover the correct number of subclasses and is generally monotonically increas-

ing, thus, tending to select large values for Hi . This is because the Fisher criterion

maximizes the between-subclass scatter and, generally, the larger Hi , the larger the

scatter.

As a more challenging case, we also consider the well-known XOR data classifi-

cation problem, Fig. 2.5(a). The values of (2.12) for different Hi are plotted in Fig.

2.5(b) and those of the Fisher criterion in (c). Once more, we see that the KBA

criterion is capable of accurately recovering the number of subclasses, whereas the

Fisher criterion is not.

Figure 2.4: Comparative results between the (a-d) KBA and (e-h) Fisher criteria. The true underlying number of subclasses per class are (a,e) 2, (b,f) 3, (c,g) 4, and (d,h) 5. The x-axis specifies the number of subclasses $H_i$. The y-axis shows the value of the criterion given in (2.12) in (a-d) and of the Fisher criterion in (e-h).

Figure 2.5: (a) The classical XOR classification problem. (b) Plot of the KBA criterion versus $H_i$. (c) Plot of the Fisher criterion.

2.5 Experimental Results

2.5.1 Homoscedastic criterion

In this section, we will use our homoscedastic criterion to optimize the kernel

parameter of KDA, KNDA and KSDA. We will give comparative results with CV,

the Fisher criterion of [98], the use of the Bregman divergence, and other nonlinear

methods (Kernel PCA (KPCA), HLDA and LPP) and related linear approaches (LDA, NDA, RDA, SDA, and aPAC). The dimensionality of the reduced space is

taken to be the rank of the matrices used by the DA approach and to keep 90% of the

variance in PCA and KPCA. We also provide comparisons with Kernel Support Vector

Machines (KSVM) [93] and the use of ML in MoG [65], two classical alternatives for

nonlinear classification.

Databases and notation

The first five data-sets are from the UCI repository [7]. The Monk problem is given

by a 6-dimensional feature space defining six joints of a robot and two classes. Three

different case scenarios are considered, denoted Monk 1, 2 and 3. The Ionosphere

set corresponds to satellite imaging for the detection of two classes (structure or not)

in the ground. And, in the NIH Pima set, the goal is to detect diabetes from eight

measurements.

We also use the ETH-80 [53] database. It includes a total of 3, 280 images of the

following 8 categories: apples, pears, cars, cows, horses, dogs, tomatoes and cups.

Each category includes 10 objects (e.g., ten apples), Figure 2.6. Each of the (80)

objects has been photographed from 41 orientations. We resized all the images to

$25 \times 30$ pixels. The pixel values in their vector form ($x \in \mathbb{R}^{750}$) are used in the


Figure 2.6: Shown here are (a) 8 categories in ETH-80 database and (b) 10 different objects
for the cow category.

appearance-based recognition approach. As it is typical in this database, we will use

the leave-one-object-out test. That is, the images of 79 objects are used for training,

those of the remaining object for testing. We test all options and calculate the average

recognition rate.

We also use 100 randomly selected subjects from the AR face database [61]. All

images are first aligned with respect to their eyes, mouth and jaw line before cropping

and resizing them to a standard size of $29 \times 21$ pixels. This database contains images

of two different sessions, each taken two weeks apart. The images in the first and

second session contain the same facial expressions and occlusions and were taken

under the same illumination conditions. We use the images in the first session for

training and those in the second session for testing.

We also use the Sitting Posture Distribution Maps data-set (SPDM) of [117].

Here, samples were collected using a chair equipped with a pressure sensor sheet

located on the sit-pan and back-rest. The pressure maps provide a total of 1, 280

pressure values. The database includes samples of 50 individuals. Each participant

provided five samples of each of the ten different postures. Our goal is to classify

each of the samples into one of the ten sitting postures. This task is made difficult by

the nonparametric nature of the samples in each class [117]. We randomly selected 3

samples from each individual and posture for training, and used the rest for testing.

The Modified National Institute of Standards and Technology (MNIST) database

of [52] is a large collection of various sets of handwritten digit (0-9). The training

set consists of 60,000 samples. The test set has 10,000 samples. All the digits have

been size-normalized to 28 28. We randomly select 30,000 samples for training,

with 3,000 samples in each class. This is done to reduce the size of the Gram matrix,

allowing us to run the algorithm on a desktop.

As defined above, we employ the RBF kernel. The kernel parameter in KPCA

is optimized with CV. CV is also used in KDA, KNDA and KSDA, denoted: KDACV ,

KNDACV and KSDACV . The kernel parameter is searched in the range [m - 2st, m +

2st], where m and st are the mean and standard deviation of the distances between all

pairwise training samples. We use 10-fold cross validation in the UCI data-sets and

5-fold cross validation in the others. In KNDA and KSDA, the number of nearest

neighbors and subclasses are also optimized. In KSDA, we test partitions from 1

to 10 subclasses. We also provide comparative results when optimizing with the

approach of [98], denoted: KDAF , KNDAF and KSDAF . The two parameters of

LPP (i.e., the number of nearest neighbors, and the heat kernel) are optimized with

CV. The DA algorithms with our Homoscedastic-based optimization will be denoted:

KDAH , KNDAH and KSDAH . The same algorithms optimized using Bregman are

denoted: KDAB , KNDAB and KSDAB .

Results

The algorithms summarized above are first employed to find the subspace where

the feature vectors of different classes are most separated according to the algorithm's

criterion. In the reduced space we employ a variety of classification methods.

In our first experiment, we use the nearest mean (NM) classifier. The NM is

an ideal classifier because it provides the Bayes optimal solution whenever the class

distributions are homoscedastic Gaussians [32]. Thus, the results obtained with the

NM will illustrate whether the derived criterion has achieved the desirable goal. The

results are shown in Table 2.1. We see that the kernel algorithms optimized with the

proposed Homoscedastic-based criterion generally obtain higher classification rates.

To further illustrate this point, the table includes a rank of the algorithms following

the approach of [20]. As predicted by our theory, the additional flexibility of KSDA

allows it to achieve the best results.

Our second choice of classifier is the classical nearest neighbor (NN) algorithm.

Its classification error is known to be less than twice the Bayes error. This makes it

appropriate for the cases where the class distributions are not homoscedastic. These

results are in Table 2.2. A recently proposed classification algorithm [75] emphasizes

smoother classification boundaries in the NN framework. This algorithm is based

on the approximation of the nonlinear decision boundary using the sample points

closest to the classification boundary. The classification boundary is smoothed using

Tikhonov regularization. Since our criterion is used to make the classifier in the kernel

space as linear as possible, smooth (close to linear) classifiers are consistent with this

goal and should generally lead to better results. We present the results obtained with

this alternative approach in Table 2.3.

Table 2.1: Recognition rates (in percentages) with nearest mean
Data set ksdaH ksdaF ksdaB ksdaCV kdaH kdaF kdaB kdaCV kndaH kndaF kndaB kndaCV
ETH-80 82.6* 73.5 61.7 77.4 82.6* 81.6 61.7 71.6 76.2 74.6 65.6 73.6
AR database 88.1* 78.2 65.5 84.2 87.5* 86.7 69.5 84.2 71.3 61.4 72.5 74.3
SPDM 84.6* 80.1 67.9 83.9* 84.6* 83.2 67.9 83.3 82.4 82.9 53.4 75.6
Monk1 88.2* 85.0 71.1 88.0* 84.0 89.6* 65.3 83.1 70.1 65.7 50.0 63.4
Monk2 76.6 82.2* 56.7 74.5 80.1 75.2 55.6 70.1 73.5 64.8 61.8 71.8
Monk3 96.3* 88.7 85.4 94.0 93.1 89.7 85.7 82.4 67.6 63.7 77.8 66.4
Ionosphere 93.4 84.8 88.1 96.0* 93.4 86.1 67.6 80.8 74.8 62.3 65.6 78.2
Pima 80.4* 77.4 70.2 80.4* 78.6 75.0 75.0 72.6 65.5 67.3 70.8 66.7
Mnist 98.0* 96.9 92.0 97.4 98.1* 96.6 92.0 97.2 94.6 94.3 93.1 96.4
Rank 1.9* 7.0 13.3 3.6 2.8 5.4 14.2 9.2 12.2 14.7 15.8 13.3
Data set mog ksvm kpca pca lda nda apac hlda lpp rda sda
ETH-80 69.2 81.8 56.9 56.5 63.3 64.9 64.0 58.2 65.9 71.6 70.9
AR database 75.5 86.7 42.2 24.0 79.3 69.7 24.2 67.4 46.2 78.6 79.3
SPDM 73.4 84.7* 62.6 66.4 44.5 52.5 65.3 68.0 54.7 59.5 69.3
Monk1 80.3 83.6 67.4 66.0 64.6 64.8 66.0 66.2 44.4 72.0 66.7
Monk2 75.9 82.6 53.7 53.5 55.1 60.0 53.5 53.5 48.6 60.0 55.1
Monk3 89.4 93.5 78.9 80.6 63.9 81.3 80.6 81.3 75.5 86.3 80.8
Ionosphere 82.1 96.0 89.4 62.3 57.0 92.1 62.3 90.1 55.0 82.8 90.1
Pima 75.0 79.2 50.0 56.0 61.3 74.4 56.0 77.4 67.9 66.7 61.3
Mnist 88.6 97.6* 80.6 82.2 86.7 85.9 82.2 85.5 80.1 87.0 88.2
Rank 9.8 2.7 18.0 19.1 18.3 14.4 18.4 14.1 19.9 12.4 12.8

Note that the results obtained with the Homoscedastic criterion are generally better than
those given by the Fisher, Bregman and CV criteria. The best of the three results in each
of the discriminant methods is bolded. The symbol * is used to indicate the top result
among all algorithms. Rank goes from smallest (best) to largest.

Finally, recall that the goal of the Homoscedastic criterion is to make the Bayes

classifier in the kernel space linear. If this goal were achieved, one would expect a

linear classifier such as linear Support Vector Machines (SVM) to yield good classifi-

cation results in the corresponding subspace. We verified this hypothesis in our final

experiment, Table 2.4.

As mentioned earlier, the advantage of the proposed criterion is not only that it

achieves higher classification rates, but that it does so at a lower computational cost,

Table 2.5. Note that the proposed approach generally reduces the running time by

one order of magnitude.

Table 2.2: Recognition rates (%) with nearest neighbor
Data set ksdaH ksdaF ksdaB ksdaCV kdaH kdaF kdaB kdaCV kndaH kndaF kndaB kndaCV
ETH-80 82.8* 73.6 62.3 76.8 82.8* 81.0 62.3 71.6 76.2 74.6 68.0 70.6
AR database 96.7* 78.3 66.9 84.2 88.3 87.5 71.3 84.2 69.2 64.2 70.6 70.2
SPDM 84.9* 80.1 68.2 83.7 84.9* 84.2 68.2 83.3 73.9 75.6 33.5 70.3
Monk1 89.1* 84.5 78.2 87.5 84.3 89.6* 72.5 83.1 78.2 77.1 74.5 72.2
Monk2 77.8 83.1 86.1 75.7 80.1 75.2 77.6 70.1 85.0* 81.0 79.9 78.5
Monk3 94.4* 87.7 81.5 89.8 93.5 88.0 89.4 82.4 82.1 81.3 77.6 80.3
Ionosphere 94.4 84.8 91.4 94.0 94.4 86.5 70.9 80.8 87.4 86.1 90.1 86.1
Pima 75.0 73.8 66.7 76.8 70.2 69.8 64.9 72.6 67.3 67.3 66.1 69.1
Mnist 97.8* 96.9 91.8 97.2 97.2 97.1 91.8 96.7 95.6 95.4 92.1 95.5
Rank 2.9* 8.0 13.6 5.3 3.7 7.7 15.4 10.8 11.3 12.7 15.7 14.1
Data set mog ksvm kpca pca lda nda apac hlda lpp rda sda
ETH-80 69.2 81.8 62.2 64.3 64.3 59.8 73.6 56.5 63.6 71.6 70.6
AR database 75.5 86.7 42.5 58.6 77.7 77.0 59.1 67.5 41.8 78.6 77.7
SPDM 73.4 84.7 75.0 81.5 66.5 48.8 81.1 65.3 54.1 59.5 66.1
Monk1 80.3 83.6 90.3* 81.3 69.0 68.3 81.0 84.2 61.6 72.0 75.7
Monk2 75.9 82.6 68.3 66.7 67.4 82.6 79.6 83.6 82.4 60.0 67.4
Monk3 89.4 93.5 87.8 87.3 70.6 83.6 88.4 84.5 80.6 86.3 85.9
Ionosphere 82.1 96.0* 89.4 92.1 74.8 88.8 92.1 88.7 68.2 82.8 93.4
Pima 75.0 79.2 56.0 64.3 57.7 69.1 62.5 68.5 66.8 66.7 57.7
Mnist 88.6 97.6 94.1 90.1 89.7 85.6 89.3 80.6 96.0 87.0 93.7
Rank 12.6 3.2 14.2 14.3 18.4 15.4 12.7 14.3 17.4 16.2 14.1

Table 2.3: Recognition rates (%) with the smooth nearest-neighbor classifier
Data set ksdaH ksdaF ksdaB ksdaCV kdaH kdaF kdaB kdaCV kndaH kndaF kndaB kndaCV
ETH-80 83.5* 73.9 62.3 76.4 83.5* 82.8 62.3 72.9 76.2 74.2 68.2 71.2
AR database 96.6* 78.5 66.9 85.1 90.6 86.7 71.3 85.1 70.9 63.2 70.6 72.6
SPDM 84.3* 75.3 68.2 83.9* 84.3* 83.4 68.2 82.6 75.6 77.9 35.6 71.5
Monk1 90.2* 76.6 71.5 82.9 89.6 87.7 72.2 88.7 65.2 62.0 61.4 62.3
Monk2 83.3* 77.5 60.6 75.7 80.6 82.9 73.8 78.5 74.1 64.8 62.3 56.9
Monk3 94.6* 83.3 86.1 86.3 93.5 92.4 89.4 91.2 68.5 64.8 85.4 66.2
Ionosphere 94.3 84.8 84.8 86.1 94.3 86.8 80.1 86.8 80.8 82.8 77.5 78.1
Pima 80.4* 76.8 79.2 76.2 78.6 73.0 64.9 69.0 72.0 67.9 69.0 67.9
Mnist 97.8* 96.9 91.8 97.3 97.2 97.2 91.8 96.7 95.6 95.4 92.1 95.6
Rank 1.2* 9.4 14.4 6.7 2.7 4.6 15 6.9 14.2 14.7 17.6 16.1
Data set mog ksvm kpca pca lda nda apac hlda lpp rda sda
ETH-80 69.2 81.8 60.3 67.1 64.3 63.5 71.2 59.1 64.3 71.6 72.3
AR database 75.5 86.7 49.5 44.5 70.9 77.3 60.2 67.5 35.5 78.6 70.9
SPDM 73.4 84.7 75.1 77.0 56.2 50.2 81.2 53.4 50.2 59.5 69.5
Monk1 80.3 83.6 77.3 78.2 67.4 77.8 69.4 71.5 59.0 72.0 79.2
Monk2 75.9 82.6 58.6 56.7 70.6 70.6 70.4 58.3 72.0 60.0 70.6
Monk3 89.4 93.5 91.2 89.7 70.8 91.9 89.6 93.8 87.0 86.3 90.5
Ionosphere 82.1 96.0* 82.1 82.1 74.8 83.4 91.1 94.0 62.9 82.8 89.4
Pima 75.0 79.2 60.7 70.2 57.7 70.2 63.8 72.6 66.1 66.7 57.7
Mnist 88.6 97.6 94.1 90.1 89.8 86.0 89.4 82.6 96.1 87.0 93.5
Rank 11.6 2.7 15.6 14.6 17.8 13.4 13.9 14.9 18.1 14.9 11.9

55
Table 2.4: Recognition rates (%) with linear SVM
Data set ksdaH ksdaF ksdaB ksdaCV kdaH kdaF kdaB kdaCV kndaH kndaF kndaB kndaCV
ETH-80 83.0* 73.6 61.9 77.4 83.0* 82.2 61.9 71.3 75.6 75.2 65.6 74.6
AR database 88.1* 79.6 65.5 83.1 87.5* 86.7 69.5 83.1 79.4 75.7 72.5 78.6
SPDM 82.1 84.6* 67.5 82.3 82.1 83.6 67.5 82.6 82.2 82.9 52.7 84.0
Monk1 89.1* 88.2 50.0 86.1 84.7 89.7* 52.1 86.1 69.9 62.5 50.0 63.4
Monk2 77.1 81.5 67.1 73.8 80.1 75.2 67.1 75.1 67.1 83.1* 67.1 67.1
Monk3 95.6* 91.9 47.2 94.4 92.8 89.1 47.2 81.5 81.7 81.7 47.2 81.0
Ionosphere 93.4 86.1 82.1 96.7* 93.4 86.1 82.1 82.1 82.1 82.1 82.1 82.1
Pima 79.8* 78.6 64.9 79.8* 78.0* 75.0 64.3 72.8 64.3 64.3 64.3 64.3
Mnist 97.9 96.9 92.0 97.3 98.1* 96.7 92.0 97.2 94.7 94.3 93.3 96.2
Rank 2.8* 5.6 17.8 4.3 4.1 5.8 17.7 9.5 11.9 11.6 17.3 13.0
Data set mog ksvm kpca pca lda nda apac hlda lpp rda sda
ETH-80 69.2 81.8 65.3 60.1 65.3 61.8 68.4 68.4 62.1 71.6 67.8
AR database 75.5 86.7 42.1 66.7 79.3 69.7 67.2 70.1 44.2 78.6 79.3
SPDM 73.4 84.7* 66.7 76.5 50.3 49.0 82.1 69.3 45.5 59.5 69.0
Monk1 80.3 83.6 88.4* 67.8 65.6 66.4 67.8 68.5 44.9 72.0 66.7
Monk2 75.9 82.6 50.0 67.1 67.1 67.5 65.6 67.1 67.1 60.0 67.1
Monk3 89.4 93.5 94.4 81.3 63.9 83.3 80.6 81.9 78.5 86.3 84.7
Ionosphere 82.1 96.0 82.1 84.8 84.8 88.1 93.4 93.4 82.1 82.8 90.1
Pima 75.0 79.2 64.3 68.6 64.9 76.8 77.4 76.2 76.2 66.7 64.9
Mnist 88.6 97.6 81.0 82.2 86.9 85.9 83.1 85.4 80.1 87.0 88.2
Rank 11.5 3.3 16.1 16.1 15.8 14.6 13.6 12.5 19.0 13.8 12.9

Table 2.5: Training time (in seconds)


Data set ksdaH ksdaCV kdaH kdaCV kndaH kndaCV ksvm
ETH-80 7.3x10^4 3.6x10^5 1.8x10^3 9.0x10^4 7.9x10^4 8.5x10^5 1.8x10^4
AR database 4.2x10^4 3.5x10^5 3.1x10^3 9.0x10^4 1.5x10^4 1.7x10^5 1.2x10^4
SPDM 1.8x10^4 6.5x10^4 1.8x10^2 4.6x10^4 2.1x10^4 1.6x10^5 9.6x10^3
Monk1 4.4 51.3 0.7 6.8 26.4 504.8 3.7
Monk2 4.6 88.1 1.2 11.5 41.3 978.1 17.8
Monk3 3.2 50.7 0.7 6.4 23.1 516.0 2.2
Ionosphere 6.6 134.8 1.3 15.7 76.6 1479.5 10.1
Pima 80.2 2521.7 12.1 380.1 374.4 10889.7 150.6
MNIST 3.6x10^5 2.0x10^6 1.9x10^5 1.1x10^6 3.2x10^5 4.6x10^6 4.5x10^5

2.5.2 KBA criterion

We now present results using the KBA criterion. We use this criterion in KDA and

KSDA. We use the notation KDAK and KSDAK to indicate that the KBA criterion

was used to optimize the parameters.

Table 2.6: Recognition rates (%) with nearest neighbor. Bold numbers specify the top
recognition rate obtained with the three criteria in KSDA and KDA. An asterisk indicates
statistical significance of the highest recognition rate.

Data set ksdaK ksdaF ksdaCV kdaK kdaF kdaCV kpca


ETH-80 84.6* 73.6 76.8 84.6* 81.0 71.6 62.2
AR database 88.2* 78.3 84.2 86.1 87.5 84.2 42.5
SPDM 84.3* 80.1 83.7 84.3* 84.2 83.3 75.0
Monk1 88.0 84.5 87.5 87.3 89.6* 83.1 90.3*
Monk2 82.9* 83.1* 75.7 82.9* 75.2 70.1 68.3
Monk3 94.2* 87.7 89.8 92.6 88.0 82.4 87.8
Ionosphere 93.0 84.8 94.0* 89.1 86.5 80.8 89.4
Pima 73.2 73.8 76.8* 76.2* 69.8 72.6 56.0
Data set pca lda nda apac hlda rda sda
ETH-80 64.3 64.3 59.8 73.6 56.5 71.6 70.6
AR database 58.6 77.7 77.0 59.1 67.5 78.6 77.7
SPDM 81.5 66.5 48.8 81.1 65.3 59.5 66.1
Monk1 81.3 69.0 68.3 81.0 84.2 72.0 75.7
Monk2 66.7 67.4 82.6 79.6 83.6 60.0 67.4
Monk3 87.3 70.6 83.6 88.4 84.5 86.3 85.9
Ionosphere 92.1 74.8 88.8 92.1 88.7 82.8 93.4*
Pima 64.3 57.7 69.1 62.5 68.5 66.7 57.7

The linear and nonlinear feature extraction methods described earlier are used

to find an appropriate low-dimensional representation of the data. Here, we use

the classical RBF kernel defined earlier. In this low-dimensional space, we provide

classification results obtained with three methods: the classical nearest neighbor

Table 2.7: Recognition rates (%) with the classification method of [75].
Data set ksdaK ksdaF ksdaCV kdaK kdaF kdaCV kpca
ETH-80 84.6* 73.9 76.4 84.6* 82.8 72.9 60.3
AR database 89.6* 78.5 85.1 87.5 86.7 85.1 49.5
SPDM 84.9* 75.3 83.9 84.9* 83.4 82.6 75.0
Monk1 88.0* 76.6 82.9 87.3 87.7 88.7* 77.3
Monk2 82.9* 77.5 75.7 82.9* 82.9* 78.5 58.6
Monk3 90.5 83.3 86.3 92.6 92.4 91.2 91.2
Ionosphere 92.8* 84.8 86.1 89.1 86.8 86.8 82.1
Pima 78.6* 76.8 76.2 76.2 73.0 69.0 60.7
Data set pca lda nda apac hlda rda sda
ETH-80 67.1 64.3 63.5 71.2 59.1 71.6 72.3
AR database 44.5 70.9 77.3 60.2 67.5 78.6 70.9
SPDM 77.0 56.2 50.2 81.2 53.4 59.5 69.5
Monk1 78.2 67.4 77.8 69.4 71.5 72.0 79.2
Monk2 56.7 70.6 70.6 70.4 58.3 60.0 70.6
Monk3 89.7 70.8 91.9 89.6 93.8* 86.3 90.5
Ionosphere 82.1 74.8 83.4 91.1 94.0* 82.8 89.4
Pima 70.2 57.7 70.2 63.8 72.6 66.7 57.7

(NN) classifier, the extension of K-NN defined in [75], and a linear Support Vector

Machine (SVM). The recognition results are shown in Tables 2.6-2.8.

From these results, it is clear that, on average, the derived KBA criterion achieves

higher classification rates than the Fisher criterion and CV. As expected, KSDA

generally yields superior results to KDA. This is due to the added flexibility in

modeling the underlying class distributions in the kernel space provided by KSDA. To

illustrate the effectiveness of the proposed criterion in KSDA, we show the smooth-

ness of the function optimized by the criterion in Fig. 2.7 for four of the data-sets.

Note how these functions can be readily optimized using gradient ascent. It is also

interesting to note that the optimal value of σ remains relatively constant for different

Table 2.8: Recognition rates (%) with linear SVM.
Data set ksdaK ksdaF ksdaCV kdaK kdaF kdaCV kpca
ETH-80 84.2* 73.6 77.4 84.2* 82.2 71.3 65.3
AR database 86.7* 79.6 83.1 85.3 86.7* 83.1 42.1
SPDM 84.3* 84.6* 82.3 84.3* 83.6 82.6 66.7
Monk1 87.3 88.2 86.1 87.3 89.7* 86.1 88.4*
Monk2 82.9* 81.5 73.8 82.9* 75.2 75.1 50.0
Monk3 93.5 91.9 94.4* 91.9 89.1 81.5 94.4
Ionosphere 92.6 86.1 96.7* 89.1 86.1 82.1 82.1
Pima 79.8* 78.6 79.8* 77.4 75.0 72.8 64.3
Data set pca lda nda apac hlda rda sda
ETH-80 60.1 65.3 61.8 68.4 68.4 71.6 67.8
AR database 66.7 79.3 69.7 67.2 70.1 78.6 79.3
SPDM 76.5 50.3 49.0 82.1 69.3 59.5 69.0
Monk1 67.8 65.6 66.4 67.8 68.5 72.0 66.7
Monk2 67.1 67.1 67.5 65.6 67.1 60.0 67.1
Monk3 81.3 63.9 83.3 80.6 81.9 86.3 84.7
Ionosphere 84.8 84.8 88.1 93.4 93.4 82.8 90.1
Pima 68.6 64.9 76.8 77.4 76.2 66.7 64.9

values of H_i. This smoothness in the change of the criterion is what allows us to find

the global optimum efficiently.

2.6 Conclusions

In this chapter, we have proposed two approaches to do kernel learning in dis-

criminant analysis. The first approach optimizes the parameters of a kernel whose

function is to map the original class distributions to a space where these are optimally

(w.r.t. Bayes) separated with a hyperplane. We have achieved this by selecting the

kernel parameters that make the class Normal distributions most homoscedastic while

maximizing class separability. Experimental results on a large variety of datasets have

demonstrated that this approach achieves higher recognition rates than most other

Figure 2.7: Plots of the value of the derived criterion as a function of the kernel parameter
σ and the number of subclasses H_i. From left to right and top to bottom: AR, ETH-80,
Monk 1, and Ionosphere databases.
methods defined to date. We have also shown that adding the subclass divisions to

the optimization process (KSDA) allows the DA algorithm to achieve better gener-

alizations. And, we have formally defined the relationship between KDA and other

variants of DA, such as weighted DA, NDA and SDA.

The second approach we have defined is directly related to the Bayes error. We first

derive a function which computes the Bayes accuracy, defined as one minus the Bayes

error, in the kernel space. Thus, the goal is to find that kernel representation where

the highest classification accuracy is achieved. Extensive experimental results on a

number of databases show that the derived approach yields superior classification

results to those given by existing algorithms. Moreover, we have demonstrated that,

when used in KSDA, the proposed criterion can accurately recover the underlying

structure of the class distributions.

CHAPTER 3

MODEL SELECTION IN KERNEL METHODS IN


REGRESSION

3.1 Introduction

Regression analysis has been a very active topic in machine learning and pattern

recognition, with applications in many problems in science and engineering. In a

standard regression problem, a linear or nonlinear model is estimated from the data

such that the functional relationship between the dependent variables and the inde-

pendent variables can be established. Of late, regression with kernel methods [96, 85]

has become popular. The success of the kernel methods in regression comes from the

fact that they facilitate the estimation of nonlinear functions using well-defined and

well-tested approaches in, for example, computer vision [99], signal processing [94], and

bioinformatics [76].

In kernel-based regression, the goal is to find a kernel mapping that converts the

original nonlinear problem (defined in the original space) into a linear one (in the

kernel space) [84]. In practice, this mapping is done using a pre-determined nonlinear

function. Given this function, the main challenge is to find those parameters of the

function that convert a nonlinear problem into a linear one. Thus, the selection of

these kernel parameters is a type of model selection. This is the problem we consider

in this chapter: defining a criterion for the selection of the appropriate parameters

of this kernel function.

The selection of the appropriate parameters of a kernel is a challenging problem [113].

If the parameters were chosen to minimize the model fit, we would generally have an

over-fitting to the training data. As a consequence, the regressed function would not

be able to estimate the testing data correctly. A classical solution is to find a good

fit, while keeping the complexity of the function low, e.g., using a polynomial of lower

order [42]. However, if the parameters are selected to keep the complexity too low,

then we will under-fit the data. In both these cases, the regressed function will have

a poor generalization, i.e., a high prediction error to the testing data. In general, the

kernel parameters should be selected to achieve an appropriate trade-off between the

model fit and model complexity.

As in KDA (Chapter 2), the most widely employed technique to do selection

of the kernel parameters is k-fold cross-validation (CV) [88]. In this approach, the

performance of the prediction models is evaluated by setting aside a validation set

within the training set. The model which produces the smallest validation error is

selected. Unfortunately, this method has three known major drawbacks. First, it is

computationally expensive. Second, only part of the training data is used to estimate

the model parameters. When doing model selection, one wants to employ the largest

possible number of training samples, since this is known to yield better generalizations

[63]. Third, the value of k as a parameter plays a major role in the process. Note that

the value of k affects the trade-off between the fitting error and the model complexity,

yet general methods for selecting an appropriate value do not exist.

An alternative to CV is Generalized CV (GCV) [37, 96], an efficient approximation

to the leave-one-out CV. GCV has been efficiently applied to some model selection

problems [42, 116]. However, since it approximates the leave-one-out CV, the es-

timated result generally has a large variance, i.e., the regressed function is highly

variable and dependent on the training data.

Because a single kernel may not be sufficient to describe the data, multiple kernel

learning (MKL) [51, 87] has recently attracted much attention as a potential alterna-

tive. In [76], MKL is applied to Support Vector Regression (SVR). The coefficients

that determine the combination of kernels are learned using a constrained quadratic

programming problem. This method was shown to outperform CV in some applica-

tions. Unfortunately, the selection of the kernel functions and associated parameters

remains an open problem. In another approach, the regression problem is first refor-

mulated as a probabilistic model using Bayesian inference; then, the kernel parameters

are selected by maximizing the marginal data likelihood. This approach has been used

to define the well-known Gaussian processes for regression [104]. It has been shown

[80] that the marginal likelihood has the nice property of automatically incorporat-

ing a trade-off between model fit and model complexity. However, since the Bayesian

learning generally leads to analytically intractable posteriors, approximations are nec-

essary and the resulting methods are generally computationally expensive. Furthermore, the

determination of the priors for the parameters is an intrinsic problem in Bayesian

learning with no clear solution.

In this chapter, we resolve the kernel optimization problem using a completely

novel approach. In our proposed approach, the two measures of model fit and

model complexity are simultaneously minimized using a multiobjective optimization

(MOP) framework through the study of Pareto-optimal solutions. MOP and Pareto-

optimality are specifically defined to find the global minima of several combined cri-

teria. To this end, we will first derive a new criterion for model complexity which can

be employed in kernel methods in regression. We then define a method using MOP

and derive a new approach called the modified ε-constraint method. We show that this newly

derived approach achieves the lowest mean square error. We provide extensive com-

parisons with the state of the art in kernel methods for regression and on approaches

for model selection. The results show that the proposed framework generally leads

to better generalizations for the (unseen) testing samples.

The remainder of this chapter is organized as follows. In Section 3.2 we derive

the two new measures of model fitness and model complexity. Then, in Section 3.3,

we derive a new MOP approach to do model selection. In Section 3.4, the proposed

framework is applied to two typical kernel methods in regression. Experimental results

are provided in Section 3.5. We conclude in Section 3.6.

3.2 Regression Models

We start with an analysis of the generalization error of a regression model. Given

a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}^q$, with the training samples

$(x_i, y_i)$, $i = 1, \ldots, n$, generated from a joint distribution $g(x, y)$, one wants to find

the regression model $f(x)$ that minimizes the generalization error

$$E = \int L(y, f(x))\, g(x, y)\, dx\, dy, \qquad (3.1)$$

where $f(x) = (f_1(x), \ldots, f_q(x))^T$ is the regression function, $f_i(\cdot) : \mathbb{R}^p \to \mathbb{R}$ is

the $i$th regression function, and $L(y, f(x))$ is a given loss function, for instance, the

quadratic loss $L(y, f(x)) = \frac{1}{2}\|y - f(x)\|_2^2 = \frac{1}{2}\sum_{i=1}^{q}(y_i - f_i(x))^2$.
3.2.1 Generalization error

Holmstrom and Koistinen [44] show that by adding noise to the training samples

(both x and y), the estimation of the generalization error is asymptotically consistent,

i.e., as the number of training examples approaches infinity, the estimated general-

ization error is equivalent to the true one. The addition of noise can be interpreted

as generating additional training samples.

For convenience, denote the training set of $n$ pairs of observation and prediction

vectors by $z_i = (x_i, y_i)$, $i = 1, \ldots, n$, $z_i \in \mathbb{R}^m$, $m = p + q$. Then, the generalization

error can be rewritten as

$$E = \int L(z)\, g(z)\, dz. \qquad (3.2)$$

Assume that the training samples $z_i$ are corrupted by the noise $\epsilon$, and suppose the

distribution of $\epsilon$ is $\pi(\epsilon)$. The noise distribution is generally chosen to have zero mean

and to be uncorrelated, i.e.,

$$\int \epsilon_i\, \pi(\epsilon)\, d\epsilon = 0, \qquad (3.3)$$

$$\int \epsilon_i \epsilon_j\, \pi(\epsilon)\, d\epsilon = \lambda\, \delta_{ij}, \qquad (3.4)$$

where $\lambda$ is the variance of the noise distribution, and $\delta_{ij}$ is the delta function, with

$\delta_{ij} = 1$ when $i = j$ and $\delta_{ij} = 0$ otherwise.

We consider the following steps for generating new training samples by introducing

additive noise:

1) Randomly select a sample $z_i$ from the training set.

2) Draw a sample noise vector $\epsilon_i$ from $\pi(\epsilon)$.

3) Set $z = z_i + \epsilon_i$.
Thus, the distribution of a particular sample $z$ generated from the training sample

$z_i$ is given by $\pi(\epsilon_i) = \pi(z - z_i)$. Then, the distribution of $z$ generated from the entire

training set is

$$g(z) = \frac{1}{n}\sum_{i=1}^{n} \pi(z - z_i). \qquad (3.5)$$

The above result can be viewed as a kernel density estimator of the true distri-

bution of the data $g(z)$ [44]. The distribution of the noise $\pi(\cdot)$ is the kernel function

used in the estimator.

Substituting (3.5) into (3.2), we have

$$E = \int L(z)\, g(z)\, dz = \frac{1}{n}\sum_{i=1}^{n} \int L(z)\, \pi(z - z_i)\, dz. \qquad (3.6)$$

Let $z - z_i = \epsilon_i$; then (3.6) is reformulated as

$$E = \frac{1}{n}\sum_{i=1}^{n} \int L(z_i + \epsilon_i)\, \pi(\epsilon_i)\, d\epsilon_i. \qquad (3.7)$$

We expand $L(z + \epsilon)$ as a Taylor series in powers of $\epsilon$, i.e.,

$$L(z + \epsilon) = L(z) + \sum_{i=1}^{m} \epsilon_i \frac{\partial L(z)}{\partial z_i} + \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \epsilon_i \epsilon_j \frac{\partial^2 L(z)}{\partial z_i \partial z_j} + O(\epsilon^3). \qquad (3.8)$$

Assuming that the noise amplitude is small, the higher order term $O(\epsilon^3)$ can be

neglected. Combining (3.8) with (3.7), (3.3) and (3.4), we obtain

$$E = \frac{1}{n}\sum_{i=1}^{n}\left[ L(z_i) + \frac{\lambda}{2}\sum_{j=1}^{m} \frac{\partial^2 L(z_i)}{\partial z_j^2} \right]
  = \frac{1}{n}\sum_{i=1}^{n} L(z_i) + \frac{\lambda}{2n}\sum_{i=1}^{n}\sum_{j=1}^{m} \frac{\partial^2 L(z_i)}{\partial z_j^2}. \qquad (3.9)$$
Let $L(z)$ be the quadratic loss, i.e., $L(z) = \frac{1}{2}\sum_{i=1}^{q}(y_i - f_i(x))^2$. Then,

$$\sum_{j=1}^{m} \frac{\partial^2 L(z_i)}{\partial z_j^2}
 = \frac{1}{2}\sum_{k=1}^{q}\left[ \sum_{j=1}^{p} \frac{\partial^2 (y_{ik} - f_k(x_i))^2}{\partial x_{ij}^2} + \sum_{j=1}^{q} \frac{\partial^2 (y_{ik} - f_k(x_i))^2}{\partial y_{ij}^2} \right]$$
$$ = \sum_{k=1}^{q}\sum_{j=1}^{p}\left[ \left(\frac{\partial f_k(x_i)}{\partial x_{ij}}\right)^2 + (f_k(x_i) - y_{ik})\,\frac{\partial^2 f_k(x_i)}{\partial x_{ij}^2} \right] + q, \qquad (3.10)$$

where $y_{ij}$ is the $j$th entry of vector $y_i$ and $x_{ij}$ is the $j$th entry of vector $x_i$. Substituting

(3.10) into (3.9), we have

$$E = E_f + \lambda\, E_c, \qquad (3.11)$$

with

$$E_f = \frac{1}{2n}\sum_{i=1}^{n} \|y_i - f(x_i)\|_2^2 \qquad (3.12)$$

and

$$E_c = \frac{1}{2n}\sum_{i=1}^{n}\sum_{k=1}^{q}\sum_{j=1}^{p}\left[ \left(\frac{\partial f_k(x_i)}{\partial x_{ij}}\right)^2 + (f_k(x_i) - y_{ik})\,\frac{\partial^2 f_k(x_i)}{\partial x_{ij}^2} \right] + \frac{q}{2}. \qquad (3.13)$$

Therefore, the generalization error consists of two terms. The first term Ef mea-

sures the discrepancy between the training data and the estimated model, i.e., the

model fit. The second term Ec measures the roughness of the estimated function pro-

vided by the first and second derivatives of the function, i.e., the model complexity. It

controls the smoothness of the function to prevent it from overfitting. The parameter

λ controls the trade-off between the model fit and model complexity.

In order to minimize the generalization error E, we need to minimize both Ef and

Ec . However, due to the bias and variance trade-off [42], a decrease in the model fit

may result in an increase in the model complexity and vice-versa. The regularization

parameter λ may achieve a balance between the model fit and complexity to some

extent; however, there are two limitations when selecting λ to do model selection. First,

a good λ should be chosen beforehand. A common way is to use cross-validation, but

this suffers from several drawbacks as we discussed earlier. Second, note that our goal

is to simultaneously minimize model fit and model complexity. An ideal solution is

that we cannot further decrease one without increasing the other. This means that

even when the appropriate λ is selected, minimizing E is not directly related to our

goal. To solve these problems, we derive a multiobjective optimization approach in

Section 3.3. We first derive the kernel models for model fit Ef and model complexity

Ec .

3.2.2 Model fit

We start by considering the standard linear regression model, $f(x) = W^T x$, where

$W = (w_1, \ldots, w_q)$ is a $p \times q$ weight matrix, with $w_i \in \mathbb{R}^p$. And, we assume all the

vectors are standardized.

We can rewrite the above model as $f_i(x) = w_i^T x$, $i = 1, \ldots, q$. In kernel methods

for regression, each sample $x$ is mapped to $\phi(x)$ in a reproducing kernel Hilbert

space as $\phi(\cdot) : \mathbb{R}^p \to \mathcal{F}$. With this, we can write $f_i(x) = w_i^T \phi(x)$, $i = 1, \ldots, q$.

The Representer Theorem [96] enables us to use $w_i = \Phi(X)\alpha_i$, where $\Phi(X) =$

$(\phi(x_1), \ldots, \phi(x_n))$ and $\alpha_i$ is an $n \times 1$ coefficient vector. Putting everything together,

we get

$$f_i(x) = \alpha_i^T \Phi(X)^T \phi(x) = \alpha_i^T k(x) = \sum_{j=1}^{n} \alpha_{ij} \langle x_j, x \rangle, \quad i = 1, \ldots, q, \qquad (3.14)$$

where $\alpha_{ij}$ is the $j$th element in $\alpha_i$, and $\langle x_j, x \rangle$ is a kernel function on $x_j$ and $x$.

Using the results derived thus far, we can write $E_f$ as

$$E_f = \sum_{i=1}^{q} (y_i - K\alpha_i)^T (y_i - K\alpha_i), \qquad (3.15)$$

where $K = \Phi(X)^T \Phi(X)$ is the $n \times n$ kernel matrix, $y_i = (y_{1i}, \ldots, y_{ni})^T$ is an $n \times 1$

vector, and $y_{ji}$ is the $i$th entry of $y_j$.
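For concreteness, (3.14) is nothing more than a kernel-weighted sum over the training samples. A minimal NumPy sketch (with an RBF kernel; the function and variable names are illustrative, not part of the dissertation's code) is:

import numpy as np

def predict(X_train, alpha, X_new, sigma):
    """Evaluate f_i(x) = sum_j alpha_{ij} k(x_j, x) of (3.14); alpha has shape (q, n)."""
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    K_new = np.exp(-d2 / (2 * sigma ** 2))     # n_new x n matrix of k(x_j, x)
    return K_new @ alpha.T                     # n_new x q predictions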

3.2.3 Roughness penalty in RBF

We now derive solutions of Ec for two of the most used kernel functions, the Radial

Basis Function (RBF) and the polynomial kernels.


 
The RBF kernel is given by $\langle x_i, x_j \rangle = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, where $\sigma$ is the kernel

parameter. Since $f_l(x) = \sum_{m=1}^{n} \alpha_{lm} \langle x_m, x \rangle$, the partial derivatives are given by

$$\frac{\partial f_l(x_i)}{\partial x_{ij}} = \frac{\partial \sum_{m=1}^{n} \alpha_{lm} \langle x_m, x_i \rangle}{\partial x_{ij}}
 = \frac{1}{\sigma^2}\sum_{m=1}^{n} \alpha_{lm} \exp\left(-\frac{\|x_m - x_i\|^2}{2\sigma^2}\right)(x_{mj} - x_{ij}).$$

Writing this result in matrix form,

$$\sum_{j=1}^{p} \left(\frac{\partial f_l(x_i)}{\partial x_{ij}}\right)^2 = \alpha_l^T R_i \alpha_l,$$

where $R_i = \frac{1}{\sigma^4} W_i W_i^T$, and $W_i$ is an $n \times p$ matrix whose $j$th column is equal to

$\left( \exp\left(-\frac{\|x_1 - x_i\|^2}{2\sigma^2}\right)(x_{1j} - x_{ij}), \ldots, \exp\left(-\frac{\|x_n - x_i\|^2}{2\sigma^2}\right)(x_{nj} - x_{ij}) \right)^T$.

And, the second partial derivatives are given by

$$\frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2}
 = \frac{1}{\sigma^2}\sum_{m=1}^{n} \alpha_{lm} \exp\left(-\frac{\|x_m - x_i\|^2}{2\sigma^2}\right)\left(\frac{(x_{mj} - x_{ij})^2}{\sigma^2} - 1\right) = \alpha_l^T p_{ij},$$

where $p_{ij}$ is an $n \times 1$ vector whose $m$th entry is $\frac{1}{\sigma^2}\exp\left(-\frac{\|x_m - x_i\|^2}{2\sigma^2}\right)\left(\frac{(x_{mj} - x_{ij})^2}{\sigma^2} - 1\right)$.

Then $\sum_{j=1}^{p} \frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2} = \alpha_l^T p_i$, where $p_i = \sum_{j=1}^{p} p_{ij}$.

Thus,

$$\sum_{j=1}^{p} (f_l(x_i) - y_{il})\,\frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2} = (\alpha_l^T k_i - y_{il})\,\alpha_l^T p_i = \alpha_l^T (k_i p_i^T)\alpha_l - y_{il}\, p_i^T \alpha_l,$$

where $k_i = (\langle x_1, x_i \rangle, \ldots, \langle x_n, x_i \rangle)^T$.

Using the above results, we can define the roughness penalty function in the RBF

kernel space as

$$E_c = \sum_{l=1}^{q} \left( \alpha_l^T M \alpha_l - q_l^T \alpha_l \right) + \frac{q}{2}, \qquad (3.16)$$

where $M = \frac{1}{2n}\sum_{i=1}^{n} (R_i + k_i p_i^T)$, and $q_l = \frac{1}{2n}\sum_{i=1}^{n} y_{il}\, p_i$.
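The quantities in (3.15) and (3.16) are straightforward to evaluate numerically. The following NumPy sketch computes E_f and E_c for a given coefficient matrix, following the reconstructed expressions above (the function names are ours, and the additive constant of (3.16) is omitted since it does not affect the optimization):

import numpy as np

def fit_and_complexity(X, Y, alpha, sigma):
    """E_f of (3.15) and E_c of (3.16) for an RBF expansion; X is n x p, Y is n x q, alpha is q x n."""
    n, p = X.shape
    q = Y.shape[1]
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D / (2 * sigma ** 2))
    Ef = sum((Y[:, l] - K @ alpha[l]) @ (Y[:, l] - K @ alpha[l]) for l in range(q))
    M = np.zeros((n, n))
    P = np.zeros((n, n))                                           # row i holds p_i
    for i in range(n):
        E = np.exp(-((X - X[i]) ** 2).sum(1) / (2 * sigma ** 2))   # k(x_m, x_i), m = 1..n
        W = E[:, None] * (X - X[i])                                # W_i, n x p
        R = W @ W.T / sigma ** 4                                   # R_i
        p_i = (E[:, None] * ((X - X[i]) ** 2 / sigma ** 2 - 1)).sum(1) / sigma ** 2
        P[i] = p_i
        M += R + np.outer(K[:, i], p_i)                            # R_i + k_i p_i^T
    M /= 2 * n
    Ec = 0.0
    for l in range(q):
        q_l = (Y[:, l][:, None] * P).sum(0) / (2 * n)
        Ec += alpha[l] @ M @ alpha[l] - q_l @ alpha[l]
    return Ef, Ec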

Figure 3.1: The two plots in this figure show the contradiction between the RSS and the
curvature measure with respect to: (a) the kernel parameter σ, and (b) the regularization
parameter τ in Kernel Ridge Regression. The Boston Housing data-set [7] is used in this
example. Note that in both cases, while one criterion increases, the other decreases. Thus,
a compromise between the two criteria ought to be determined.

3.2.4 Polynomial kernel

A polynomial kernel of degree $d$ is given by $\langle x_i, x_j \rangle = (x_i^T x_j + 1)^d$. Its partial

derivatives are

$$\frac{\partial f_l(x_i)}{\partial x_{ij}} = \frac{\partial \sum_{m=1}^{n} \alpha_{lm} (x_m^T x_i + 1)^d}{\partial x_{ij}}
 = \sum_{m=1, m \neq i}^{n} \alpha_{lm}\, d\, (x_m^T x_i + 1)^{d-1} x_{mj} + 2\alpha_{li}\, d\, (x_i^T x_i + 1)^{d-1} x_{ij}.$$

We can write the above result in matrix form as

$$\sum_{j=1}^{p} \left(\frac{\partial f_l(x_i)}{\partial x_{ij}}\right)^2 = \alpha_l^T B_i \alpha_l,$$

where $B_i = d^2 C_i C_i^T$, and $C_i$ is an $n \times p$ matrix whose $j$th column is equal to

$\left( (x_1^T x_i + 1)^{d-1} x_{1j}, \ldots, 2(x_i^T x_i + 1)^{d-1} x_{ij}, \ldots, (x_n^T x_i + 1)^{d-1} x_{nj} \right)^T$.

The second partial derivatives are

$$\frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2}
 = d\,\alpha_{li}\,(x_i^T x_i + 1)^{d-2}\left[ 3(d-1)x_{ij}^2 + 2(x_i^T x_i + 1) \right]
 + d(d-1)\sum_{m=1, m \neq i}^{n} \alpha_{lm}\,(x_m^T x_i + 1)^{d-2} x_{mj}^2 = \alpha_l^T g_{ij},$$

where $g_{ij}$ is an $n \times 1$ vector whose $m$th ($m \neq i$) entry is $d(d-1)(x_m^T x_i + 1)^{d-2} x_{mj}^2$

and whose $i$th entry is $d(x_i^T x_i + 1)^{d-2}\left[ 3(d-1)x_{ij}^2 + 2(x_i^T x_i + 1) \right]$. Then,

$\sum_{j=1}^{p} \frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2} = \alpha_l^T g_i$, where $g_i = \sum_{j=1}^{p} g_{ij}$.

Thus,

$$\sum_{j=1}^{p} (f_l(x_i) - y_{il})\,\frac{\partial^2 f_l(x_i)}{\partial x_{ij}^2} = (\alpha_l^T k_i - y_{il})\,\alpha_l^T g_i = \alpha_l^T (k_i g_i^T)\alpha_l - y_{il}\, g_i^T \alpha_l.$$

Using the derivations above, the roughness function for the polynomial kernel can be

written as

$$E_c = \sum_{l=1}^{q} \left( \alpha_l^T N \alpha_l - u_l^T \alpha_l \right) + \frac{q}{2}, \qquad (3.17)$$

where $N = \frac{1}{2n}\sum_{i=1}^{n} (B_i + k_i g_i^T)$, and $u_l = \frac{1}{2n}\sum_{i=1}^{n} y_{il}\, g_i$.

3.2.5 Comparison with other complexity measure

Thus far, we have introduced a new model complexity measure $E_c$, which is related

to the derivatives of the regressed function $f(x)$. A commonly seen alternative in the

literature is to penalize the norm of the regression function instead, the $L_2$ norm in the

reproducing kernel Hilbert space being the most commonly used norm in this approach. This section

provides a theoretical comparison between the approach derived in this chapter and

this classical L2 norm alternative. In particular, we show that the L2 norm does

not penalize the high frequencies of the regression function, whereas the proposed

criterion emphasizes smoothness by penalizing the high frequency components of this

function.

To formally prove the above result, we write the generalized Fourier series of $f(x)$,

$$f(x) = \sum_{k=0}^{\infty} a_k \psi_k(x),$$

where $\{\psi_k(x)\}_{k=0}^{\infty}$ forms a complete orthonormal basis and the $a_k$ are the corresponding

coefficients. A commonly used basis is $\{\sin kx, \cos kx\}_{k=0}^{\infty}$ on

$[-\pi, \pi]$, with $k$ the index of the frequency component. Using this basis set, $f(x)$ can

be written as

$$f(x) = a_0 + \sum_{k=1}^{\infty} (a_k \sin kx + b_k \cos kx), \qquad (3.18)$$

where $a_k$ and $b_k$ are the coefficients of each frequency component.

Let $\|f\|_{\mathcal{H}}$ be the function norm defining the reproducing kernel Hilbert space; then

the $L_2$ norm of $f$ is

$$\|f\|_{\mathcal{H}}^2 = \int |f(x)|^2\, dx
 = \int \left( a_0 + \sum_{k=1}^{\infty} (a_k \sin kx + b_k \cos kx) \right)^2 dx
 = 2\pi a_0^2 + \pi \sum_{k=1}^{\infty} (a_k^2 + b_k^2). \qquad (3.19)$$

Note that in this case all the coefficients are weighted equally, regardless of the frequency

component.
The complexity measure derived in the present chapter and given in (3.13) can be

reformulated as

$$E_c = \int \left[ \left(\frac{\partial f(x)}{\partial x}\right)^2 + (f(x) - y)\,\frac{\partial^2 f(x)}{\partial x^2} \right] dx, \qquad (3.20)$$

where we have neglected the additive constant.

Moreover, recall from (3.11) that the regression function can be expressed

as $f(x) = y + O(\epsilon)$ [6]. Hence, substituting (3.18) into (3.20) yields

$$E_c = \int \left( \sum_{k=1}^{\infty} k\,(a_k \cos kx - b_k \sin kx) \right)^2 dx
 = \pi \sum_{k=1}^{\infty} k^2 (a_k^2 + b_k^2). \qquad (3.21)$$

Compared to the $L_2$ norm result shown in (3.19), the complexity measure (3.21)

of the proposed approach penalizes the higher frequency components of the regressed

function. This is due to the square of the index of the frequency component seen in

(3.21). By emphasizing lower frequencies, the proposed criterion will generally select

smoother functions than those selected by the $L_2$ norm method.

A numerical comparison is provided in Section 3.5. For this, we will need the

explicit equation of the $L_2$ norm of the regression function $f$ in the kernel space. This

is given by

$$\|f\|_{\mathcal{H}}^2 = \sum_{i=1}^{q} \|f_i\|_{\mathcal{H}}^2
 = \sum_{i=1}^{q}\sum_{j=1}^{n}\sum_{k=1}^{n} \alpha_{ij}\alpha_{ik}\, \langle x_j, x_k \rangle
 = \sum_{i=1}^{q} \alpha_i^T K \alpha_i. \qquad (3.22)$$
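The frequency weighting discussed above can also be checked numerically. The following short NumPy sketch compares the two penalties for pure sinusoids of increasing frequency (a toy check, not part of the experiments of Section 3.5):

import numpy as np

# frequency weighting of the two penalties for f(x) = sin(kx) on [-pi, pi]
x = np.linspace(-np.pi, np.pi, 20001)
dx = x[1] - x[0]
for k in (1, 2, 4, 8):
    f = np.sin(k * x)
    l2 = (f ** 2).sum() * dx                       # L2 penalty (3.19): ~pi, independent of k
    curv = (np.gradient(f, dx) ** 2).sum() * dx    # leading term of (3.21): ~pi * k^2
    print(k, round(l2, 2), round(curv, 2))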
3.3 Multiobjective Optimization

The parameters in kernel approaches in regression can now be optimized by simul-

taneously minimizing Ef and Ec of the corresponding fitting function described in

the preceding section. Of course, in general, the global minima of these two functions

are not the same. For instance, a decrease in the fitting error may lead to an increase

in the roughness of the function, and vice-versa. This trade-off is depicted in Figure

3.1. In the plots in this figure, we show the performance of the two criteria with

respect to their corresponding parameters, i.e., the kernel parameter σ and the

regularization parameter τ. As can be observed in the figure, the criteria do not share

a common global minimum. To resolve this problem, we now derive a multiobjective

optimization approach.

3.3.1 Pareto-Optimality

As its name implies, multiobjective optimization (MOP) is concerned with the

simultaneous optimization of more than one objective function. More formally, MOP

can be stated as follows,

$$\begin{aligned} &\text{minimize} && u_1(\theta),\, u_2(\theta),\, \ldots,\, u_k(\theta) \\ &\text{subject to} && \theta \in S, \end{aligned} \qquad (3.23)$$

where we have $k$ objective functions $u_i : \mathbb{R}^p \to \mathbb{R}$, and $S \subset \mathbb{R}^p$ is the set of possible vec-

tors. Denote the vector of objective functions by $z = u(\theta) = (u_1(\theta), u_2(\theta), \ldots, u_k(\theta))^T$,

and the decision vectors as $\theta = (\theta_1, \theta_2, \ldots, \theta_p)^T$.

The goal of MOP is to find the $\theta$ that simultaneously minimizes all $u_j(\cdot)$. If

all functions shared a common minimum, the problem would be trivial. In general,

however, the objective functions contradict one another. This means that minimizing

one function can increase the value of the others. Hence, a compromise solution

is needed to attain a maximal agreement of all the objective functions [66]. The

solutions of the MOP problem are called Pareto-optimal solutions. To provide a

formal definition, let us first state another important concept.

Definition 4. A decision vector $\theta^1$ is said to dominate $\theta^2$ if $u_i(\theta^1) \le u_i(\theta^2)$ for all

$i = 1, \ldots, k$ and $u_j(\theta^1) < u_j(\theta^2)$ for at least one index $j$.

This definition now allows us to give the following formal presentation of Pareto-

optimality.

Definition 5. A decision vector $\theta^* \in S$ is Pareto-optimal if there does not exist

another decision vector $\theta \in S$ for which $u_i(\theta) \le u_i(\theta^*)$ for all $i = 1, \ldots, k$ and

$u_j(\theta) < u_j(\theta^*)$ for at least one index $j$.

In other words, a Pareto-optimal solution is not dominated by any other decision

vector. Similarly, an objective vector $z^* \in Z\,(= u(S))$ is called Pareto-optimal if the

decision vector corresponding to it is Pareto-optimal. We can see that such a vector is

the one where none of the components can be improved without deteriorating one or

more of the others. In most problems, there will be many Pareto-optimal solutions.

This set of Pareto-optimal solutions is called the Pareto-optimal set or Pareto-frontier.
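Pareto-optimality is easy to test on a finite set of candidate objective vectors. A small NumPy sketch (with illustrative names and a toy one-dimensional decision variable) is:

import numpy as np

def pareto_front(U):
    """Boolean mask of Pareto-optimal rows of U (n_points x k objectives, minimization)."""
    n = U.shape[0]
    optimal = np.ones(n, dtype=bool)
    for i in range(n):
        # row i is dominated if some other row is <= in every objective and < in at least one
        dominated = np.all(U <= U[i], axis=1) & np.any(U < U[i], axis=1)
        if dominated.any():
            optimal[i] = False
    return optimal

# toy usage: two objectives evaluated on a grid of candidate parameters
theta = np.linspace(0, 6, 200)
U = np.column_stack([(theta - 1) ** 2, (theta - 5) ** 2])
print(theta[pareto_front(U)])   # roughly the interval [1, 5]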

3.3.2 The ε-constraint approach

One classical method to find the Pareto-optimal solutions is the ε-constraint ap-

proach [39]. In this case, one of the objective functions is optimized while the others

are considered as constraints. This is done by defining constraints as upper-bounds

of their objective functions. Therefore, the problem to be solved can be formulated

Figure 3.2: Here we show a case of two objective functions. u(S) represents the set of all
the objective vectors with the Pareto frontier colored in red. The Pareto-optimal solution
can be determined by minimizing u_1 given that u_2 is upper-bounded by ε_2.

as follows,

$$\begin{aligned} &\arg\min_{\theta}\; u_l(\theta) \\ &\text{subject to}\;\; u_j(\theta) \le \epsilon_j, \;\; \text{for all } j = 1, \ldots, k,\; j \neq l, \\ &\qquad\qquad\; \theta \in S, \end{aligned} \qquad (3.24)$$

where $l \in \{1, \ldots, k\}$.

Figure 3.2 demonstrates the idea behind this approach. In this figure, we show

a bi-objective example, $k = 2$. The Pareto-optimal solution is determined by

minimizing $u_1$ provided that $u_2$ is upper-bounded by $\epsilon_2$.
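As an illustration of (3.24), the following SciPy sketch traces solutions of the ε-constraint scalarization for a toy bi-objective problem; the toy objectives and the SLSQP solver are illustrative choices only, not the problems or solver used in this dissertation.

import numpy as np
from scipy.optimize import minimize

u1 = lambda t: (t[0] - 1.0) ** 2
u2 = lambda t: (t[0] - 5.0) ** 2

def eps_constraint(eps, t0=np.array([3.0])):
    cons = {"type": "ineq", "fun": lambda t: eps - u2(t)}   # u2(theta) <= eps
    return minimize(u1, t0, constraints=[cons], method="SLSQP").x

for eps in (16.0, 9.0, 4.0, 1.0):
    t = eps_constraint(eps)
    print(eps, t.round(3), u1(t).round(3), u2(t).round(3))
# as eps shrinks, the solutions trace different (weakly) Pareto-optimal trade-offs between u1 and u2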

Before exploring the Pareto-optimality of the ε-constraint method, let us look at

a weaker definition of the term.

Definition 6. A decision vector $\theta^* \in S$ is weakly Pareto-optimal if there does not

exist another decision vector $\theta \in S$ such that $u_i(\theta) < u_i(\theta^*)$ for all $i = 1, \ldots, k$.

From the above definition, we can see that the Pareto-optimal set is a subset of

the weakly Pareto-optimal set and that a weakly Pareto-optimal solution may be

dominated by a Pareto-optimal solution.

It has been shown [66] that the solution of the ε-constraint method defined in

(3.24) is weakly Pareto-optimal. This means that the solution to (3.24) cannot be

guaranteed to be Pareto-optimal. Although the solution is determined by the pre-

specified upper-bounds $\epsilon_j$, and some $\epsilon_j$ may lead to Pareto-optimal solutions, in

practice we do not know how to choose the $\epsilon_j$ to achieve the Pareto-optimal solutions.

In the following, we propose a modified version of this method and prove that the

solution to this modified approach is guaranteed to be Pareto-optimal.

3.3.3 The modified ε-constraint

The main idea of our approach is to reformulate the constraints in (3.24) as equal-

ities. This can be achieved if the upper bounds are multiplied on the right by a scalar smaller

than or equal to $s$. Formally, $u_j(\theta) = h_j \epsilon_j$, $h_j \in [0, s]$, for all $j = 1, \ldots, k$, $j \neq l$.

Let $h = (h_1, \ldots, h_{l-1}, h_{l+1}, \ldots, h_k)^T$. Then, the modified ε-constraint method is given

by

$$\begin{aligned} &\arg\min_{\theta, h}\;\; u_l(\theta) + s \sum_{j=1, j \neq l}^{k} h_j \\ &\text{subject to}\;\; u_j(\theta) = h_j \epsilon_j, \;\; \text{for all } j = 1, \ldots, k,\; j \neq l, \\ &\qquad\qquad\; 0 \le h_j \le s, \;\; \text{for all } j = 1, \ldots, k,\; j \neq l, \\ &\qquad\qquad\; \theta \in S, \end{aligned} \qquad (3.25)$$

where $s$ is a positive constant. We can now prove the Pareto-optimality of (3.25).

Theorem 7. Select a small scalar $s$ satisfying $s \sum_{j=1, j \neq l}^{k} h_j^* \le u_l(\theta) - u_l(\theta^*)$, where

$\theta^* \in S$ and $h^*$ are the solutions of (3.25). Then, $\theta^*$ is Pareto-optimal for any given

upper-bound vector $\epsilon = (\epsilon_1, \ldots, \epsilon_{l-1}, \epsilon_{l+1}, \ldots, \epsilon_k)^T$.

Proof. Let $\theta^* \in S$ and $h^*$ be a solution of (3.25). Since $s \sum_{j=1, j \neq l}^{k} h_j^* \le u_l(\theta) - u_l(\theta^*)$,

we have $u_l(\theta^*) \le u_l(\theta)$ for all $\theta \in S$ with $u_j(\theta) = h_j \epsilon_j$, for every $j = 1, \ldots, k$, $j \neq l$.

Let us assume that $\theta^*$ is not Pareto-optimal. In this case, there exists a vector $\theta^o \in S$

such that $u_i(\theta^o) \le u_i(\theta^*)$ for all $i = 1, \ldots, k$ and $u_j(\theta^o) < u_j(\theta^*)$ for at least one index $j$.

If $j = l$, this means that $u_l(\theta^o) < u_l(\theta^*)$. Here we have a contradiction with the

fact that $u_l(\theta^*) \le u_l(\theta)$ for all $\theta \in S$.

If $j \neq l$, then $u_l(\theta^o) \le u_l(\theta^*)$, $u_j(\theta^o) < u_j(\theta^*) = h_j^* \epsilon_j$ and $u_i(\theta^o) \le u_i(\theta^*) = h_i^* \epsilon_i$

for all $i \neq j$ and $i \neq l$. Denote $u_i(\theta^o) = h_i^o \epsilon_i$, for all $i \neq l$. Then, we have $k - 1$

inequalities $h_i^o \epsilon_i \le h_i^* \epsilon_i$ with at least one strict inequality $h_j^o \epsilon_j < h_j^* \epsilon_j$. Canceling out

$\epsilon_i$ in each of the inequalities and taking their sum yields $\sum_{j=1, j \neq l}^{k} h_j^o < \sum_{j=1, j \neq l}^{k} h_j^*$.

This contradicts the fact that the solution to (3.25) minimizes $\sum_{j=1, j \neq l}^{k} h_j$.

We can demonstrate the utility of this modified ε-constraint method with the fol-

lowing two examples. In our first example, the objective functions are given by

$$u_1(x) = \begin{cases} 1 & x \le 1 \\ x^2 & \text{otherwise} \end{cases}$$

and $u_2(x) = (x - 5)^2$. In our second example, the two functions are given by $u_1(x) = 1 - e^{-(x+1)^2}$ and

$$u_2(x) = \begin{cases} 1 - e^{-(x-2)^2} & x \le 0.5 \\ 1 - e^{-2.25} & \text{otherwise.} \end{cases}$$

In both these examples, we compare the performance of the proposed modified ε-

constraint approach and the ε-constraint method. This is illustrated in Figure 3.3. In
these figures, the blue stars denote the objective vectors and the red circles represent

the solution vectors given by each of the two methods. We see that in Figure 3.3a and

3.3c, the original ε-constraint method includes weakly Pareto-optimal solutions,

whereas in Figure 3.3b and 3.3d the proposed modified approach provides the Pareto-

optimal solutions.

Using the solution defined above, we can formulate the parameter optimization

problem as follows,

$$\begin{aligned} &\arg\min_{\theta, h}\;\; E_f(\theta) + s\, h \\ &\text{subject to}\;\; E_c(\theta) = h\, \epsilon, \\ &\qquad\qquad\; 0 \le h \le s. \end{aligned} \qquad (3.26)$$

Note that given different ε's, we may have different Pareto-optimal solutions.

In our parameter optimization problem, we only need one Pareto-optimal solution.

Hence, our next goal is to define a mechanism to determine an appropriate value for

$\epsilon$. To resolve this problem, we select $\epsilon$ such that the corresponding Pareto-optimal

objective vector is as close to the ideal point as possible. Specifically, let $\theta_{\epsilon}^*$ be the

Pareto-optimal solution given $\epsilon$; then the optimal $\epsilon^*$ is

$$\epsilon^* = \arg\min_{\epsilon}\; \left[ w_f \left( E_f(\theta_{\epsilon}^*) - z_f^* \right)^2 + w_c \left( E_c(\theta_{\epsilon}^*) - z_c^* \right)^2 \right], \qquad (3.27)$$

where $z_f^*$ and $z_c^*$ are the ideal values of $E_f(\theta)$ and $E_c(\theta)$, respectively, and $w_f$, $w_c$ are

the weights associated to each of the objective functions. The incorporation of these

weights can drive the optimization to favor one objective function over the other. If

$E_f(\theta_{\epsilon}^*)$ (or $E_c(\theta_{\epsilon}^*)$) is close to its ideal value $z_f^*$ ($z_c^*$), then $w_f$ ($w_c$) should be relatively

small. But if $E_f(\theta_{\epsilon}^*)$ ($E_c(\theta_{\epsilon}^*)$) is far apart from its ideal value $z_f^*$ ($z_c^*$), then $w_f$ ($w_c$)
Figure 3.3: Comparison between the proposed modified and the original ε-constraint meth-
ods. We use * to indicate the objective vectors and o to specify the solution vectors.
Solutions given by (a) the ε-constraint method and (b) the proposed modified ε-constraint
approach on the first example, and (c) the ε-constraint method and (d) the modified ε-
constraint approach on the second example. Note that the proposed approach identifies the
Pareto-frontier, while the original algorithm identifies weakly Pareto-optimal solutions, since its
solution vectors go beyond the Pareto-frontier.
Algorithm 3.1 Modified ε-constraint algorithm
Input: Training set {(x_1, y_1), ..., (x_n, y_n)}, θ_0, h_0, ε_0, s.
1. Calculate the ideal point (z_f^*, z_c^*).
2. Specify the weights w_f and w_c using (3.28).
3. Obtain ε^* using (3.27).
4. Obtain θ^* using (3.26).
Return: The optimal model parameter θ^*.

should be large. This can be formally stated as follows,

$$w_f = |E_f(\theta_0) - z_f^*|^2, \qquad w_c = |E_c(\theta_0) - z_c^*|^2, \qquad (3.28)$$

where $\theta_0$ is the initialization for $\theta$. The proposed modified ε-constraint approach is

summarized in Algorithm 3.1.
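A compact sketch of steps 1-3 of Algorithm 3.1 is given below. Here solve_pareto stands for whatever routine solves (3.26) for a given ε (an interior-point method is used in the experiments), and the function names, bounds handling, and the grid over ε are illustrative choices rather than the dissertation's implementation.

import numpy as np
from scipy.optimize import minimize

def select_epsilon(Ef, Ec, solve_pareto, theta0, bounds, eps_grid):
    """Steps 1-3 of Algorithm 3.1: ideal point, weights (3.28), and the choice of eps via (3.27)."""
    zf = minimize(Ef, theta0, bounds=bounds).fun          # ideal value of the fit term
    zc = minimize(Ec, theta0, bounds=bounds).fun          # ideal value of the complexity term
    wf = (Ef(theta0) - zf) ** 2                           # weights of eq. (3.28)
    wc = (Ec(theta0) - zc) ** 2
    def distance(eps):                                    # objective of eq. (3.27)
        theta = solve_pareto(eps)
        return wf * (Ef(theta) - zf) ** 2 + wc * (Ec(theta) - zc) ** 2
    best_eps = min(eps_grid, key=distance)
    return best_eps, solve_pareto(best_eps)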

3.3.4 Alternative Optimization Approaches

Thus far, we have derived a MOP approach for model selection based on Pareto-

optimality. The most pressing question is to show that this derived solution

yields lower prediction errors than simpler, more straightforward approaches. Two

such criteria are the sum and the product of the two terms to be minimized [113], given

by

$$Q_{sum}(\theta) = E_f(\theta) + \gamma\, E_c(\theta) \qquad (3.29)$$

and

$$Q_{pro}(\theta) = E_f(\theta)\, E_c(\theta)^{\eta}, \qquad (3.30)$$

where $\gamma$ and $\eta$ are regularization parameters that need to be selected. Note that mini-

mizing (3.30) is equivalent to minimizing

$$\lg Q_{pro}(\theta) = \lg E_f(\theta) + \eta \lg E_c(\theta), \qquad (3.31)$$

which is the logarithm of (3.30). We could use cross-validation to select the regular-

ization parameters $\gamma$ and $\eta$. Experimental results comparing these two alternative

optimization approaches with the proposed approach will be given in the experiments

section.

3.4 Applications to Regression

Let us derive two kernel-based regression approaches using the kernels and MOP

criteria derived above. In particular, we use our derived results in Kernel Ridge

Regression (KRR) and Kernel Principal Component Regression (KPCR).

3.4.1 Kernel Ridge Regression

Ridge regression (RR) is a penalized version of the ordinary least squares (OLS)

solution. More specifically, RR regularizes the OLS solution with a penalty on the

norm of the weight vector. This regularization is used to avoid overfitting. Formally,

RR is defined as

$$w_i = (XX^T + \tau I_p)^{-1} X y_i, \quad i = 1, \ldots, q, \qquad (3.32)$$

where $X = (x_1, \ldots, x_n)$, $I_p$ is the $p \times p$ identity matrix, $y_i = (y_{1i}, \ldots, y_{ni})^T$, and $\tau$ is

the regularization parameter.

We can now extend the above solution using the kernel trick. The resulting method

is known as Kernel Ridge Regression (KRR), and is given by

$$\alpha_i = (K + \tau I_n)^{-1} y_i, \quad i = 1, \ldots, q, \qquad (3.33)$$

where, as above, K is the kernel matrix.
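For reference, (3.33) can be implemented in a few lines of NumPy; the RBF kernel and the symbol τ follow the presentation above, while the function name and the call signature are illustrative.

import numpy as np

def krr_fit_predict(X, Y, X_test, sigma, tau):
    """Kernel Ridge Regression with an RBF kernel: alpha = (K + tau*I)^(-1) Y, eq. (3.33)."""
    sq = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq(X, X) / (2 * sigma ** 2))
    alpha = np.linalg.solve(K + tau * np.eye(len(X)), Y)     # one column per output dimension
    K_test = np.exp(-sq(X_test, X) / (2 * sigma ** 2))
    return K_test @ alpha                                    # predictions via eq. (3.14)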

In KRR, there are two parameters to optimize: the kernel parameter (e.g., σ in

the RBF kernel) and the regularization parameter τ. In the following, we derive a

gradient descent method to simultaneously optimize the two.

Since both the model fit term $E_f$ and the curvature term $E_c$ are

involved in our parameter optimization problem, we need to derive the gradients of

these terms with respect to their parameters.

We start with the derivations for the RBF kernel. In this case, we have

$$\frac{\partial E_f}{\partial \sigma} = \frac{\partial \sum_{i=1}^{q} (y_i - K\alpha_i)^T(y_i - K\alpha_i)}{\partial \sigma}
 = -2\sum_{i=1}^{q} (y_i - K\alpha_i)^T \left( \frac{\partial K}{\partial \sigma}\alpha_i + K\,\frac{\partial \alpha_i}{\partial \sigma} \right),$$

where $\frac{\partial K}{\partial \sigma} = \frac{1}{\sigma^3}\, K \circ D$, $\circ$ defines the Hadamard product of two matrices of the same

dimensions, i.e., $(A \circ B)_{ij} = A_{ij} B_{ij}$, with $A_{ij}$ denoting the $(i,j)$th entry of matrix $A$,

$D = [\|x_i - x_j\|^2]_{i,j=1,\ldots,n}$ is the matrix of pairwise sample distances, and

$\frac{\partial \alpha_i}{\partial \sigma} = \frac{\partial (K + \tau I_n)^{-1}}{\partial \sigma} y_i = -(K + \tau I_n)^{-1}\frac{\partial K}{\partial \sigma}(K + \tau I_n)^{-1} y_i = -(K + \tau I_n)^{-1}\frac{\partial K}{\partial \sigma}\alpha_i$. And,

$$\frac{\partial E_c}{\partial \sigma} = \sum_{l=1}^{q} \frac{\partial \left( \alpha_l^T M \alpha_l - q_l^T \alpha_l \right)}{\partial \sigma}
 = \frac{1}{2n}\sum_{i=1}^{n}\sum_{l=1}^{q} \left[ \frac{\partial (\alpha_l^T R_i \alpha_l)}{\partial \sigma} + \frac{\partial (k_i^T \alpha_l)}{\partial \sigma}\, p_i^T \alpha_l + (k_i^T \alpha_l - y_{il})\,\frac{\partial (p_i^T \alpha_l)}{\partial \sigma} \right],$$

where

$$\frac{\partial (\alpha_l^T R_i \alpha_l)}{\partial \sigma} = \frac{2\alpha_l^T W_i \left( \frac{\partial W_i^T}{\partial \sigma}\alpha_l + W_i^T \frac{\partial \alpha_l}{\partial \sigma} \right)}{\sigma^4} - \frac{4}{\sigma}\, \alpha_l^T R_i \alpha_l,$$

$\frac{\partial W_i}{\partial \sigma}$ is an $n \times p$ matrix whose $(m,j)$th entry is $\frac{\|x_m - x_i\|^2}{\sigma^3}\exp\left(-\frac{\|x_m - x_i\|^2}{2\sigma^2}\right)(x_{mj} - x_{ij})$, and

$$\frac{\partial (k_i^T \alpha_l)}{\partial \sigma} = \frac{\partial k_i^T}{\partial \sigma}\alpha_l + k_i^T \frac{\partial \alpha_l}{\partial \sigma}, \qquad
\frac{\partial (p_i^T \alpha_l)}{\partial \sigma} = \frac{\partial p_i^T}{\partial \sigma}\alpha_l + p_i^T \frac{\partial \alpha_l}{\partial \sigma},$$

where $\frac{\partial k_i}{\partial \sigma}$ is the $i$th column of $\frac{\partial K}{\partial \sigma}$, $\frac{\partial p_i}{\partial \sigma} = \sum_{j=1}^{p} \frac{\partial p_{ij}}{\partial \sigma}$, and $\frac{\partial p_{ij}}{\partial \sigma}$ is an $n \times 1$ vector whose

$m$th entry is $\frac{1}{\sigma^3}\exp\left(-\frac{\|x_m - x_i\|^2}{2\sigma^2}\right)\left[ \frac{\|x_m - x_i\|^2}{\sigma^2}\left( \frac{(x_{mj} - x_{ij})^2}{\sigma^2} - 1 \right) - \frac{4(x_{mj} - x_{ij})^2}{\sigma^2} + 2 \right]$.

Similarly, deriving with respect to the regularization parameter $\tau$ yields

$$\frac{\partial E_f}{\partial \tau} = -2\sum_{i=1}^{q} (y_i - K\alpha_i)^T K\, \frac{\partial \alpha_i}{\partial \tau},$$

where $\frac{\partial \alpha_i}{\partial \tau} = \frac{\partial (K + \tau I_n)^{-1}}{\partial \tau} y_i = -(K + \tau I_n)^{-1}(K + \tau I_n)^{-1} y_i = -(K + \tau I_n)^{-1}\alpha_i$. And,

$$\frac{\partial E_c}{\partial \tau} = \sum_{l=1}^{q} \frac{\partial \left( \alpha_l^T M \alpha_l - q_l^T \alpha_l \right)}{\partial \tau}
 = \sum_{l=1}^{q} \left( 2\alpha_l^T M\, \frac{\partial \alpha_l}{\partial \tau} - q_l^T \frac{\partial \alpha_l}{\partial \tau} \right).$$

When using the polynomial kernel, we cannot employ a gradient descent technique

for finding the optimal value of $d$, because it is discrete. Thus, we have to try

all possible discrete values of $d$ (within a given range) and select the degree yielding

the smallest error. The derivations of $E_f$ with respect to $\tau$ are the same for any

kernel, and $\frac{\partial E_c}{\partial \tau} = \sum_{l=1}^{q} \left( 2\alpha_l^T N\, \frac{\partial \alpha_l}{\partial \tau} - u_l^T \frac{\partial \alpha_l}{\partial \tau} \right)$.
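The analytic kernel gradient is easy to verify against finite differences. A short NumPy check of $\partial K/\partial\sigma = \frac{1}{\sigma^3} K \circ D$ (toy data and step size chosen for illustration) is:

import numpy as np

def rbf_K(X, sigma):
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-D / (2 * sigma ** 2)), D

rng = np.random.default_rng(0)
X, sigma, h = rng.normal(size=(5, 3)), 1.3, 1e-6
K, D = rbf_K(X, sigma)
dK_analytic = K * D / sigma ** 3
dK_numeric = (rbf_K(X, sigma + h)[0] - rbf_K(X, sigma - h)[0]) / (2 * h)
print(np.max(np.abs(dK_analytic - dK_numeric)))   # agreement to roughly 1e-9 or better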

3.4.2 Kernel Principal Component Regression

Solving an overdetermined set of equations is a general problem in pattern recog-

nition. The problem is well studied when there are no collinearities (i.e., close to

linear relationships among variables), but special algorithms are needed to deal with

them. Principal Component Regression (PCR) is a regression approach designed to

86
deal with collinearities in the exploratory variables. Instead of using the original pre-

dictor variables, a subset of principal components of these are selected. By deleting

the principal components with small variances, a more stable estimate of the coef-

ficient {wi }i=1,...,q can be obtained. In this way, the large variances of {wi }i=1,...,q ,

which were caused by multicollinearities, will be greatly reduced. More formally,

$$w_i = \sum_{j=1}^{m} \frac{1}{l_j}\, a_j a_j^T X y_i, \quad i = 1, \ldots, q, \qquad (3.34)$$

where $a_i$ is the eigenvector of the covariance matrix associated to the $i$th largest

eigenvalue $l_i$.

The above formulation can once again be calculated in the kernel space as

$$\alpha_i = \sum_{j=1}^{m} \frac{1}{l_j}\, v_j v_j^T y_i, \quad i = 1, \ldots, q, \qquad (3.35)$$

where $v_i$ is the eigenvector of the centered kernel matrix $\tilde{K}$ associated to the $i$th largest

eigenvalue $l_i$. This algorithm is known as Kernel Principal Component Regression

(KPCR).
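A NumPy sketch of (3.35), selecting the number of components through the variance fraction r described below, is given here; the function name, the centering step, and the eigenvalue cutoff are illustrative choices.

import numpy as np

def kpcr_fit(K, Y, r=0.95):
    """KPCR coefficients, eq. (3.35), keeping enough eigenpairs to explain a fraction r of the variance."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    K_c = J @ K @ J                                  # centered kernel matrix
    lam, V = np.linalg.eigh(K_c)
    lam, V = lam[::-1], V[:, ::-1]                   # eigenvalues in decreasing order
    keep = lam > 1e-10
    m = np.searchsorted(np.cumsum(lam[keep]) / lam[keep].sum(), r) + 1
    V_m, lam_m = V[:, :m], lam[:m]
    return V_m @ np.diag(1.0 / lam_m) @ V_m.T @ Y    # alpha, one column per output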

In KPCR, we need to optimize two parameters: the kernel parameter $\sigma$ and the

number of eigenvectors $m$ we want to keep. Since $m$ is discrete, the cost function

with respect to $m$ is non-differentiable, and testing all possible values of $m$ is compu-

tationally expensive, because the range of $m$ depends on the size of the training

set. Here, we present an alternative approach to select the optimal subset. The basic

idea is to use the percentage of the variance $r$ to determine the number of principal

components, $r = \sum_{i=1}^{m} l_i \,/\, \sum_{i=1}^{t} l_i$, where $t$ is the rank of $\tilde{K}$. Note that $r$ can change continuously

(from 0 to 1) and can thus be incorporated in a gradient descent framework.
Since KPCR differs from KRR only in the solution vectors $\{\alpha_i\}_{i=1,\ldots,q}$, we need to derive

$\frac{\partial \alpha_i}{\partial \sigma}$ and $\frac{\partial \alpha_i}{\partial r}$. The derivative with respect to $\sigma$ is given by

$$\frac{\partial \alpha_i}{\partial \sigma} = \sum_{j=1}^{m} \frac{\partial\, \frac{1}{l_j} v_j v_j^T}{\partial \sigma}\, y_i
 = \sum_{j=1}^{m} \left( -\frac{1}{l_j^2}\frac{\partial l_j}{\partial \sigma}\, v_j v_j^T + \frac{1}{l_j}\frac{\partial v_j}{\partial \sigma} v_j^T + \frac{1}{l_j} v_j \frac{\partial v_j^T}{\partial \sigma} \right) y_i,$$

where $\frac{\partial l_j}{\partial \sigma} = v_j^T \frac{\partial K}{\partial \sigma} v_j$, $\frac{\partial v_j}{\partial \sigma} = (K - l_j I_n)^{+}\frac{\partial K}{\partial \sigma} v_j$ [59], and $A^{+}$ is the pseudoinverse of

the matrix $A$.

The partial derivative with respect to $r$ cannot be given in closed form, because an explicit

definition of $\alpha_i$ as a function of $r$ does not exist. We resolve this issue by deriving an

approximation to $\frac{\partial \alpha_i}{\partial r}$ using a Taylor expansion. That is,

$$\alpha_i(r + \Delta r) = \alpha_i(r) + \Delta r\, \alpha_i'(r) + \frac{(\Delta r)^2}{2!}\alpha_i''(r) + \frac{(\Delta r)^3}{3!}\alpha_i'''(r) + O(\Delta r^4),$$
$$\alpha_i(r - \Delta r) = \alpha_i(r) - \Delta r\, \alpha_i'(r) + \frac{(\Delta r)^2}{2!}\alpha_i''(r) - \frac{(\Delta r)^3}{3!}\alpha_i'''(r) + O(\Delta r^4).$$

Combining the two equations above, we have

$$\alpha_i'(r) = \frac{\alpha_i(r + \Delta r) - \alpha_i(r - \Delta r)}{2\Delta r} + O(\Delta r^2).$$

Therefore, we can write

$$\frac{\partial \alpha_i}{\partial r} \approx \frac{\alpha_i(r + \Delta r) - \alpha_i(r - \Delta r)}{2\Delta r} = \frac{\sum_{j=m_1+1}^{m_2} \frac{1}{l_j} v_j v_j^T y_i}{2\Delta r},$$

where $m_1$ and $m_2$ are selected such that $\sum_{i=1}^{m_1} l_i / \sum_{i=1}^{t} l_i \le r - \Delta r < \sum_{i=1}^{m_1+1} l_i / \sum_{i=1}^{t} l_i$

and $\sum_{i=1}^{m_2} l_i / \sum_{i=1}^{t} l_i \le r + \Delta r < \sum_{i=1}^{m_2+1} l_i / \sum_{i=1}^{t} l_i$.
Table 3.1: Results for KRR. Mean RMSE and standard deviation (in parentheses).

Kernel RBF Polynomial


Data set/Method Modified ε-constraint ε-constraint CV GCV Modified ε-constraint ε-constraint CV GCV
Housing 2.89*(0.77) 3.01(0.78) 3.25(0.84) 4.01(1.01) 3.71(0.87) 4.38(0.99) 4.24(1.03) 8.67(6.78)
Mpg 2.51*(0.52) 2.59(0.57) 2.72(0.40) 2.61(0.52) 2.82(0.45) 3.25(0.58) 3.24(0.57) 3.21(0.80)
Slump 6.62*(1.49) 7.36(2.29) 6.70(1.53) 22.1(8.95) 7.09(1.22) 8.85(2.05) 9.86(1.53) 7.20(1.77)
Price 2.21*(0.90) 2.73(1.54) 2.42(0.90) 8.88(5.43) 3.08(1.20) 3.29(1.50) 4.01(1.48) 3.41(1.5)
Diabetes 0.55*(0.23) 0.72(0.33) 0.57(0.19) 0.88(0.31) 0.52*(0.17) 0.60(0.20) 2.31(0.87) 0.62(0.33)
Wdbc 31.46*(1.59) 32.15*(4.86) 31.50*(4.37) 50.30(9.13) 34.11(4.23) 35.12(5.21) 46.61(6.89) 32.04*(4.35)
Servo 0.51*(0.29) 0.56(0.30) 0.59(0.32) 0.81(0.52) 0.70(0.25) 0.70(0.25) 0.75(0.25) 0.65(0.27)
Puma-8nm 1.44*(0.02) 1.51(0.03) 2.42(0.05) 1.44*(0.03) 1.42*(0.02) 1.89(0.04) 1.89(0.04) 1.46(0.02)
Puma-8nh 3.65(0.03) 3.64(0.03) 3.98(0.06) 3.56*(0.04) 5.08(1.26) 5.28(0.19) 4.11(0.14) 3.61*(0.06)
Puma-8fm 1.13*(0.01) 1.19(0.02) 1.19(0.01) 1.14*(0.02) 1.27(0.01) 1.37(0.09) 1.29(0.005) 1.27(0.01)
Puma-8fh 3.23*(0.01) 3.45(0.02) 3.23*(0.01) 3.23*(0.01) 3.78(0.16) 4.86(0.14) 3.23(0.01) 3.24*(0.02)
Kin-8nm 0.11*(0.002) 0.15(0.003) 0.16(0.002) 0.19(0.02) 0.18(0.0008) 0.24(0.03) 0.22(0.002) 0.19(0.01)
Kin-8nh 0.18*(0.001) 0.18(0.002) 0.19(0.002) 0.18(0.002) 0.20(0.002) 0.29(0.006) 0.24(0.003) 0.22(0.003)
Kin-8fm 0.016(0.002) 0.016(0.03) 0.013*(0.0001) 0.339(0.202) 0.013*(0.0001) 0.02(0.0001) 0.16(0.003) 0.015(0.0001)
Kin-8fh 0.07(0.002) 0.061(0.002) 0.043*(0.0002) 0.043*(0.0002) 0.046(0.0002) 0.046(0.0002) 0.16(0.003) 0.050(0.0003)
In each kernel, the best result is in bold. The symbol * is used to indicate the top result
over all methods and kernels.

Table 3.2: Results for KPCR. Mean RMSE and standard deviation (in parentheses).

Kernel RBF Polynomial


Data set/Method Modified ε-constraint ε-constraint CV GCV Modified ε-constraint ε-constraint CV GCV
Housing 4.04*(0.88) 4.56(0.67) 9.14(1.10) 11.99(6.89) 8.45(1.72) 9.12(2.30) 6.05(0.95) 9.37(1.77)
Mpg 3.00*(0.58) 4.64(0.82) 7.71(0.90) 3.64(1.63) 7.30(0.81) 7.82(1.54) 5.92(1.00) 8.16(1.78)
Slump 6.39*(1.53) 7.55(1.68) 9.28(1.94) 7.64(1.42) 7.68(1.88) 8.15(2.11) 8.48(2.80) 9.49(3.00)
Price 3.90*(2.16) 4.67(2.15) 12.62(2.02) 9.78(2.98) 6.06(1.93) 6.27(2.29) 5.79(1.49) 6.61(1.61)
Diabetes 0.76*(0.33) 0.96(0.43) 0.99(0.53) 0.74*(0.34) 1.01(1.47) 0.73*(0.80) 1.85(1.92) 1.23(1.31)
Wdbc 30.66*(4.71) 35.32(5.87) 33.5(4.53) 43.53(7.05) 34.47(10.27) 56.68(7.71) 47.21(13.44) 41.13(14.89)
Servo 0.71*(0.30) 1.35(0.33) 1.41(0.34) 1.29(0.40) 1.13(0.25) 1.11(0.25) 0.74(0.24) 0.81(0.24)
Puma-8nm 3.69(0.02) 3.66(0.02) 2.42(0.05) 1.75*(0.07) 3.71(0.32) 4.12(0.25) 4.13(0.53) 4.15(0.70)
Puma-8nh 4.39(0.04) 4.39(0.02) 4.56(0.13) 3.65*(0.08) 4.58(0.29) 4.84(0.22) 4.56(0.16) 5.59(0.58)
Puma-8fm 1.28*(0.05) 1.73(0.77) 4.04(1.13) 1.26*(0.01) 1.29*(0.005) 1.46(0.36) 1.56(0.61) 1.81(0.44)
Puma-8fh 3.22*(0.01) 3.33*(0.28) 3.49(0.08) 3.26*(0.07) 3.75(0.24) 3.92(0.41) 3.99(0.39) 5.04(0.74)
Kin-8nm 0.19*(0.01) 0.19*(0.01) 0.22(0.02) 0.22(0.01) 0.22(0.04) 0.21(0.03) 0.26(0.05) 0.30(0.07)
Kin-8nh 0.21*(0.007) 0.21*(0.01) 0.23(0.01) 0.24(0.01) 0.25*(0.05) 0.30(0.09) 0.27(0.05) 0.33(0.07)
Kin-8fm 0.05(0.01) 0.06(0.04) 0.03(0.007) 0.04(0.01) 0.02*(0.0001) 0.05(0.08) 0.08(0.11) 0.10(0.13)
Kin-8fh 0.06*(0.01) 0.07(0.02) 0.05(0.006) 0.06*(0.01) 0.07*(0.07) 0.07*(0.07) 0.12(0.12) 0.12(0.12)

3.5 Experimental results

In this section, we will use the Pareto-optimal criterion derived in this chapter to

select the appropriate kernel parameters of KRR and KPCR. Comparisons with the

state of the art as well as the alternative criteria (i.e., sum and product) defined in

the preceding section are provided.

3.5.1 Standard data-sets

We select fifteen data-sets from the UCI machine learning databases [7] and

the DELVE collections [29]. Specifically, these databases include the following sets

(in parenthesis we show the number of samples/number of dimensions): Boston

housing (506/14), auto mpg (398/8), slump(103/8), price(159/16), diabetes(43/3),

wdbc(194/33), servo(167/5), puma-8nm (8192/9), puma-8nh (8192/9), puma-8fm

(8192/9), puma-8fh (8192/9), kin-8nm (8192/9), kin-8nh (8192/9), kin-8fm (8192/9)

and kin-8fh (8192/9). The Boston housing data-set was collected by the U.S. Census

Service and describes the housing information in Boston, MA. The task is to predict

the median value of a home. The auto mpg set details fuel consumption predicted

in terms of 3 discrete and 4 continuous attributes. In the slump set, the concrete

slump is predicted by 7 different ingredients. The price data-set requires predicting

the price of a car based on 15 attributes. In the diabetes set, the goal is to predict the

level of the serum C-peptide. In the Wisconsin Diagnostic Breast Cancer (wdbc) set,

the time of the recurrence of breast cancer is predicted based on 32 measurements

of the patients. The servo set concerns a robot control problem. The rise time of a

servomechanism is predicted based on two gain settings and two choices of mechan-

ical linkages. The task in the Pumadyn sets is to predict angular acceleration from a

simulation of the dynamics of a robot arm. And, the Kin set requires us to predict

the distance of the end-effector from a target in a simulation of the forward dynamics

of an 8 link all-revolute robot arm. There are different scenarios in both Pumadyn

and Kin data-sets according to the degree of non-linearity (fairly-linear or nonlinear)

and the amount of noise (moderate or high).

To test our approach, for each data-set, we generate five random permutations and

conduct 10-fold cross-validation on each one. The mean and the standard deviations

are reported. In the experiments, we use the root mean squared error (RMSE) as our

measure of the deviation between the true response $y_i$ and the predicted response $\hat{y}_i$,

i.e., $\mathrm{RMSE} = \left[ n^{-1}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \right]^{1/2}$.

When using the ε-constraint criterion, we employ the interior-point method of

[47]. Recall that in our proposed modified ε-constraint criterion, we also need to

select a small scalar $s$. In all our experiments, $s = 10^{-3}$.

We compare our approaches to the two typical criteria used in the literature, Cross-

Validation (CV) and Generalized Cross-Validation (GCV) [37, 96]. In particular, we

employ a 10-fold CV. The kernel parameter $\sigma$ of the RBF is searched in the range

$[\mu - 2\delta, \mu + 2\delta]$, where $\mu$ and $\delta$ are the mean and standard deviation of the distances

between all pairwise training samples. In the polynomial kernel, the degree is tested in

the range of 1 to 6. The regularization parameter $\tau$ in KRR is selected among the set

$\{10^{-5}, \ldots, 10^{4}\}$, and the percentage of variance $r$ in KPCR is searched in the range

$[0.8, 1]$. Moreover, we compare our modified ε-constraint approach with the original

ε-constraint method.

Table 3.1 shows the regression results of KRR using both the RBF and the poly-

nomial kernels. A two-sided paired Wilcoxon signed rank test is used to check sta-

tistical significance. The error in bold is significantly smaller than the others at

significance level 0.05. We see that regardless of the kernel used, the proposed mod-

ified ε-constraint approach consistently provides the smallest RMSE. We also note

that the modified ε-constraint approach obtains smaller RMSE than the ε-constraint

method.

Table 3.2 shows the regression results of KPCR using the RBF and polynomial

kernels. Once more, the proposed approach generally outperforms the others. Ad-

ditionally, as in KRR, the modified ε-constraint approach generally yields the best

results.

A major advantage of the proposed approach over CV is that it uses all the

training data for training. In contrast, CV needs to use part of the training data for

verification purposes. This limits the amount of training data used to fit the function

to the data.

3.5.2 Comparison with the state of the art

We now provide a comparison with the methods available in the literature and

typically employed in the above databases. Specifically, we compare our results with

Support Vector Regression (SVR) [93] with the RBF and polynomial kernels, Multiple

Kernel Learning in SVR (MKL-SVR) [76], and Gaussian Processes for Regression

(GPR) [104]. In SVR, the parameters are selected using CV. In MKL-SVR, we

employ three kernel functions: the RBF, the polynomial and the Laplacian, defined as

$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|}{\beta}\right)$. The RBF kernel parameter is set to be the mean of the

Table 3.3: Mean and standard deviation of RMSE of different methods.
Data set/Method Modified ε-constraint SVR (RBF) SVR (poly) MKL-SVR GPR
Housing 2.89(0.77) 3.45(1.04) 5.66(1.88) 3.34(0.70) 3.05(0.82)
Mpg 2.51(0.52) 2.69(0.60) 4.03(0.96) 2.67(0.61) 2.64(0.50)
Slump 6.62(1.49) 6.77(1.90) 8.37(2.86) 6.90(1.41) 6.88(1.51)
Price 2.21(0.90) 2.40(0.84) 3.72(1.55) 2.51(0.91) 11.2(2.26)
Diabetes 0.55(0.23) 0.68(0.31) 0.78(0.39) 0.65(0.35) 0.59(0.20)
Wdbc 31.46(1.59) 32.08(4.76) 44.1(9.87) 32.20(4.65) 31.60(4.3)
Servo 0.51(0.29) 0.61(0.35) 1.37(0.41) 0.60(0.36) 0.57(0.30)
Puma-8nm 1.44(0.02) 1.44(0.03) 3.35(0.11) 1.51(0.02) 1.47(0.03)
Puma-8nh 3.65(0.03) 3.67(0.06) 4.55(0.07) 3.78(0.05) 3.65(0.03)
Puma-8fm 1.13(0.01) 1.17(0.02) 2.04(0.05) 1.21(0.03) 1.17(0.02)
Puma-8fh 3.23(0.01) 3.24(0.02) 3.84(0.06) 3.35(0.05) 3.23(0.01)
Kin-8nm 0.11(0.002) 0.12(0.002) 0.21(0.003) 0.16(0.03) 0.12(0.002)
Kin-8nh 0.18(0.001) 0.19(0.003) 0.23(0.01) 0.20(0.002) 0.18(0.002)
Kin-8fm 0.016(0.002) 0.043(0.002) 0.048(0.001) 0.045(0.002) 0.013(0.00009)
Kin-8fh 0.07(0.002) 0.047(0.0009) 0.06(0.006) 0.05(0.001) 0.043(0.0007)

Table 3.4: Comparison of our results with the state of the art.
Housing Mpg Slump Price Diabetes servo Puma-8nm
Best 3.46(0.93) 2.67(0.61) 6.79(1.89) 2.62(0.87) 0.68(0.25) 0.59(0.30) 1.47(0.03)
Ours 2.89(0.77) 2.51(0.50) 6.62(1.49) 2.21(0.90) 0.55(0.23) 0.51(0.29) 1.44(0.02)
Puma-8nh Puma-8fm Puma-8fh Kin-8nm Kin-8nh Kin-8fm Kin-8fh
Best 3.65(0.03) 1.17(0.02) 3.23(0.01) 0.12(0.002) 0.18(0.002) 0.013(0.00009) 0.043(0.0007)
Ours 3.65(0.03) 1.13(0.01) 3.23(0.01) 0.11(0.002) 0.18(0.002) 0.016(0.002) 0.07(0.002)

distances between all pairwise training samples; the degree of the polynomial kernel is
set to 2; and $\beta$ in the Laplacian kernel is set as $\beta = \frac{2}{n(n+1)}\sum_{i=1}^{n}\sum_{j=i}^{n} \|x_i - x_j\|$, where

n is the number of training samples. MOSEK [3] is used to solve the quadratically

constrained programming problems to get the combinational coefficients of the kernel

matrices. In GPR, the hyperparameters of the mean and covariance functions are

determined by minimizing the negative log marginal likelihood of the data.
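For reference, the three base kernel matrices used in the MKL-SVR comparison can be formed as follows; this is a sketch of the kernel construction only, with β written explicitly, and the MOSEK-based combination step is not shown.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def mkl_base_kernels(X):
    """RBF, polynomial (degree 2), and Laplacian base kernels with the settings described above."""
    d = squareform(pdist(X))                    # pairwise Euclidean distances
    beta = d[np.triu_indices_from(d)].mean()    # 2/(n(n+1)) * sum_{i} sum_{j>=i} ||x_i - x_j||
    K_rbf = np.exp(-d ** 2 / (2 * beta ** 2))   # RBF width set to the mean pairwise distance
    K_poly = (X @ X.T + 1) ** 2
    K_lap = np.exp(-d / beta)
    return K_rbf, K_poly, K_lap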

We compare the results given by the above algorithms with those obtained by our

approach applied to KRR and using the RBF kernel, because this method tends to

Table 3.5: Regression performance with alternative optimization criteria.
Method KRR (RBF) KRR (polynomial) KPCR (RBF) KPCR (polynomial)
Data set Ours Sum Product Ours Sum Product Ours Sum Product Ours Sum Product
Housing 2.89(0.77) 3.06(0.78) 3.30(0.85) 3.71(0.87) 4.75(0.89) 4.66(1.75) 4.04(0.88) 9.46(5.73) 6.48(3.85) 8.45(1.72) 5.56(6.77) 4.97(3.98)
Mpg 2.51(0.52) 2.63(0.50) 2.64(0.49) 2.82(0.45) 4.34(2.51) 18.04(2.59) 3.00(0.58) 4.56(0.83) 4.25(0.69) 7.30(0.81) 5.48(6.04) 4.29(3.21)
Slump 6.62(1.49) 6.87(1.51) 6.85(1.51) 7.09(1.22) 8.03(1.78) 13.17(2.84) 6.39(1.53) 7.65(1.86) 7.64(1.87) 7.68(1.88) 14.70(9.04) 16.79(9.10)
Price 2.21(0.90) 2.73(1.45) 2.76(1.46) 3.08(1.20) 2.72(1.11) 3.10(1.11) 3.90(2.16) 4.17(2.85) 4.17(2.86) 6.06(1.93) 8.76(1.99) 13.43(20.9)
Diabetes 0.55(0.23) 0.66(0.29) 0.75(0.33) 0.52(0.17) 0.60(0.21) 3.47(2.15) 0.76(0.33) 0.86(0.45) 0.86(0.43) 1.01(1.47) 1.09(1.56) 0.65(0.22)
Wdbc 31.46(1.59) 47.99(7.01) 48.60(8.98) 34.11(4.23) 51.31(16.98) 64.29(73.02) 30.66(4.71) 34.01(4.86) 38.91(5.31) 34.47(10.27) 56.61(19.90) 56.58(32.64)
Servo 0.51(0.29) 0.60(0.31) 0.57(0.30) 0.70(0.25) 0.94(0.36) 1.33(0.40) 0.71(0.30) 0.83(0.50) 0.86(0.49) 1.13(0.25) 0.70(0.27) 0.91(0.33)
Puma-8nm 1.44(0.02) 1.50(0.03) 1.50(0.03) 1.42(0.02) 3.40(0.59) 3.89(0.04) 3.69(0.02) 2.25(0.38) 2.37(0.53) 3.71(0.32) 8.50(0.87) 6.96(3.00)
Puma-8nh 3.65(0.03) 3.80(0.03) 3.86(0.04) 5.08(1.26) 4.36(0.29) 4.54(0.03) 4.39(0.04) 3.90(0.64) 3.97(0.65) 4.58(0.29) 13.82(3.81) 11.27(5.41)
Puma-8fm 1.13(0.01) 1.18(0.01) 1.18(0.01) 1.27(0.01) 1.52(0.48) 2.79(0.12) 1.28(0.05) 2.58(0.97) 2.09(0.84) 1.29(0.005) 6.98(1.23) 3.88(2.81)
Puma-8fh 3.23(0.01) 3.28(0.02) 3.28(0.02) 3.78(0.16) 3.81(0.38) 3.79(0.08) 3.22(0.01) 3.40(0.38) 3.30(0.12) 3.75(0.24) 9.78(5.64) 7.82(5.44)
Kin-8nm 0.11(0.002) 0.12(0.008) 0.13(0.01) 0.18(0.0008) 0.21(0.02) 0.68(0.35) 0.19(0.01) 0.21(0.02) 0.22(0.02) 0.22(0.04) 0.27(0.24) 0.26(0.25)
Kin-8nh 0.18(0.001) 0.19(0.002) 0.19(0.002) 0.20(0.002) 0.22(0.005) 0.42(0.27) 0.21(0.007) 0.22(0.002) 0.23(0.01) 0.25(0.005) 0.50(0.47) 0.29(0.01)
Kin-8fm 0.016(0.002) 0.020(0.0005) 0.020(0.0005) 0.013(0.0001) 0.020(0.0001) 0.57(0.22) 0.05(0.01) 0.07(0.03) 0.06(0.01) 0.02(0.0001) 0.11(0.29) 0.03(0.01)
Kin-8fh 0.07(0.002) 0.05(0.0007) 0.046(0.0005) 0.046(0.0002) 0.05(0.0001) 0.75(0.30) 0.06(0.01) 0.08(0.03) 0.08(0.02) 0.07(0.07) 0.13(0.25) 0.06(0.02)

yield more favorable results. The comparisons are shown in Table 3.3. Note that our

approach generally yields smaller RMSE.

Furthermore, for each of the data-sets described above, we provide a comparison

between our results and the best results found in the literature. For the Boston

housing data-set, [91] reports the best fits with Relevance Vector Machine (RVM);

for the Auto mpg data-set, the best result is obtained by MKL-SVR [76]; for the

Slump data, [22] proposes a k nearest neighbor based regression method and shows

its superiority over others; for the price data-set, [100] reports the best result with

pace regression; the Diabetes data-set is used in [24] and the best result is obtained

using Least Angle Regression; for the servo data-set, [26] shows that regression with

random forests gets best results; and for the last eight data-sets, Gaussian processes

for regression trained with a maximum-a-posteriori approach is generally considered

to provide state of the art results [103]. The comparison across all the data-sets is

given in Table 3.4. We see that our approaches provide better or comparable results

to the top results described in the literature but with the main advantage that a

single algorithm is employed in all data-sets.

Table 3.6: Comparison with L2 norm.
Method KRRR KRRP PCRR PCRP
Data set Ours L2 norm Ours L2 norm Ours L2 norm Ours L2 norm
Housing 2.89(0.77) 3.45(0.95) 3.71(0.87) 4.96(0.92) 4.04(0.88) 4.36(0.96) 8.45(1.72) 7.40(1.72)
Mpg 2.51(0.52) 3.09(0.51) 2.82(0.45) 4.19(2.23) 3.00(0.58) 3.45(0.75) 7.30(0.81) 7.42(1.29)
Slump 6.62(1.49) 6.98(1.48) 7.09(1.22) 14.97(2.23) 6.39(1.53) 6.43(1.47) 7.68(1.88) 8.12(2.08)
Price 2.21(0.90) 2.81(1.21) 3.08(1.20) 2.45(3.77) 2.35(1.04) 2.73(1.31) 6.06(1.93) 5.88(1.73)
Diabetes 0.55(0.23) 0.68(0.25) 0.52(0.17) 0.78(0.20) 0.76(0.33) 0.87(0.43) 1.01(1.47) 0.94(1.40)
Wdbc 31.46(1.59) 32.10(4.56) 34.11(4.23) 42.69(13.41) 30.66(4.71) 30.69(4.66) 34.47(10.27) 45.79(15.69)
Servo 0.51(0.29) 0.90(0.31) 0.70(0.25) 0.96(0.34) 0.71(0.30) 0.73(0.31) 1.13(0.25) 1.03(0.25)
Puma-8nm 1.44(0.02) 1.47(0.03) 1.42(0.02) 3.84(0.04) 3.69(0.02) 3.37(0.04) 3.71(0.32) 4.21(0.16)
Puma-8nh 3.65(0.03) 3.75(0.03) 5.08(1.26) 4.66(0.06) 4.39(0.04) 4.19(0.14) 4.58(0.29) 4.61(0.31)
Puma-8fm 1.13(0.01) 1.23(0.01) 1.27(0.01) 1.63(0.49) 1.28(0.05) 1.26(0.003) 1.29(0.005) 1.58(0.64)
Puma-8fh 3.23(0.01) 3.23(0.01) 3.78(0.16) 4.06(0.03) 3.22(0.01) 3.30(0.12) 3.75(0.24) 3.97(0.52)
Kin-8nm 0.11(0.002) 0.17(0.001) 0.18(0.0008) 0.21(0.03) 0.19(0.01) 0.16(0.03) 0.22(0.04) 0.22(0.03)
Kin-8nh 0.18(0.001) 0.20(0.001) 0.20(0.002) 0.26(0.007) 0.21(0.007) 0.21(0.002) 0.25(0.005) 0.29(0.09)
Kin-8fm 0.016(0.002) 0.020(0.0003) 0.013(0.0001) 0.024(0.0005) 0.05(0.01) 0.03(0.003) 0.02(0.0001) 0.05(0.08)
Kin-8fh 0.07(0.002) 0.06(0.0007) 0.046(0.0002) 0.067(0.0005) 0.06(0.01) 0.06(0.004) 0.07(0.07) 0.05(0.003)

3.5.3 Alternative Optimizations

In Section 3.3.4, we presented two alternatives for combining different objective

functions: the sum and the product criteria. Here we provide a comparison of these

criteria and the approach derived in this chapter. In particular, we combine model

fit Ef and model complexity Ec via the summation and product in KRR and KPCR.

The regularization term in (3.29) and in (3.31) is selected by 5-fold CV. Table

3.5 shows the corresponding regression results. In this table, A_R and A_P denote the method A with an RBF and a polynomial kernel, respectively. We see that these two

alternative criteria generally perform worse than the Pareto-optimal based approach.

3.5.4 Comparison with the L2 norm

We give a comparison between our complexity measure Ec and the commonly used

L2 norm. The results are shown in Table 3.6. We see that the proposed complexity

measure generally outperforms the L2 norm in penalizing the regression function.

3.5.5 Age estimation

In the last two sections we want to test the derived approach on two classical

applications: age estimation from faces and weather prediction.

The process of aging can cause significant changes in human facial appearances.

We used the FG-NET aging database described in [2] to model these changes. This

data-set contains 1,002 face images of 82 subjects at different ages. The age ranges

from 0 to 69. Face images include changes in illumination, pose, expression and

occlusion (e.g., glasses and beards). We warp all images to a standard size and

constant position for mouth and eyes as in [60]. All the pictures are warped to a

common size of 60 × 60 pixels and converted to 8-bit graylevel images. Warped

images of one individual are shown in Figure 3.4. We represent each image as a

vector concatenating all the pixels of the image, i.e., the appearance-based feature

representation.

We generate five random divisions of the data, each with 800 images for training

and 202 for testing. The mean absolute errors (MAE) are in Table 3.7. We can see

that the modified ε-constraint method outperforms the other algorithms. In [115],

the authors represent the images using a set of highly redundant Haar-like features

and select relevant features using a boosting method. We implemented this method

using the same five divisions of the data. Our approach is slightly better using a

simpler appearance-based representation.

Figure 3.4: Sample images showing the same person at different ages.

Table 3.7: MAE of the proposed approach and the state of the art in age estimation.
Modified ε-constraint CV GCV SVRrbf SVRpol MKL-SVR GPR [115]
MAE 5.85 6.59 13.83 6.46 6.95 7.18 15.46 5.97

3.5.6 Weather prediction

The weather data of the University of Cambridge [102] is used in this experiment.

The maximum temperature of a day is predicted based on several parameters mea-

sured every hour during the day. These parameters include pressure, humidity, dew

point (i.e., the temperature at which a parcel of humid air must be cooled for it to

condense), wind speed (in knots), sunshine hours and rainfall. We use the data in a period of

five years (2005-2009) for training and the data between January and July of the year

2010 for testing. This corresponds to 1,701 training samples and 210 testing samples.

The results are in Table 3.8. In [77], the authors employed support vector regression

and reported state of the art results. Our experiment shows that our approach performs

better than their algorithm. The predictions obtained from the modified ε-constraint

Table 3.8: RMSE of several approaches applied to weather prediction.
Modified ε-constraint CV GCV SVRrbf SVRpol MKL-SVR GPR
RMSE 0.81 0.83 0.90 0.87 0.95 1.07 2.53

approach are also plotted in Figure 3.5. We observe that our approach can provide

the prediction of the daily maximum temperature with high accuracy.

3.6 Conclusions

Non-linear regression is a fundamental problem in machine learning and pattern

recognition with multiple applications in science and engineering. Many approaches

have been proposed for linear regressions, but their non-linear extensions are known

to present several limitations. A major limitation is the lack of regularization of the

regressor. Without proper regularization, the complexity of the estimated function

(e.g., the degree of the polynomial describing the function) could increase rapidly,

yielding poor generalizations on the unseen testing set [74]. To resolve this prob-

lem, we have derived a roughness penalty that measures the degree of change (of the

regressed function) in the kernel space. This measure can then be used to obtain esti-

mates that (in general) generalize better to the unseen testing set. However, to achieve

this, the newly derived objective function needs to be combined with the classical one

measuring its fitness (i.e., how well the function estimates the sample vectors). Clas-

sical solutions would be to use the sum or product of the two objective functions [113].

However, we have shown that these solutions do not generally yield desirable results

in kernel methods in regression. To resolve this issue, we have proposed a multiobjective

[Plot: true and predicted daily maximum temperature (y-axis: Max temperature, x-axis: Days).]

Figure 3.5: This figure plots the estimated (lighter dashed curve) and actual (darker dashed
curve) maximum daily temperature for a period of more than 200 days. The estimated
results are given by the algorithm proposed in this chapter.

optimization approach based on the idea of Pareto-optimality. In this MOP frame-

work, we have derived a novel method: the modified ε-constraint approach. While

the original ε-constraint method cannot guarantee Pareto-optimal solutions, we have

proven that the derived modified version does. Extensive evaluations with a large

variety of databases have shown that the proposed modified ε-constraint approach

yields better generalizations than previously proposed algorithms.

The other major contribution of the chapter has been to show how we can use the

derived approach for optimizing the kernel parameters. In any kernel method, one

always has to optimize the parameters of the kernel mapping function. The classical

approach for this task is CV. This technique suffers from two main problems. First,

it is computationally expensive. Second, and arguably most important, it cannot use

the entire sample set for training, because part of it is employed as a validation set.

But, we know that (in general) the larger the training set, the better. Our proposed

MOP framework is ideal for optimizing the kernel parameters, because it yields differentiable

objective functions that can be minimized with standard gradient descent techniques.

We have provided extensive comparisons of the proposed approach against CV

and GCV and the other state of the art techniques in kernel methods in regression.

We have also compared our results to those obtained with the sum and product

criteria. And, we have compared our results to the best fits found in the literature for

each of the databases. In all cases, these comparisons demonstrate that the proposed

approach yields fits that generalize better to the unseen testing sets.

CHAPTER 4

LOCAL DENSITY ADAPTIVE KERNELS

4.1 Introduction

The performance of the kernel methods greatly depends on the selection of the

kernel functions. An appropriate kernel function can lead to a substantial improve-

ment in the generalization ability of the learning approaches [69, 105, 10, 19]. Ideally,

the choice of the kernel function is based on the prior knowledge of the problem do-

main. Unfortunately, in general, we do not have prior knowledge on the data, and

thus have no clue on which kernel to use.

One of the most commonly used kernels in the literature is the Radial Basis Function (RBF), defined as k(xi, xj) = exp(-||xi - xj||²/σ), where σ is a kernel parameter.

In this kernel, data sample evaluation is equivalent to the likelihood calculation based

on Parzen windows [73, 25], which is a non-parametric density estimator. The Parzen

window size (i.e., the kernel parameter σ) significantly affects the algorithm's perfor-

mance. This parameter controls the size of the neighborhood centered at the point

that is being evaluated. Estimates with too large a σ will suffer from oversmoothing (where the real underlying structure is obscured), while too small a σ will lead to a

wiggly estimate (which has too much statistical variability). It is important to note

that an important assumption associated with the use of a fixed σ is that the same


Figure 4.1: A two class example. Each class is represented by a mixture of two Gaussians
with different covariance matrices. The RBF and the proposed Local-density Adaptive (LA)
kernels are evaluated at the four marked points. (a) Density estimation in the RBF
kernel uses a fixed window, illustrated by black circles. Note that this fixed window cannot
capture different local densities. (b) Density estimation with the proposed LA kernel.

Gaussian distribution is imposed on the neighborhood of every data sample. This

means that the use of a fixed-shape kernel is only reasonable for evenly distributed

data.

However, in practice, the data is usually drawn from a complex distribution where

the local regions have distinct densities. In such cases, a kernel with a fixed shape

such as the RBF kernel will not perform well because it cannot adapt to local changes.

This problem is illustrated in Figure 4.1(a). In this figure, we see that the RBF kernel

parameter σ would fit some local regions well, but would not be appropriate for other

local regions with distinct densities. In these cases, the well-known overfitting and

underfitting problems [42] occur.

A solution to this problem is to vary the kernel bandwidth of the Parzen density

estimate based on local densities. Some approaches have been proposed in the density

estimation literature. One well-known method is the k-nearest neighbor estimate [56],

where the density is estimated by varying the window size to accommodate k-nearest

samples of a given feature vector.

A related class of approaches is called adaptive kernel estimate [9, 89, 48] which

explicitly modifies the window size according to the local data distributions. These

approaches have been shown to provide improved performance in density estimation.

However, these methods cannot be directly used in most kernel-based approaches

for classification, because the resulting kernel is not guaranteed to be a Mercer kernel

[84], i.e., the corresponding kernel matrix is not positive semi-definite. This will

indeed lead to several significant problems. First, a kernel function which is not

positive semi-definite will not induce a reproducing kernel Hilbert space [84]. If the

inner product is not well defined, then the kernel trick cannot be used. Second, in

Support Vector Machines (SVM) [92], the geometric interpretation (i.e., maximizing

the margin) is only available in the case of positive semi-definite and conditionally

positive semi-definite functions [82]. Also, in such cases, the solution is unique since

the optimization problem in SVM is convex.

This chapter proposes a new class of kernels called Local-density Adaptive (LA)

kernels, which are guaranteed to be Mercer kernels. Thus, our kernels can be directly

used in any kernel-based approaches such as Kernel Discriminant Analysis (KDA) [67,

5], Kernel Principal Component Analysis (KPCA) [83, 71] and Kernel SVM (KSVM)

for nonlinear feature extraction and classification. The similarity of the pairwise

samples defined by LA kernels is constrained by the local density information, which

is calculated based on a weighted local variance measure. Thus, our kernels can

adaptively fit the local shape of the data while evaluating the sample similarities.

4.2 Local Density Adaptive Kernels

4.2.1 Motivation

When a kernel-based approach is employed in learning, a specific kernel function

must be selected. The Radial Basis Function (RBF) kernel is a popular choice. The

kernel parameter σ in this kernel is fixed for the entire data. Instead of using a single σ for the estimate, it is also possible to represent the distribution using a

diagonal matrix with each diagonal entry measuring the variance of each dimension, i.e., k(xi, xj) = exp(-Σ_{l=1}^{p} (xli - xlj)²/σl), where xli is the l-th dimension of sample xi and p

is the dimension of the input space. Alternatively, we can use a full covariance matrix

M, i.e., k(xi, xj) = exp(-δ(xi - xj)^T M^{-1}(xi - xj)), where δ is a scaling parameter.

This is known as the Mahalanobis kernel [1].

It is important to note that the evaluation in the above kernels assumes the data

is Gaussian distributed with fixed variance over the entire feature space. The key

idea of this chapter is to build a kernel which can automatically vary its shape (i.e.,

local variance) to adapt to local data structures.

A possible approach would be to adopt the local covariance matrix, which char-

acterizes the local structure of the data. Thus, a possible kernel function can be

formally given as,

k(xi, xj) = exp(-(xi - xj)^T Σij^{-1} (xi - xj)),   (4.1)

where Σij = (Σi + Σj)/2, and Σi and Σj are the local covariance matrices centered on the samples xi and xj, respectively. Σij is a pooled covariance matrix which

characterizes the local density information in the neighborhoods of xi and xj . The

estimation of a local covariance matrix Σi centered on xi can be obtained from the

k-nearest neighbors of xi .

Eq. (4.1) seems a reasonable kernel function, since the likelihood calculation is

now given by the local distribution. However, this function is not a Mercer kernel.

Note that if a kernel function k(xi , xj ) is a Mercer kernel, there exists a mapping

function φ(·) : R^p → F such that k(xi, xj) = φ(xi)^T φ(xj). The kernel function in

(4.1) can be rewritten as

k(xi, xj) = exp(-(xi - xj)^T Aij^T Aij (xi - xj))
         = exp(-(Aij xi - Aij xj)^T (Aij xi - Aij xj))
         = exp(-(zi - zj)^T (zi - zj)),   (4.2)

where Σij^{-1} = Aij^T Aij and zi = Aij xi. Since (4.2) is an RBF kernel w.r.t. z, there exists a mapping function φ(·) : R^p → F and

k(xi, xj) = φ(zi)^T φ(zj)
         = φ(Aij xi)^T φ(Aij xj)
         = φij(xi)^T φij(xj),   (4.3)

where φij(x) = φ(Aij x). Since φij(·) is dependent on the samples in the input space, there does not exist a unique mapping φ for the kernel function in (4.1). This implies that (4.1) is not a Mercer kernel.

4.2.2 Defining Mercer kernels

Our goal is to derive a Mercer kernel which calculates the likelihood from the

density estimation. Such a kernel k(xi , xj ) can be designed as a multiplication of two

kernel functions k1 (xi , xj ) and k2 (xi , xj ), i.e.,

k(xi , xj ) = k1 (xi , xj )k2 (xi , xj ). (4.4)

If k1 and k2 are both Mercer kernels, then k is also a Mercer kernel [84]. k1 can be

selected to be a likelihood evaluation kernel, such as the RBF or Mahalanobis kernel.

Then we need to build k2 , which measures the local density. To derive k2 , let us start

by presenting an important result.

Theorem 8. A kernel function κ(xi, xj) = q(xi)q(xj) is a Mercer kernel, if q(x) is

a non-negative function on x.

Proof. Let q = (q(x1), q(x2), . . . , q(xn))^T be an n × 1 vector, with n the number of samples. Then the kernel matrix K can be written as K = qq^T. Thus, for any α ∈ R^n,

α^T K α = α^T q q^T α = (α^T q)² ≥ 0.

This means that the kernel matrix K is positive semi-definite. And, hence, the kernel κ(xi, xj) is a Mercer kernel.

We could thus define k2 as k2(xi, xj) = ρ(xi)ρ(xj), where ρ(x) ≥ 0 for all x. ρ(x)

should be designed to reflect the density information in the neighborhood of x. One

way to achieve this is to measure the variance of the data in the neighborhood of x.

Formally,

ρ(x) = (1/k) Σ_{i=1}^{k} ||xi - x||².   (4.5)

Eq. (4.5) measures the local variance in the neighborhood of x, characterized by

the k-nearest neighbors of x. This means the local variance information is calculated

106
only from the k samples which are closest to x; the influence of the other samples is not considered. More generally, ρ(x) can be defined as

ρ(x) = Σ_{i=1}^{n} hx(xi) ||xi - x||² / Σ_{i=1}^{n} hx(xi),   (4.6)

where hx (xi ) is a weighting function (i.e., a kernel) which depends on x. We note

that (4.5) is a special case of (4.6). To see this, we first denote Nk (x) as the set of

samples that are the k nearest neighbors of x. Then, a uniform kernel hx (.) is defined

as

hx(xi) = 1/m if xi ∈ Nk(x), and 0 otherwise,
where m is a normalizing factor that ensures the kernel integrates to 1. This makes

(4.6) equivalent to (4.5).

Alternatively, we can incorporate the influences of all the samples in the input

space, as the soft neighborhood used in kernel regression [70, 42]. The weight of

each sample xi is calculated based on its distance from x. In this chapter, we adopt the Gaussian kernel, hx(xi) = (1/(√(2π) η)) exp(-||xi - x||²/(2η²)), where η is a scaling parameter.

Therefore, our local variance measure for sample xi is formally defined as

ρ(xi) = [ Σ_{j=1, j≠i}^{n} exp(-||xj - xi||²/(2η²)) ||xj - xi||² ] / [ Σ_{j=1, j≠i}^{n} exp(-||xj - xi||²/(2η²)) ]
      = [ Σ_{j=1}^{n} exp(-||xj - xi||²/(2η²)) ||xj - xi||² ] / [ Σ_{j=1}^{n} exp(-||xj - xi||²/(2η²)) ].   (4.7)

Note that (4.7) can be rewritten as ρ(xi) = tr(Σxi), where tr(·) is the trace of a matrix, and Σxi is the local covariance matrix

Σxi = [ Σ_{j=1, j≠i}^{n} exp(-||xj - xi||²/(2η²)) (xj - xi)(xj - xi)^T ] / [ Σ_{j=1, j≠i}^{n} exp(-||xj - xi||²/(2η²)) ].

The equation above shows the relationship of (4.7) and the local covariance ma-

trices, which encode the information of local distributions. To demonstrate that (4.7)


Figure 4.2: This figure illustrates how the local variance measurement given by (4.7) is
used. The vertical axis represents the magnitude of the variance around each sample.

can appropriately measure the local density information, we calculate the local vari-

ances of the data in Figure 4.1 using (4.7). The results are shown in Figure 4.2. The

vertical axis represents the local variance around each sample. We see that this local variance

measure effectively captures the local density information. The local variances are

smaller for the samples in the high density regions, and larger for the samples in the

low density regions.

It now seems that (4.7) can be readily used in our LA kernel approach. However,

a limitation of (4.7) is that it is dependent on the scale of the data, since it is related

to the distances of pairwise samples. For instance, if we apply (4.7) to a large-scale

data-set, the resulting kernel matrix could have very large values in each entry, which

would lead to numerical problems. Thus, an appropriate normalization procedure

should be added. One way to solve this is to normalize (4.7) with the average of the

local variances about each sample, i.e.,

ρs(xi) = ρ(xi) / ( (1/n) Σ_{i=1}^{n} ρ(xi) ),   (4.8)

where ρs(xi) is the scale-free local variance measure.

Combining the above results, we can define our proposed LA kernel function

k(xi , xj ) as

k(xi, xj) = ρs(xi) k1(xi, xj) ρs(xj).   (4.9)

Recall that k1 (xi , xj ) can be any likelihood evaluation kernel function with a fixed

shape such as the RBF kernel or the Mahalanobis kernel.
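To make the construction concrete, the following Python sketch (using numpy; the function and variable names are ours, and the base kernel and its parameter are illustrative choices) computes the LA kernel matrix of (4.9) on a training set, using the soft-neighborhood local variance of (4.7), its normalization (4.8), and the RBF as k1:

import numpy as np

def la_kernel(X, sigma, eta):
    # X: n x d matrix of training samples.
    sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # ||xi - xj||^2
    # Local variance of Eq. (4.7): Gaussian-weighted average of the
    # squared distances around each sample.
    w = np.exp(-sqdist / (2.0 * eta ** 2))
    rho = (w * sqdist).sum(axis=1) / w.sum(axis=1)
    # Scale-free measure of Eq. (4.8).
    rho_s = rho / rho.mean()
    # Base kernel k1: here the RBF kernel with parameter sigma.
    k1 = np.exp(-sqdist / sigma)
    # LA kernel of Eq. (4.9): k(xi, xj) = rho_s(xi) k1(xi, xj) rho_s(xj).
    return rho_s[:, None] * k1 * rho_s[None, :]

The same rho_s values can be combined with any other fixed-shape base kernel, e.g., the Mahalanobis kernel, simply by substituting it for k1.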

Note that the kernel defined in (4.9) falls into the class of conformal kernels [84],

which define a conformal transformation preserving the angles in the kernel space.

Wu and Amari [107] use a conformal kernel to increase the influence of the samples

located around the decision boundary in an attempt to improve SVM classification.

This conformal kernel is modified in [106] to adaptively address the class imbalance

problem in SVM. Later, Gonen and Alpaydin [38] extend conformal kernels to multiple

kernel learning. In the present work, we have derived a completely different conformal

function ρs(x) which encodes the local density information such that the kernel can

adaptively vary its shape to fit the local data.

4.2.3 Window size

Our kernel function in (4.9) calculates the similarity of pairwise samples based

on the likelihood of the local densities. This is equivalent to evaluating the local

likelihood using windows of different sizes. A large-size window is used for the regions

where samples are distributed sparsely, while a small-size window is applied to the

regions where the data density is high. An advantage of the proposed kernel function

is that it can achieve this goal without changing the window size explicitly. To see

this, consider the case where the neighborhood around sample xi is sparse. That

means the local variance of xi is very large, yielding a large ρs(xi). When a kernel function such as the RBF is multiplied by ρs(xi), the resultant likelihood becomes large, which is equivalent to using a large-size window (a large σ in the RBF case).

The case for a high density region can be similarly observed. Therefore, our kernel

can adaptively change the window size of neighborhoods with different densities in

an implicit way.

Moreover, note that a fixed-shape kernel function such as the RBF kernel is a

special case of our kernel, where the local variance measure ρs(xi) is a constant for

every sample xi . Thus the function does not need to incorporate information of the

local density.

4.2.4 A case study

We provide a case study with the purpose of demonstrating the utility and advan-

tages of the newly derived kernel. We employ the RBF function in k1 (xi , xj ), since

we want to make a comparison between the proposed kernel and the RBF.

We generated a set of 500 samples for each of the two classes in the XOR problem,

Figure 4.3 (a). Each class is represented by a mixture of two Gaussians, i.e., two

subclasses per class. The means of these 4 subclasses are designed so that the data is

distributed in a XOR fashion, Figure 4.3 (a). In each class, the covariance matrices

of each subclass have different scales, controlled by a factor c such that Si2 = cSi1 ,

where Sij denotes the covariance matrix of the j th subclass in class i. The larger c is,


Figure 4.3: (a) A case study with synthetic data simulating the classical XOR problem.
(b) classification accuracies of the proposed LA and RBF kernels under different covariance
factors c. The proposed kernel obtains higher classification accuracies than the RBF as c
increases.

the more different the two covariance matrices become. Thus, this data-set allows us to evaluate the kernels under different conditions where the local

regions have different densities.
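A minimal sketch of how such a two-class XOR data-set can be generated (Python with numpy; the subclass means and base covariance below are illustrative values, not the exact ones used in our experiments) is:

import numpy as np

def xor_data(n_per_subclass=250, c=10.0, seed=0):
    rng = np.random.default_rng(seed)
    S1 = np.array([[0.5, 0.0], [0.0, 0.5]])   # base covariance S_i1
    S2 = c * S1                               # S_i2 = c * S_i1
    # Subclass means arranged in an XOR fashion.
    means = {0: [(-2.0, 2.0), (2.0, -2.0)],   # class 0
             1: [(2.0, 2.0), (-2.0, -2.0)]}   # class 1
    X, y = [], []
    for label, (m1, m2) in means.items():
        X.append(rng.multivariate_normal(m1, S1, n_per_subclass))
        X.append(rng.multivariate_normal(m2, S2, n_per_subclass))
        y += [label] * (2 * n_per_subclass)
    return np.vstack(X), np.array(y)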

We let KSVM be our classifier. The kernel parameters in the RBF kernel and

the proposed LA kernel are tuned using 5-fold cross-validation (CV). We then calcu-

late the classification accuracies using an independent test set drawn from the same

distributions.

We plot the classification accuracies with respect to different covariance factors c

in Figure 4.3 (b). We see that as c increases, the RBF results degrade rapidly, whereas

those of the LA kernel do not. This is because, as the local regions become more different in density, the LA kernel adapts to these density differences. We demonstrate

the utility of the proposed LA kernel using a variety of data-sets in Section 4.4.

4.3 Kernel Parameter Selection

Now that we have derived the LA kernel, the next question to answer is how to

select the kernel parameters. Given the kernel function, the success of the kernel

approach greatly depends on the selection of its kernel parameters. Next, we present

two mechanisms to achieve this goal.

4.3.1 k-fold cross-validation

A commonly used criterion to do parameter selection is k-fold cross-validation

(CV) [42]. The training data is first partitioned into k parts. k-1 parts are used for

training, the other for validation. This process is repeated k times for each possible

value of the parameters. The parameters leading to the largest average validation

accuracy are selected.
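For illustration, a grid search over a single kernel parameter with k-fold CV could be sketched as follows (Python; train_and_score is a hypothetical callback that trains the kernel method on one split and returns its validation accuracy):

import numpy as np

def select_parameter(X, y, candidates, train_and_score, k=5, seed=0):
    # X, y: numpy arrays with the training samples and their labels.
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    best_param, best_acc = None, -np.inf
    for param in candidates:
        accs = []
        for i in range(k):
            val = folds[i]                                       # validation part
            trn = np.hstack([folds[j] for j in range(k) if j != i])
            accs.append(train_and_score(X[trn], y[trn], X[val], y[val], param))
        if np.mean(accs) > best_acc:
            best_param, best_acc = param, np.mean(accs)
    return best_param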

4.3.2 Kernel Bayes accuracy criterion

A first major problem with CV is its complexity. The training process has to be

repeated k times and the parameters are selected based on an exhaustive grid search.

When it is applied to a large-scale database, it becomes very time consuming, which

limits its use in practice. The second major problem of CV is that only part of the

training data is used to estimate the model parameters. In general, one wishes to use

the largest possible number of training samples in search of better generalizations [63].

Here, we explore the KBA criterion [114] as in (5.7). It is an efficient approximation

of the Bayes classification accuracy in the kernel space.

We use this criterion to determine the optimal kernel parameters, i.e.,

σ*, η* = arg max_{σ, η} J(σ, η).   (4.10)

4.4 Experimental Results

We present the results on various classification problems. We provide results with

k1 (xi , xj ) in (4.9) equal to the RBF and the Mahalanobis kernels, which we denote

LAR and LAM , respectively. We provide a comparison of the performance of our LA

kernels with the classical use of RBF and the Mahalanobis (denoted MA).

We apply kernel learning to three well-known approaches: KSVM, KDA and

KSDA [113]. The parameters of these kernels, as well as the number of subclasses of each class in KSDA, are selected using the two criteria defined above:

5-fold CV and KBA. The regularization parameter in KSVM is also selected using

CV. In KDA and KSDA, we employ the nearest mean (NM) and the nearest neighbor

(NN) classifiers, denoted as KDA_NM, KDA_NN, KSDA_NM and KSDA_NN, respectively.

4.4.1 UCI benchmark data-sets

We apply the derived LA kernels to seven benchmark data-sets from the UCI

repository [7]. In the Monks problem, the goal is to discriminate two distinct postures

of a robot. Monk 1, 2, and 3 denote three different cases in this task. The NIH Pima

data-set is used for the detection of diabetes from eight measurements. In the BUPA

set liver disorders are detected from a blood test. The task in the Breast Cancer

data-set is to distinguish two classes: no-recurrence and recurrence. And, the goal of

the image segmentation data-set is to classify seven outdoor object categories from a

set of 3 × 3 image patches.

The classification results of these data-sets using CV are presented in Table 4.1.

In KDA and KSDA, the proposed kernels generally outperform the RBF and the

Mahalanobis kernels, regardless of the classifiers used in the reduced space. A similar

Table 4.1: Recognition rates (%) with CV in UCI data-sets.
Method KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
Data set LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
Monk 1 87.7 83.6 90.3 83.1 90.3 83.1 88.2 88.0 88.4 87.5
Monk 2 85.2 82.6 72.5 70.1 74.5 70.1 73.8 74.5 74.1 75.7
Monk 3 97.0 93.5 91.7 82.4 92.1 82.4 94.0 94.0 94.2 89.8
Pima 78.6 79.2 80.4 72.6 72.6 72.6 79.2 77.4 77.4 74.4
Liver 71.0 68.1 66.7 69.6 63.8 69.6 69.6 66.7 65.2 65.2
B. Cancer 72.8 70.1 68.8 67.5 68.8 66.2 68.8 59.7 66.2 71.4
Image-seg 93.3 91.2 93.1 90.7 94.1 93.0 93.1 90.7 94.1 93.0
Data set LAM MA LAM MA LAM MA LAM MA LAM MA
Monk 1 89.6 82.6 84.5 81.0 85.2 81.9 85.0 81.0 85.0 81.9
Monk 2 83.6 82.4 71.5 73.8 75.5 77.8 79.6 78.9 81.3 81.3
Monk 3 94.2 93.1 94.4 93.1 92.6 91.7 94.4 93.1 92.6 94.0
Pima 76.2 76.2 79.8 78.6 76.2 75.0 78.6 76.8 76.2 72.6
Liver 73.9 72.5 71.1 68.1 71.1 68.1 71.1 66.7 71.1 63.8
B. Cancer 74.0 68.8 66.2 72.7 68.8 62.2 66.2 63.6 68.8 65.0
Image-seg 92.2 93.4 92.1 91.5 92.4 91.8 91.1 90.9 91.1 90.6
The higher classification accuracies are bolded.

observation can be made in KSVM, where the proposed LA kernels provide higher

classification accuracies. The results with the KBA criterion are shown in Table 4.2.

We see that although the kernel parameters are now selected using a different criterion,

the proposed kernels still outperform classical kernels in most of the data-sets.

4.4.2 Image databases

To further demonstrate the utility of the proposed LA kernels in real-world ap-

plications, we apply them to two image databases. The first database we will use

is the ETH-80 [53]. This database is described in the previous chapters. We adopt

the typical leave-one-object-out test, i.e., the 41 images of one of the 80 objects are

Table 4.2: Recognition rates (%) with KBA criterion in UCI data-sets.
Method KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
Data set LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
Monk 1 94.7 94.0 86.8 87.3 87.5 87.3 87.0 88.0 88.1 89.4
Monk 2 82.2 79.9 78.7 78.2 79.2 82.9 79.2 78.2 80.6 82.9
Monk 3 96.5 95.1 94.9 92.6 93.1 92.6 96.3 94.0 95.1 94.7
Pima 77.4 81.5 81.6 76.2 78.6 76.2 81.6 78.6 78.6 73.2
Liver 72.5 68.1 65.2 65.2 60.9 63.7 69.6 65.2 66.7 59.4
B. Cancer 72.7 70.1 67.5 66.2 62.3 61.0 67.5 63.6 66.2 64.9
Image-seg 91.3 91.3 88.0 92.0 93.0 94.1 90.2 92.0 93.2 92.9
Data set LAM MA LAM MA LAM MA LAM MA LAM MA
Monk 1 85.2 84.7 82.9 85.2 83.1 85.2 85.0 84.0 83.8 83.6
Monk 2 83.1 83.8 82.2 83.3 82.6 83.6 80.6 79.0 83.1 81.5
Monk 3 94.0 92.8 93.3 91.4 92.1 91.4 94.7 93.1 94.7 93.3
Pima 82.1 76.8 81.0 76.2 74.4 70.8 78.6 78.0 72.6 76.8
Liver 73.9 71.0 69.6 68.1 69.6 68.1 69.6 68.1 69.6 68.1
B. Cancer 70.1 68.8 63.6 62.3 66.2 62.3 66.2 67.5 67.5 61.0
Image-seg 91.5 89.0 92.1 90.0 92.7 90.1 91.5 90.0 92.3 90.1

used for testing and the images of the rest of the objects are used for training. This

process is repeated 80 times and the average recognition rate is reported.

The results are shown in Tables 4.3 and 4.4. We see that our kernels generally out-

perform the RBF and the Mahalanobis kernels. Note that there is a big improvement

in KDA and KSDA in Table 4.3.

We also use the CMU PIE face database [86]. This database contains 68 subjects

with a total of 41,368 images. The face images were obtained under varying pose,

illumination and expression. We select the five near-frontal poses (C05, C07, C09,

C27, C29) and use all the images under different illuminations and expressions -

around 170 images for each person.

Table 4.3: Recognition rates (%) with CV in ETH-80.
KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
83.6 81.8 80.4 71.6 80.2 71.6 80.4 71.6 80.2 71.6
LAM MA LAM MA LAM MA LAM MA LAM MA
77.0 74.6 76.6 70.2 77.0 70.4 76.6 70.2 77.0 70.4

Table 4.4: Recognition rates (%) with KBA criterion in ETH-80.


KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
82.0 81.6 81.2 84.6 81.6 84.6 81.2 84.6 81.6 84.6
LAM MA LAM MA LAM MA LAM MA LAM MA
75.0 73.6 77.8 71.5 76.6 70.8 77.8 71.5 76.6 70.8

Figure 4.4: Shown here are sample images from the PIE data-set.

Table 4.5: Recognition rates (%) with CV in PIE database.
Method KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
N LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
5 72.5 69.7 75.3 72.6 75.9 73.1 75.3 72.6 75.9 73.1
20 89.2 87.6 93.4 92.1 94.2 92.4 93.3 84.8 94.3 84.8
40 94.7 93.2 96.3 94.6 96.7 94.9 96.5 92.6 96.8 92.6
60 96.6 95.5 97.3 96.1 97.4 96.5 97.3 96.4 97.4 96.4
80 98.4 98.0 97.7 96.8 98.0 97.2 97.7 96.5 98.0 96.4
N LAM MA LAM MA LAM MA LAM MA LAM MA
5 73.6 71.2 66.9 61.0 66.8 61.0 66.9 61.0 66.8 61.0
20 89.6 88.5 89.8 85.8 89.7 85.7 88.0 83.3 87.9 83.1
40 93.4 92.7 91.2 89.6 91.4 89.6 92.5 90.7 92.3 90.6
60 95.8 94.9 93.3 91.2 93.3 91.5 93.3 91.2 93.3 91.5
80 97.8 96.7 95.7 93.4 95.9 93.2 95.7 93.4 95.9 93.2

All the face images were aligned, cropped and resized to a standard size of 32 × 32

pixels. Some sample images are shown in Figure 4.4. For each individual, we randomly

selected N (N=5, 20, 40, 60, 80) images for training and used the rest for testing.

The comparative results obtained from KSVM, KDA and KSDA are shown in Tables

4.5 and 4.6. The LA kernel consistently achieves better recognition performance than

the RBF and the Mahalanobis kernels. Again, this illustrates the effectiveness of the

proposed approach.

4.5 Conclusions

The selection of a kernel function is a main issue in kernel-based learning. An ap-

propriately selected kernel function greatly increases the generalization performance

of the learning approach. This chapter proposes a class of density adaptive Mercer

kernels which evaluate the sample similarity by taking into account the local data

Table 4.6: Recognition rates (%) with KBA criterion in PIE database.
Method KSVM KDA_NM KDA_NN KSDA_NM KSDA_NN
N LAR RBF LAR RBF LAR RBF LAR RBF LAR RBF
5 73.6 70.1 75.2 72.6 75.8 73.0 75.2 72.6 75.8 73.0
20 88.7 81.7 91.8 84.8 92.0 84.8 92.0 72.1 92.1 72.1
40 94.7 91.4 95.2 93.2 95.9 93.3 96.1 93.2 96.2 93.3
60 96.7 95.0 97.0 95.5 96.9 95.5 97.3 95.5 97.4 95.5
80 97.6 96.9 97.5 96.6 97.7 96.6 97.7 96.6 98.0 96.6
N LAM MA LAM MA LAM MA LAM MA LAM MA
5 65.2 60.6 61.5 57.1 61.4 57.1 64.9 59.6 64.8 59.6
20 91.2 80.9 88.3 84.0 88.2 84.0 84.7 79.3 84.7 79.3
40 95.3 92.5 93.3 91.6 93.2 91.5 93.3 91.6 93.2 91.5
60 96.4 93.9 94.5 92.8 94.5 92.8 94.5 92.8 94.5 92.8
80 98.0 96.5 96.6 93.4 96.7 93.5 96.6 93.4 96.7 93.5

density. While the commonly used kernels such as the RBF and the Mahalanobis

kernels evaluate the entire data using a fixed window, the kernels derived in this

chapter can automatically adjust their window size to adapt to local regions with

different densities. This enables them to effectively handle data with multiple distri-

bution forms. The proposed LA kernel approach was successfully applied to KSVM,

KDA and KSDA and shown to yield higher classification accuracies than classical

options, e.g., RBF and Mahalanobis kernels.

CHAPTER 5

KERNEL MATRIX LEARNING WITH GENETIC ALGORITHMS

5.1 Introduction

Thus far, we have focused on the model selection problem where the kernel pa-

rameters are learned given a known kernel function. In many applications, however,

we do not have prior knowledge of the data. Thus, we do not know which kernel

function may perform better. A major open problem in kernel learning is to define

algorithms that find the kernel mapping function best suited to most problems. Ide-

ally, we want to find an appropriate kernel mapping without having to pre-specify

the kernel function (such as the typically employed RBF kernel).

Instead of learning the kernel parameters of a given kernel function, one could

try to directly learn the kernel matrix. Multiple kernel learning attempts to do just

that by combining a set of known kernel maps. For example, Cristianini et al. [18]

represent the kernel matrix as a linear combination of several pre-defined kernels.

The coefficients determining how to combine the kernels are learned by aligning the

matrices with a target label matrix. Other authors, [51, 4, 49, 111] employ convex

optimization techniques within the context of Support Vector Machines (SVM) and

Kernel Discriminant Analysis (KDA). [50] proposes to learn the kernel matrix by

penalizing an Lp norm of the combination coefficients, leading to a more general

framework of multiple kernel learning. And, Crammer et al. [15] propose a boosting

approach based on the exponential and logarithmic loss. Finally, several nonlinear

combinations of kernels have also been recently defined [95, 14].

The multiple kernel learning approach just described, however, suffers from two

main limitations. First, an explicit formulation to combine different kernels has to

be pre-specified. As it is common, some methods work best in one application while

others outperform them in different settings. Second, the kernel matrix can only be

searched within the space defined by these pre-specified functions. If the kernels and

their parameters are not appropriately specified, the learned kernel matrix will not

perform well in classification.

In this chapter, we derive an approach that overcomes these difficulties. Our ap-

proach borrows ideas from Genetic Algorithms (GA) to modify a large set (population)

of randomly initialized kernel matrices to optimize the metric induced by the kernel

mapping without the need to know the underlying kernel function. By doing so, we

also avoid the need to combine or optimize over several possible (or known) kernel

matrices.

Key to our approach is the definition of several novel operators in GA. The two

classical operators used in the literature are crossover and mutation. The former,

combines two or more individuals of the current population to generate an individual

of the next generation (called offspring). The second operator in GA adds random

mutations to existing individuals. These two procedures are however not sufficient to

efficiently search vast spaces [68], such as the one defining all possible kernel matrices.

In the present work, we derive three additional GA operators to facilitate this search.

One of the new operators emulates gene transposition. Consider the genome of a

species. Transposons are chunks of DNA that can move from one part of this genome

to another. This process was first described by Nobel Laureate Barbara McClintock

[64], when she noticed that the color changing pattern seen in corn is not random.

This effect was originally referred to as jumping genes. A typical gene transposition

is given by the cut-and-paste transposon. Here, enzymes cut a section of the DNA

and then insert it elsewhere. In our case, each genome describes a kernel matrix. A

cut-and-paste transposon will move a section of the matrix to another. As a result,

the classification function seen in one area of the feature space will now be applied

to another section of the space. If this results in a lower classification error rate, the

new matrix will be preferred over the old ones.

Another typical operator is insertion [81]. A typical case is that of viruses.

Lacking a reproductive system, viruses need to insert their genome into that of the

invaded cell for replication. By doing so, gene coding and non-coding sections of the

host genome can vary. In our case, the insertion of a new section in the matrix could

resolve misclassification in a localized section of the feature space.

Our third operator is deletion. In living organisms, sections of the genome may

be deleted during meiosis [81]. In our case, deletion of a section of the matrix could

rearrange the classifiers (i.e., norm defined by the kernel) in a positive way.

The GA operators defined above facilitate the search through a vast domain, thus

addressing the problem of multiple kernel learning listed above. After the matrices

of the current population have been modified to create the offsprings, we eliminate

those yielding the worst sample classification accuracies. The process is iterated until

convergence.

A problem with approaches that directly learn the kernel matrix (with no known

associated kernel function) is that they lack the capacity to map the test samples

to the kernel space. A common solution is to employ transductive learning [33].

Here, the testing data is used in combination with the training samples to resolve

the problem. Each time the testing data changes, the algorithm will compute a new

kernel matrix which can be used for both the training and testing sets. This approach

is computationally expensive and is not guaranteed to provide good results on the

test data because the kernel matrix has not been optimized for them.

To resolve these problems, we derive a regression-based method which estimates

the kernel values encoding the similarity between the training and testing samples

allowing us to map any new test sample. This eliminates the need of having to

relearn the kernel matrix each time a new test sample is to be classified. Our solution

is equivalent to estimating the underlying function represented by the learned kernel

matrix. We show that this approach yields superior results to transductive learning

since it directly represents the learned function rather than the training sample alone.

The rest of the chapter is organized as follows. Section 5.2 introduces the nuts

and bolts of the proposed genetic algorithm search. Section 5.3 derives the non-

linear regression learning of the underlying function defined by the kernel solution

for its application in classification. Section 5.4 does the same for regression. Section

5.5 provides comparative results with state-of-the-art algorithms. Conclusions are

presented in Section 5.6.

5.2 Learning with Genetic Algorithms

We start with a collection of p kernel matrices generated at random, {K1 , . . . , Kp }.

The current population is iteratively modified using a genetic-based algorithm until

convergence.

5.2.1 Feature representation

Genetic Algorithms (GA) constitute a set of tools that are well suited for solving

mathematical optimization problems in large spaces where there are multiple local

minima and no clear indication of how to find them [36]. This is especially practical

when the search space is so vast that, despite computational improvements, one would

require years (and potentially centuries) to solve the problems if a reasonable area of

the search space were to be explored.

In GA, we start with a set of genomes, each representing an individual. This

set of individuals is called the population. The first key step in GA is to define an

appropriate coding of the problem data as a genome. The most typical coding is

a feature vector with each element defining one of the parameters (or features or

variables) that play a role in our optimization problem. In this representation, each

entry in the feature vector codes for a directly relevant variable in the optimization

problem, Fig. 5.1(a).

In contrast to the classical coding approach described in the preceding paragraph,

we include non-coding segments in the feature vector (i.e., genome). As any biological

systems, the coding and non-coding segments alternate one another, Fig. 5.1(b). The

coding segments will be referred as genes (because they code for the kernel matrix

K which is our end result or outcome). This emulates the coding seen in actual


Figure 5.1: (a) The classical feature representation. Each entry in the feature vector codes
for a relevant variable in the optimization problem. (b) The proposed feature representation.
Each individual in the population is represented as a feature vector with coding and non-
coding segments. The lower case letters represent the coding (or gene) sequence used for
the calculation of the fitness function. Consecutive N labels indicate non-coding DNA.

biological systems. The elements defining the gene sequences are obtained from the

elements of L, with K = L^T L, where K is a kernel matrix, whereas the values of

the non-coding DNA sequences are generated at random. Each gene is preceded by a

fixed sequence (or gene marker). This specifies where each of the genes starts in the

genome. This is the typical approach used by cells in biology.

To reconstruct a kernel matrix from an individual (genome), we work as follows.

First we identify the positions of the gene markers, indicating where each coding DNA

sequence starts. Since the genes are of a specified length, they can be easily read,

concatenated and reshaped back to L. The kernel matrix K is then given by L^T L.
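A sketch of this decoding step is given below (Python; the marker pattern, the gene length and the assumption that L is reshaped to an n × n matrix are illustrative choices, not fixed by the method):

import numpy as np

def decode_genome(genome, marker, gene_len, n):
    # genome: 1-D array with alternating non-coding and coding segments,
    # each gene preceded by the fixed marker sequence.
    m = len(marker)
    genes, i = [], 0
    while i + m + gene_len <= len(genome):
        if np.array_equal(genome[i:i + m], marker):
            genes.append(genome[i + m:i + m + gene_len])   # read one gene
            i += m + gene_len
        else:
            i += 1                                          # skip non-coding DNA
    L = np.concatenate(genes)[:n * n].reshape(n, n)         # reshape back to L
    return L.T @ L                                          # K = L^T L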

The genome representation defined in this section will allow us to derive novel

operators, such as transposition, deletion and insertion. This is so because we can now

make use of the non-coding sections of the genome to address some of the limitations

of earlier operators. We discuss this in the sections to follow.

5.2.2 Basic operators

Most GA use two major operators: crossover and mutation. In crossover, two individuals, ui^[t] and uj^[t], of the current population (i.e., two kernel matrices in our case) are selected at random. Here, ui^[t] = (ui1, . . . , uiq)^T ∈ R^q, t specifies the iteration or population cycle, and i, j ∈ {1, . . . , p}, with p the number of individuals in the population. An integer r ∈ [1, q] is selected at random. Two offspring ui^[t+1] and uj^[t+1] (i.e., two individuals of the new generation) are obtained as

ui^[t+1] = (ui1, . . . , uir, uj,r+1, . . . , ujq)^T
uj^[t+1] = (uj1, . . . , ujr, ui,r+1, . . . , uiq)^T.   (5.1)

By combining two existing (good) solutions, we construct alternative kernel matrices

from a distant area of the search space, which could yield even higher classification

rates. While one of the matrices (say, ui ) helps classify samples in a region of the

feature space, the other matrix could be instrumental in the classification of the

samples in the rest of the feature space.

The mutation procedure is meant to add random jumps within the search space

which are unlikely to occur with crossover or gradient descent techniques. Some

mutations will add small changes, with the aim to jump over a local minimum. Other

mutations will add large changes, moving the search to a completely different region

of the search space. The mutation operation works as follows. An individual from the

Figure 5.2: This figure illustrates the copy-and-paste transposition.

current population is selected at random, uk^[t]. A number s of its entries are randomly selected, with s = q · pm, where pm is the mutation rate. Each of these entries uk^[t](li) is replaced by a random number bi as follows,

uk^[t](li) = bi,  li ∈ M,  i = 1, ..., s,   (5.2)

where M is the set containing the indices of the s selected entries. The mutation value used in the above equation is bounded by the minimum and maximum of all the entries of uk^[t].
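A sketch of these two basic operators on the vector representation (Python with numpy; the function names are ours) is:

import numpy as np

rng = np.random.default_rng()

def crossover(ui, uj):
    # Single-point crossover of Eq. (5.1).
    r = rng.integers(1, len(ui))
    return (np.concatenate([ui[:r], uj[r:]]),
            np.concatenate([uj[:r], ui[r:]]))

def mutate(u, p_m):
    # Mutation of Eq. (5.2): replace s = q * p_m random entries by random
    # values bounded by the minimum and maximum entries of u.
    u = u.copy()
    s = max(1, int(len(u) * p_m))
    idx = rng.choice(len(u), size=s, replace=False)
    u[idx] = rng.uniform(u.min(), u.max(), size=s)
    return u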

5.2.3 Transposition

While crossover and mutation are typically used in GA, nature makes use of a

large variety of tools to modify individuals in a population [81]. Here, we present

mathematical models of three of these: transposition, deletion and insertion.

As summarized earlier, transposition refers to chunks of DNA that move from

one location to another within the genome. In our search space, transposition would

apply a local norm (or classifier) to a different region of the feature space. A norm

that does not work well in one area of the space, may be what is needed in another.

We model two major transposition mechanisms. The first one is called copy-and-

paste. Here, a short sequence of DNA is copied to RNA by transcription, and then

copied back into (inserted as) DNA by reverse transcription at a new position. This is

illustrated in Figure 5.2. Due to transcription noise, the copied sequence may diverge

slightly from its former self. To model this, let v be a transposon, v = (v1, ..., vLt)^T, where Lt is its length. And, assume each entry of v is perturbed by a small Gaussian noise with a probability pv, i.e.,

v'i = vi + s zi,  i ∈ P,

where v'i is the entry after perturbation, s is the scale of the Gaussian noise, zi ∼ N(0, 1), and P is the set containing the indices of the perturbed entries. Suppose a genome u is selected and the insertion position is t; after copy-and-paste this becomes

u' = (u1, ..., ut, v'1, ..., v'Lt, ut+1, ..., uq)^T.   (5.3)

The second transposition mechanism we will model is called cut-and-paste. In

this case, a sequence of DNA is cut from its original position and inserted into a new

position of the same genome, Figure 5.3. Since this process does not involve an RNA

intermediate, it is not affected by noise. Formally, denote the cut position t' (with t' < t). Using the same notation as above, we define the new individual u' as

u' = (u1, ..., ut'-1, ut'+Lt, ..., ut, v1, ..., vLt, ut+1, ..., uq)^T.   (5.4)

The two transposition procedures described above work as follows. First, individ-

uals are selected at random at a transposition rate pt . A transposition location is

selected from a random location in the genome and used in either copy-and-paste or

cut-and-paste (at 50% each). Finally the transposon is inserted into a randomly cho-

sen position. Note that in the copy-and-paste mechanism, the length of the genome

Figure 5.3: This figure illustrates the cut-and-paste transposition.

is increased. This would not be admissible if we were using the classical feature repre-

sentation, but is not an issue when we employ the coding-non-coding model defined

in the preceding section.
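Both mechanisms can be sketched on the vector representation as follows (Python; Lt, the noise scale s and the perturbation probability p_v are as described above, and the positions are drawn at random):

import numpy as np

rng = np.random.default_rng()

def copy_and_paste(u, Lt, s, p_v):
    # Copy a transposon of length Lt, perturb it with Gaussian noise
    # (Eq. 5.3), and insert the copy at a new random position.
    start = rng.integers(0, len(u) - Lt)
    v = u[start:start + Lt].copy()
    perturb = rng.random(Lt) < p_v
    v[perturb] += s * rng.standard_normal(perturb.sum())
    t = rng.integers(0, len(u))
    return np.concatenate([u[:t + 1], v, u[t + 1:]])        # genome grows

def cut_and_paste(u, Lt):
    # Cut a transposon of length Lt and re-insert it elsewhere (Eq. 5.4).
    start = rng.integers(0, len(u) - Lt)
    v = u[start:start + Lt].copy()
    rest = np.concatenate([u[:start], u[start + Lt:]])
    t = rng.integers(0, len(rest))
    return np.concatenate([rest[:t + 1], v, rest[t + 1:]])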

5.2.4 Deletion and insertion

Next, we propose a mathematical model of deletions. In genetics, a deletion is a

type of genetic aberration in which a sequence of DNA of a genome is missing. Any

number of nucleotides can be deleted, from a single base pair to an entire piece of a

genome. In nature, deletion is generally harmful, but, in some occasions, can lead to

advantageous variations.

To model this process, we work as follows. An individual u is selected with prob-

ability (or, deletion rate) pd . A DNA segment v of u, starting at a random position

t, is chosen for deletion. More formally, denote u = (u1, ..., uq)^T, v = (v1, ..., vLd)^T, and let u' be the genome after deletion. Then,

u' = (u1, ..., ut-1, ut+Ld, ..., uq)^T.   (5.5)

The deletion length Ld is a random variable and is modeled as Ld ∼ N(μLd, σ²Ld), where μLd and σ²Ld are the mean and variance of Ld, respectively.


Figure 5.4: This figure illustrates gene deletion operation for two cases. (a) Only a non-
coding sequence is deleted. (b) A part of gene is deleted and a new gene is formed.

Note that the length of the genome is hence decreased. Since the position of

deletion is chosen at random, it is possible that only a sequence of non-coding DNA

is deleted, Figure 5.4(a). It is also possible to delete a coding part. In this latter case,

the non-coding DNA right after the deleted segment becomes the coding segment,

Figure 5.4(b).

Deletion can eliminate a local norm (or classifier) that was causing problems and

substitute this for a randomly initialized alternative that can be improved with the

other optimization tools. This procedure can be especially useful for leaving large

(close-to) flat areas of the optimization function.

Our final operator models insertions. In genetics, an insertion is a type of genetic

aberration in which a DNA sequence is inserted into a genome. A common cause

of insertions is viral infections, where viruses integrate their genome into that of the

host cell. The effect of insertion depends greatly on the location within the host's

genome.

To model insertions, we define a population of viruses Q = {q1 , ..., qr }, where qi

is an Lv × 1 vector, and r is the size of the population. The virus population is allowed

to evolve with the mutation operator from generation to generation.

Genomes u are selected at insertion rate pi . A virus qj is selected at random from

Q. A position t in u is randomly chosen and qj is inserted to u at t. The resulting

individual (after insertion) is given by

u' = (u1, ..., ut, qj1, ..., qjLv, ut+1, ..., uq)^T.   (5.6)
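Both aberrations can be sketched in the same vector representation (Python; the deletion-length statistics and the virus population Q are as defined above):

import numpy as np

rng = np.random.default_rng()

def deletion(u, mean_Ld, std_Ld):
    # Delete a random segment whose length follows N(mean_Ld, std_Ld^2),
    # as in Eq. (5.5); the genome shrinks.
    Ld = max(1, int(round(rng.normal(mean_Ld, std_Ld))))
    Ld = min(Ld, len(u) - 1)
    t = rng.integers(0, len(u) - Ld)
    return np.concatenate([u[:t], u[t + Ld:]])

def insertion(u, Q):
    # Insert a randomly chosen virus from the population Q at a random
    # position, as in Eq. (5.6); the genome grows.
    q_j = Q[rng.integers(0, len(Q))]
    t = rng.integers(0, len(u))
    return np.concatenate([u[:t + 1], q_j, u[t + 1:]])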

5.2.5 Selection criterion

The operators described above are used to generate d > p individuals. The number

of offspring d is usually twice p or larger. A selection criterion is then employed

to determine the best fitted p individuals that will survive and thus become the

member of the population at time t + 1.

The process starts with a population of p individuals generated at random and

from pre-specified kernel functions. This process combines the characteristics of differ-

ent kernel functions and introduces much needed randomness to the initial population.
[0]
The initial population set is formally defined as {K1 , . . . , K[0]
p }.

A selection criterion is then used to determine the most fitted individuals that

are to survive to the next iteration. Since our goal is classification accuracy, we

employ the Bayes accuracy criterion of [114], which is one minus the Bayes error

as calculated in the kernel space.

More formally, let X = {x11, . . . , x1n1, . . . , xCnC} be a given training set, where xij is the j-th sample in class i, ni is the number of samples in class i, and C is the total number of classes. Let φ(·) : R^l → F be a function defining the kernel map, where l is the dimension of the input space. We assume that the data has been whitened in the kernel space, and denote K as the whitened kernel matrix for the training samples, i.e., K = Φ(X)^T Φ(X), where Φ(X) = (φ(x11), . . . , φ(xini), . . . , φ(xCnC)). Then, the kernel Bayes accuracy (KBA) criterion is given by

J(Φ) = Σ_{i=1}^{C-1} Σ_{j=i+1}^{C} pi pj w(Δij) tr(Sij),   (5.7)

where pi is the prior of class i, Δij is the Mahalanobis distance in the kernel space, defined as Δij = √(1i^T Kii 1i - 2 · 1i^T Kij 1j + 1j^T Kjj 1j), Kij = Φ(Xi)^T Φ(Xj) is the subset of the kernel matrix for the samples in classes i and j, Φ(Xi) = (φ(xi1), . . . , φ(xini)), 1i is an ni × 1 vector with all elements equal to 1/ni, w(·) is a weighting function, with w(Δij) = (1/(2Δij²)) erf(Δij/(2√2)), where erf(x) = (2/√π) ∫_0^x e^(-t²) dt is the error function, and Sij is the kernelized between-class scatter matrix, with Sij = (Ki 1i - Kj 1j)(Ki 1i - Kj 1j)^T, and Ki = Φ(X)^T Φ(Xi) the subset of the kernel matrix for the samples in class i.

Optimizing (5.7) yields a kernel matrix $K^*$ corresponding to a kernel representation where the Bayes error is minimized. This is given by

$$K^* = \arg\max_{K} J(K). \qquad (5.8)$$
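The following is an illustrative evaluation of the KBA fitness in (5.7) from a whitened kernel matrix and the training labels; it is a sketch under the definitions above, not the dissertation's implementation:

```python
import numpy as np
from math import erf, sqrt

def kba_fitness(K, labels):
    """Illustrative evaluation of the kernel Bayes accuracy criterion, Eq. (5.7),
    for a whitened kernel matrix K and the training labels."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n = len(labels)
    J = 0.0
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            idx_i = np.where(labels == classes[a])[0]
            idx_j = np.where(labels == classes[b])[0]
            p_i, p_j = len(idx_i) / n, len(idx_j) / n
            one_i = np.full(len(idx_i), 1.0 / len(idx_i))   # vector 1_i
            one_j = np.full(len(idx_j), 1.0 / len(idx_j))   # vector 1_j
            # Mahalanobis distance Delta_ij between the two class means in the kernel space.
            d2 = (one_i @ K[np.ix_(idx_i, idx_i)] @ one_i
                  - 2.0 * one_i @ K[np.ix_(idx_i, idx_j)] @ one_j
                  + one_j @ K[np.ix_(idx_j, idx_j)] @ one_j)
            delta = sqrt(max(d2, 1e-12))
            w = erf(delta / (2.0 * sqrt(2.0))) / (2.0 * delta ** 2)   # weighting w(Delta_ij)
            # tr(S_ij) with S_ij = v v^T equals ||v||^2, where v = K_i 1_i - K_j 1_j.
            v = K[:, idx_i] @ one_i - K[:, idx_j] @ one_j
            J += p_i * p_j * w * float(v @ v)
    return J
```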

The kernel Bayes accuracy criterion defined in (5.7) is used to evaluate the fitness of these d offspring, i.e., $g_i = J(K_i)$, $i = 1, \ldots, d$, where $g_i$ is the fitness value of the ith genome. Then, the individuals that will form the new population are selected as

follows. First, an elitist selection strategy is applied. This means that the $p_f$ best fitted individuals are kept. Another set of $p_n$ individuals is randomly selected from the bottom 10%, i.e., the least fitted individuals. The values of $p_f$ and $p_n$ are selected to be approximately 5% of p. The first group is used to guarantee fast convergence. The second group is used to maintain diversity in the population, which may help us jump away from local minima in the future. The rest of the individuals, $p - p_f - p_n$, are

selected at random using a roulette wheel rule [36]. In the roulette wheel rule, the

probability of selecting the ith individual is given by

$$p_i = \frac{g_i}{\sum_{i=1}^{d - p_f - p_n} g_i}, \qquad (5.9)$$

where pi is the probability with which the ith individual is selected.
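A sketch of this survivor-selection step (elitism, a few least fitted individuals for diversity, and a roulette wheel over the rest); pf_count and pn_count are illustrative integer counts corresponding to $p_f$ and $p_n$, and positive fitness values (as with the KBA criterion) are assumed:

```python
import numpy as np

def select_survivors(fitness, p, pf_count, pn_count, rng=np.random.default_rng()):
    """Illustrative survivor selection: elitism, a few low-fitness individuals for
    diversity, and a roulette wheel (Eq. 5.9) over the remaining offspring."""
    fitness = np.asarray(fitness, dtype=float)       # assumes positive fitness values
    order = np.argsort(fitness)[::-1]                # best fitted first
    elite = list(order[:pf_count])                   # elitist selection (p_f best)
    bottom = order[-max(pn_count, len(fitness) // 10):]
    diverse = list(rng.choice(bottom, size=pn_count, replace=False))
    rest = np.setdiff1d(order, elite + diverse)      # candidates for the roulette wheel
    probs = fitness[rest] / fitness[rest].sum()
    wheel = list(rng.choice(rest, size=p - pf_count - pn_count, replace=False, p=probs))
    return elite + diverse + wheel                   # assumes d is comfortably larger than p
```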

The procedure described in this section is iterated until convergence. Convergence is given by

$$|g_m^{[t+1]} - g_m^{[t]}| < \epsilon, \qquad (5.10)$$

where $g_m^{[t]}$ is the maximum fitness value of the population at iteration t, and $\epsilon > 0$ is small.

To avoid problems with random initialization, we run the proposed approach

multiple times with different initial populations, then keep the solution with the

best fitted individual of the final populations. The proposed kernel matrix learning

algorithm is summarized in Algorithm 5.1.

5.3 Generalizing to Test Samples in Classification

Once we learn the optimal kernel matrix of the training data from Algorithm 5.1,

we can use it in any kernel-based approaches such as KDA [67, 5], Kernel Subclass

DA (KSDA) [113] and SVM [92]. However, the only information we have is the

kernel matrix for the training data and we do not know the corresponding explicit

Algorithm 5.1 Kernel Matrix Learning with GA
Input: Training set x_1, ..., x_n
Output: Kernel matrix K*
for i = 1 to a do
   Generate an initial population K_1, ..., K_p.
   repeat
      1. Generate new individuals with the operators in (5.1) to (5.6).
      2. Calculate the fitness value g_i of each new individual using (5.7).
      3. Select the survivors using (5.9).
   until |g_m^{[t+1]} - g_m^{[t]}| < $\epsilon$
   Output the most fitted individual, K(i).
end for
K* = arg max_i J(K(i))
Return: K*

kernel function to construct the kernel values which measure the similarity between

the training and testing samples.

This is a general problem in kernel matrix learning. A common solution is to

cast the classification problem as a transductive one. Given the labeled training and

unlabeled test samples, one generates a common kernel matrix including the two sets.

The kernel matrix is learned using an available approach, such as the one defined

in this chapter. This means we need to relearn the kernel matrix each time a new

testing sample becomes available. One could say that the learned mapping does not

generalize to new samples.

In the present section, we propose a novel solution to the above defined problem.

The idea is to estimate the underlying function represented by the learned kernel

matrix using regression. Formally, given X, i.e., a set of n training samples with known predictor vectors $y_i = (\langle x_1, x_i \rangle, \ldots, \langle x_n, x_i \rangle)^T \in \mathbb{R}^n$, where $\langle x_i, x_j \rangle$ is the (i, j)th entry in the learned kernel matrix, we want to find the function $f(\cdot)$ providing the best estimate of the true (but unknown) underlying function, where $f(x) = (f_1(x), \ldots, f_n(x))^T$, and $f_i(\cdot): \mathbb{R}^l \to \mathbb{R}$ is the ith regression function.

Let $f_i(x) = k_{x_i}(x) = \langle x_i, x \rangle$. To learn this underlying function, we need to use a non-linear approach. Kernel Ridge Regression (KRR) [42] provides the necessary flexibility and computational efficiency for this task. KRR minimizes the cost function

$$L(W) = \frac{1}{n} \sum_{i=1}^{n} \| y_i - W^T \phi(x_i) \|_2^2 + \lambda \| W \|_F^2, \qquad (5.11)$$

where $\phi(\cdot)$ is a function defining the kernel mapping, W is a projection matrix in the kernel space, $\lambda$ is a regularization parameter, $\| \cdot \|_2$ denotes the Euclidean norm of a vector and $\| \cdot \|_F$ is the Frobenius norm of a matrix.

The solution of the regressed function is given by

$$f(x) = Y (G + \lambda I_n)^{-1} g(x), \qquad (5.12)$$

where $Y = (y_1, \ldots, y_n)$ is an $n \times n$ predictor matrix, G is the Gram matrix with its (i, j)th entry defined as $G_{ij} = g(x_i, x_j)$ for some known kernel function g, $I_n$ is the $n \times n$ identity matrix, and $g(x) = (g(x_1, x), \ldots, g(x_n, x))^T$.

When a test sample z is to be classified, the corresponding prediction vector containing all the kernel values can be easily computed as

$$f(z) = (f_1(z), \ldots, f_n(z))^T, \qquad (5.13)$$

and can thus be readily used in any kernel-based approach.
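A minimal sketch of this regression-based extension, implementing (5.12)-(5.13) with an RBF kernel g (function names and parameter values are illustrative assumptions):

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF Gram matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_kernel_extension(X_train, K_learned, lam=1e-2, gamma=0.5):
    """Return a function z -> (k(x_1, z), ..., k(x_n, z)) estimated by KRR, Eqs. (5.12)-(5.13)."""
    G = rbf(X_train, X_train, gamma)                 # Gram matrix of the known kernel g
    n = G.shape[0]
    Y = K_learned                                    # predictor matrix: columns of the learned kernel
    A = Y @ np.linalg.inv(G + lam * np.eye(n))       # Y (G + lambda I_n)^(-1)
    return lambda z: A @ rbf(X_train, z.reshape(1, -1), gamma).ravel()

# Usage: f = fit_kernel_extension(X_train, K_star); f(z) then gives the estimated kernel
# values between a new test sample z and all training samples, which can be fed to KDA or SVM.
```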

5.4 Kernel Matrix Learning in Regression

Our kernel matrix learning approach is generic, since the learned kernel matrix can be plugged into any kernel-based method in settings such as classification, regression and clustering, provided that an appropriate fitness selection criterion for the GA is given. In this section, we extend our kernel learning framework to the regression problem to further demonstrate its utility.

For illustration we employ KRR [42], since it is commonly used in many applications. There are two types of parameters in KRR to be learned: the kernel matrix K and the regularization parameter $\lambda$. We use the proposed GA-based approach defined in the present work to jointly learn K and $\lambda$. The generalized cross-validation (GCV) [96] is extended to serve as the selection criterion in our GA.

GCV is used for selecting $\lambda$ in ridge regression, and can be formally written as

$$\mathrm{GCV}(\lambda) = \frac{n \| (I_n - H(\lambda)) y \|_2^2}{\left( \mathrm{tr}(I_n - H(\lambda)) \right)^2},$$

where $H(\lambda)$ is the hat matrix which projects the label y to the corresponding predicted label $\hat{y}$, i.e.,

$$\hat{y} = H(\lambda)\, y.$$

In KRR, the predicted labels $\hat{y}$ for the training data can be obtained by

$$\hat{y} = K (K + \lambda I_n)^{-1} y = H(K, \lambda)\, y.$$

We can thus optimize both K and $\lambda$ by minimizing

$$\mathrm{GCV}(K, \lambda) = \frac{n \| (I_n - H(K, \lambda)) y \|_2^2}{\left( \mathrm{tr}(I_n - H(K, \lambda)) \right)^2}. \qquad (5.14)$$

In order to jointly learn both K and $\lambda$, the value of $\lambda$ is added at the end of each genome (i.e., as a new allele). In this way, the GA operations and selection do not need to be modified.
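A sketch of the GCV fitness (5.14) for a candidate pair (K, $\lambda$), written with plain NumPy (illustrative only):

```python
import numpy as np

def gcv_fitness(K, y, lam):
    """Generalized cross-validation score of Eq. (5.14) for kernel ridge regression."""
    n = len(y)
    H = K @ np.linalg.inv(K + lam * np.eye(n))       # hat matrix H(K, lambda)
    resid = (np.eye(n) - H) @ y
    return n * float(resid @ resid) / (np.trace(np.eye(n) - H) ** 2)

# Within the GA, a smaller GCV value corresponds to a better fitted individual, so the
# fitness used for selection can be taken, for example, as -gcv_fitness(K, y, lam).
```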


Figure 5.5: (a) An XOR data classification problem. Samples shown as red triangles form one class and samples shown as blue circles form the other class. (b) This plot shows the classification accuracy over the number of generations.

5.5 Experimental Results

5.5.1 A toy example

We first present a toy example to illustrate how the kernel matrix evolves and im-

proves during the generations using our genetic-based algorithm. We consider a XOR

data classification problem, Fig. 5.5(a). The data set contains two classes, and each

class distribution is represented by a mixture of two Gaussians. An independent test-

ing set from the same class distributions is generated to test the proposed approach.

Fig. 5.5(b) demonstrates the classification accuracy as the number of generations

increases. We see that the classification accuracy gradually improves during the

generations and the algorithm converges in a few iterations.

We then illustrate how the learned kernel matrix evolves in Figure 5.6(a)-(f). In

the beginning, we observe that the sample similarity varies considerably within each class,

Figure 5.6: In this figure we show how the kernel matrix evolves. (a)-(f) illustrate the kernel matrix at the 1st, 2nd, 4th, 8th, 20th, and 30th generations, respectively.

which is due to the fact that the distance between the two clusters of the same class

is much larger than that within the same cluster. This means that the Euclidean

distance measure in the original space cannot capture the underlying sample similarity

within each class well. A good kernel matrix should indicate that the within-class

similarity is much larger than the between-class similarity. We can see in Figure 5.6

that using our algorithm, the within-class similarity gradually increases as the kernel

matrix evolves. This implies that our learned kernel matrix could induce a kernel

space where samples in the same class are as close as possible whereas samples in different classes are as far apart from each other as possible, leading to a much easier

classification problem.

To further evaluate how the kernel matrix is optimized during the generations, we adopt the kernel alignment [18] to measure how close a learned kernel matrix is to an ideal kernel matrix $K_0$, with $K_0(x_i, x_j) = 1$ if $y_i = y_j$, and 0 otherwise, where $y_i$ is the class label of $x_i$. The kernel alignment between K and $K_0$ is defined as

$$A(K, K_0) = \frac{\langle K, K_0 \rangle_F}{\sqrt{\langle K, K \rangle_F \langle K_0, K_0 \rangle_F}},$$

where $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product between two matrices, defined as $\langle K_1, K_2 \rangle_F = \sum_i \sum_j K_1(x_i, x_j) K_2(x_i, x_j)$. The higher the kernel alignment is, the more similar the

two kernel matrices are. The kernel alignment between the learned kernel matrix and

the ideal one is shown in Figure 5.7. We see that the learned kernel matrix gets closer

to the ideal kernel matrix as we have more generations.
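A one-function sketch of this alignment measure (illustrative; the ideal kernel $K_0$ is built directly from the label vector):

```python
import numpy as np

def kernel_alignment(K, labels):
    """Alignment A(K, K0) with the ideal kernel K0(x_i, x_j) = 1 if y_i = y_j and 0 otherwise."""
    labels = np.asarray(labels)
    K0 = (labels[:, None] == labels[None, :]).astype(float)
    num = np.sum(K * K0)                             # <K, K0>_F
    den = np.sqrt(np.sum(K * K) * np.sum(K0 * K0))   # sqrt(<K, K>_F <K0, K0>_F)
    return num / den
```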


Figure 5.7: This plot shows the kernel alignment between the learned kernel matrix and
the ideal one over the generations.

5.5.2 Classification algorithms

We employed the derived approach to learn the kernel mapping of two popular

algorithms KDA and SVM. We present comparative evaluations using a variety of

data-sets.

In KDA, the results are compared to kernel selection with CV, the Fisher criterion

of [108] and KBA [114]. These are denoted KDACV , KDAF , KDAK , respectively. The

nearest mean classifier is used in each of the corresponding subspaces. We choose this

classifier because it is Bayes optimal if the data in the kernel space is linearly sepa-

rable. We provide comparative results using transductive learning, denoted KDAT .

We also compare the proposed optimization approach to the traditional GA with

crossover and mutation operators only, denoted KDAT R .

In SVM, our results are compared to those obtained with CV, transductive learn-

ing, and traditional GA, SVMCV , SVMT and SVMT R . We also provide a comparison

with the multiple kernel learning algorithm of [4]. This algorithm applies sequen-

tial minimal optimization techniques (required in large-scale implementations) to a

smoothed version of a convex Moreau-Yosida optimization problem. We denote this

algorithm Support Kernel Machine (SKM). We also use this learned kernel matrix in

KDA and denote it KDAS . As a baseline, we also compare to the algorithm where

the kernel matrix is constructed from a uniform combination of different kernels. We

denote these algorithms KDAU and SVMU . For all the algorithms using a single

kernel function, the RBF kernel is used. For those algorithms where the parameters

are selected by CV, a 5-fold CV is conducted.

In order to demonstrate the effectiveness of the proposed regression-based gen-

eralization approach of Section 5.3, we compare it with a recently proposed semi-

supervised kernel matrix learning approach called kernel propagation (KP) [45]. In

this approach, the full kernel matrix is constructed from a seed-kernel matrix by max-

imizing the smoothness of the mapping over the data graph. The parameter of the

heat kernel used in calculating the affinity matrix is set as the averaged Euclidean dis-

tance from each data point to its ten nearest neighbors [45]. We denote this method

as KDAKP and SVMKP , respectively.

The initial population includes 30 individuals. We use random initialization and

a variety of commonly used kernels: RBF, polynomial, sigmoidal and Laplacian. A

typical range is given for the parameters of each kernel. The parameter for the RBF kernel is in $[m_1 - 2t_1, m_1 + 2t_1]$, where $m_1$ and $t_1$ are the mean and standard deviation of the pairwise sample squared distances; the parameter for the Laplacian kernel is in

Table 5.1: The parameters used in the experiments
Parameter          Value   Description
p_c                0.8     crossover rate
p_m                0.05    mutation rate
p_f                0.05    percentage of the best fitted individuals kept
p_n                0.03    percentage of the least fitted individuals kept
L_c                4       length of each gene
L_nc               10      length of each non-coding sequence
L_t                3       length of the transposon
p_t                0.02    transposition rate
s                  0.01    scale of the Gaussian noise in transposition
p_v                0.01    perturbation rate for each entry in the transposon
p_d                0.01    deletion rate
mu_{L_d}           3       mean of the deletion length
sigma^2_{L_d}      4       variance of the deletion length
p_i                0.1     insertion rate
L_v                5       length of each virus
r                  6       size of the virus population

$[m_2 - 2t_2, m_2 + 2t_2]$, where $m_2$ and $t_2$ are the mean and standard deviation of the pairwise sample distances; in the polynomial kernel, the degree is in [1, 5]; in the sigmoidal kernel, $k(x, y) = \tanh(a x^T y + r)$, $a = 1/p$, where p is the dimensionality of the data, and r is in [0, 1]. All kernels are aligned, i.e.,

$$\langle x_i, x_j \rangle' = \frac{\langle x_i, x_j \rangle}{\sqrt{\langle x_i, x_i \rangle \langle x_j, x_j \rangle}}. \qquad (5.15)$$
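As an illustration, a sketch of the normalization in (5.15) applied to a Gram matrix, together with the RBF parameter range used to seed the population (assumed helper names, plain NumPy):

```python
import numpy as np

def normalize_kernel(K):
    """Normalize a Gram matrix as in Eq. (5.15)."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def rbf_param_range(X):
    """Range [m1 - 2 t1, m1 + 2 t1] over the pairwise squared distances of the samples in X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    vals = d2[np.triu_indices_from(d2, k=1)]
    m1, t1 = vals.mean(), vals.std()
    return max(m1 - 2 * t1, 1e-12), m1 + 2 * t1      # keep the lower bound positive
```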

The parameter setup in our experiments is shown in Table 5.1. KRR is used to

train the embedding function, and the RBF kernel is used in KRR.

Table 5.2: KDA Recognition rates (in percentages) in the UCI data-sets.

Data-set KDAGA KDAT R KDAKP KDAT KDAU KDAS KDAK KDAF KDACV
Breast C. 76.4(2.9) 72.3(2.4) 70.4(3.9) 69.1(4.1) 65.8(3.7) 62.6(4.6) 68.4(3.3) 64.4(2.1) 66.6(5.2)
Ionosphere 93.4(1.3) 92.3(2.4) 87.2(2.3) 85.1(2.4) 94.6(2.0) 74.6(5.9) 80.6(4.1) 80.6(4.1) 86.6(1.2)
Liver 80.6(3.8) 76.8(3.1) 66.0(6.1) 66.4(5.8) 66.4(5.7) 69.9(7.1) 65.5(1.6) 65.8(4.4) 73.3(5.4)
Monk 1 94.0(3.7) 90.0(4.1) 86.2(4.0) 77.3(8.0) 84.7(4.5) 82.0(5.1) 85.3(5.1) 86.7(4.7) 84.0(6.0)
Monk 2 96.0(5.5) 95.3(5.1) 90.6(2.4) 92.7(4.4) 94.0(4.4) 93.3(5.3) 93.3(5.3) 90.7(4.4) 90.3(5.3)
Pima 78.4(1.6) 76.4(1.4) 70.6(4.9) 69.5(5.0) 71.6(3.5) 70.7(4.3) 71.2(3.5) 70.4(3.5) 72.5(2.5)

Table 5.3: SVM Recognition rates (%) in the UCI data-sets.

Data-set SVMGA SVMT R SVMKP SVMT SVMU SKM SVMK SVMCV


Breast C. 78.2(2.9) 74.2(3.7) 70.2(3.1) 66.5(8.8) 67.3(4.3) 62.2(5.1) 69.5(2.7) 70.2(2.1)
Ionosphere 96.9(1.2) 95.7(0.1) 92.6(3.0) 94.6(2.7) 94.3(2.0) 94.0(1.2) 93.4(2.1) 92.8(2.5)
Liver 80.6(1.9) 76.5(3.5) 68.9(5.6) 60.3(7.0) 71.3(6.8) 72.8(5.6) 71.6(5.5) 74.5(4.9)
Monk 1 95.3(3.8) 89.3(4.3) 80.5(6.8) 81.3(5.1) 86.0(2.8) 88.7(5.6) 89.3(4.3) 84.7(6.5)
Monk 2 97.3(2.8) 95.3(5.1) 93.0(4.0) 93.3(4.1) 93.3(5.8) 88.7(6.9) 94.0(4.9) 93.3(6.2)
Pima 79.2(1.5) 77.0(1.6) 69.2(4.0) 69.2(4.3) 73.2(2.3) 73.2(2.3) 72.6(2.4) 74.9(1.9)

5.5.3 UCI Repository

We apply the kernel learning approaches defined in this section to six data-sets

from the UCI repository [7]. In the Breast Cancer data-set, the task is to discriminate

two classes: no-recurrence and recurrence. The Ionosphere set is for the satellite

imaging detection of two classes (the presence or absence of structure) in the ground.

In the BUPA liver disorders set, blood test measurements are used to detect liver dysfunction. The goal of the Monk problem is to distinguish two distinct postures

of a robot. Monk 1 and 2 denote two alternative scenarios. Finally, the NIH Pima

data-set is used to detect diabetes from eight measurements.

For each data-set, we created five random partitions of the data, each with 80%

of the samples for training and the remaining 20% for testing. The successful classification rates on the above data-sets are shown in Tables 5.2 and 5.3. Both the mean and standard deviation (in parentheses) are reported. A paired t-test is used to check statistical

significance. The classification rate in bold is significantly higher than the others at

significance level 0.05. The proposed approach outperforms the other kernel learning

algorithms. The comparison of the proposed regression-based inductive learning ver-

sus the typical transductive alternative is also favorable to the proposed approach.

In addition, our approach does not need to re-estimate the kernel matrix every time

a previously unseen test sample is to be classified. Additionally, the approach de-

scribed in the present paper defines a smaller kernel matrix, with smaller memory

requirements.

We also report the training time in kernel matrix learning for each algorithm in

Table 5.4. Since no training is needed in the algorithm of uniform kernel combination,

we do not include this algorithm in the comparison. From Table 5.4, we first observe

Table 5.4: Average training time (in seconds) of each algorithm in the UCI data-sets.

Data-set GAOurs GAT SKM CV KBA


Breast C. 330.6 373.8 78.3 6.6 0.5
Ionosphere 339.1 737.2 39.2 13.5 2.0
Liver 409.7 1071.7 126.2 50.3 5.7
Monk 1 275.5 311.9 41.1 1.3 0.3
Monk 2 47.7 75.9 20.2 1.5 1.9
Pima 3095.8 4762.9 2681.2 66.5 10.9

that all the algorithms with multiple kernels need more training time than those with

a single kernel. As we discussed before, transductive learning is computationally

expensive and slower than our algorithm. Yet, our algorithm is slower than SKM in

these binary classification data-sets. However, we will see later that SKM becomes

much more time consuming when multi-class classification is performed.

A general question in GA-based approaches is to know how fast the algorithm

converges. This is, of course, problem specific. Figure 5.8(a) and (b) plot the classi-

fication accuracy as a function of iterations for two of the databases used above. To

obtain these plots, we executed our approach 50 times. The figures show the mean

and standard deviation. We observed a rapid convergence on the data-sets used.

Another interesting question is how well the proposed optimization approach com-

pares to the traditional GA algorithm with crossover and mutation only. Moreover,

how do the proposed advanced GA operators help to improve the kernel matrix? To

see this, we present additional plots with the traditional GA algorithm and each of

the proposed operators only. First, in Figure 5.8(b) and (g), we see that the tra-

ditional GA algorithm can improve the classification accuracy as the kernel matrix


Figure 5.8: Plots of the classification accuracy (y-axis) versus the number of generations (x-axis). The plots from (a) to (e) were obtained with different optimization approaches applied to KDA using the Monk 1 database, and the plots from (f) to (j) were obtained with different optimization approaches applied to SVM using the Breast Cancer database. (a) and (f) show the proposed genetic-based optimization approach. (b) and (g) show the traditional GA algorithm with crossover and mutation only. (c) and (h) show the GA algorithm with the transposition operator only. (d) and (i) show the GA algorithm with the deletion operator only. (e) and (j) show the GA algorithm with the insertion operator only.
Table 5.5: KDA Recognition rates (%) for large data-sets.

KDAGA KDAT R KDAKP KDAT KDAU KDAS KDAK KDAF KDACV


PIE10 78.3(1.4) 73.2(2.2) 70.6(2.0) 72.8(1.7) 74.4(1.7) 75.6(1.0) 61.0(2.0) 59.7(1.9) 64.5(1.6)
PIE20 90.8(1.0) 87.3(2.0) 86.5(0.9) 87.5(0.8) 88.4(0.7) 88.9(0.9) 86.9(0.9) 86.3(0.9) 86.8(1.6)
PIE30 93.9(0.6) 93.6(0.7) 89.6(1.1) 90.8(0.7) 93.4(1.0) 92.1(0.8) 93.3(0.8) 92.9(0.8) 92.5(0.7)
SPDM 85.7(0.8) 84.0(0.8) 82.6(0.9) 83.5(1.0) 83.6(0.9) 83.6(1.0) 83.8(0.9) 83.9(1.0) 84.0(1.2)

evolves. However, the final accuracies it obtains are lower than those obtained by the

proposed optimization approach. This means that the proposed additional operators

could further facilitate the optimization process and improve the classification per-

formance. From Figure 5.8(c)-(e) and (h)-(j), we see that each of the proposed new

operators can help optimize the kernel matrix and improve the classification accuracy to some extent. For the same data-set, one operator may perform better than another, e.g., Figure 5.8(g) and (d). Some operators work better on one data-set than on another, e.g., Figure 5.8(e) and (j). It is the combination of all of these

operators that makes our approach more effective in classification.

5.5.4 Large databases

Our next experiment is on the PIE data-set of face images [86]. Here, the task

is to classify faces according to the identity of the individual shown in the image.

All face images were aligned with regard to the main facial features and resized to a

standard size of 32 × 32 pixels, as in [60]. The results are in Tables 5.5 and 5.6. In these

tables, N specifies the number of images per class used to train the kernel matrix.

Table 5.6: SVM Recognition rates (%) for large data-sets.

Data-set SVMGA SVMT R SVMKP SVMT SVMU SKM SVMK SVMCV


PIE10 75.6(1.0) 71.3(2.7) 71.5(1.6) 72.3(1.2) 62.5(2.0) 73.5(1.0) 53.7(2.3) 67.3(2.3)
PIE20 87.3(0.4) 82.7(0.8) 83.4(0.6) 86.1(0.4) 83.0(0.4) 85.1(0.7) 80.8(0.6) 86.0(0.4)
PIE30 92.0(0.8) 91.5(0.8) 89.6(1.0) 90.7(0.6) 90.5(1.4) 90.9(1.0) 90.8(1.0) 92.0(0.8)
SPDM 85.9(0.7) 85.0(1.7) 80.5(0.4) 82.3(1.0) 84.6(1.2) 85.2(0.9) 85.5(0.9) 85.1(1.0)

Table 5.7: Average training time (in seconds) of each algorithm in large data-sets.

Data-set  GAOurs      GAT         SKM         CV          KBA

PIE10     4.7 × 10^4  1.0 × 10^5  3.6 × 10^5  1.1 × 10^3  2.2 × 10^2
PIE20     6.5 × 10^4  2.2 × 10^5  6.2 × 10^5  4.0 × 10^3  6.4 × 10^2
PIE30     1.2 × 10^5  5.4 × 10^5  8.7 × 10^5  7.5 × 10^3  1.2 × 10^3
SPDM      9.1 × 10^4  2.3 × 10^5  5.6 × 10^4  2.5 × 10^3  2.7 × 10^2

The results are averaged over five random trials. As above, the proposed approach

outperforms the others.

We also used the Sitting Posture Distribution Maps (SPDM) data-set of [117]. In

this data-set, samples were collected using a chair equipped with a pressure sensor

sheet located on the sit-pan and back-rest of a chair. A total of 1,280 pressure values

from 50 individuals are provided from the pressure maps. There are five samples of

each of the ten different postures per individual. The goal is to classify each of the

samples into one of the ten sitting postures. We randomly selected 3 samples of each

posture and each individual for training, and used the rest for testing. The results

are then averaged over five trials. The results are shown in Tables 5.5 and 5.6. The proposed approach performs better than the others.

We report the average training time of each algorithm in Table 5.7. We again

see that our algorithm is faster than transductive learning. Moreover, in this case,

our algorithm is also faster than SKM. This is because SKM can only learn a kernel matrix for two classes. When there are multiple classes, we have to use a one-versus-one mechanism to extend it to the multi-class case. Thus, the training time greatly depends on the total number of classes. The more classes there are, the more training time it takes. In contrast, our algorithm can directly deal with the multi-class case and is thus more efficient.

5.5.5 Discussions of the genetic operators

In this section, we give a detailed discussion of how the genetic operators help

to optimize the kernel matrix. First, note that each genome u on which the genetic

operators are directly applied is formed by concatenating all the entries of a matrix L,

where $K = L^T L$, and K is the kernel matrix to be learned. Denote $L = (l_1, l_2, \ldots, l_n)$, where $l_i$ is an $n \times 1$ vector and n is the number of training samples. Then

$$K = \begin{pmatrix} l_1^T l_1 & \cdots & l_1^T l_n \\ \vdots & \ddots & \vdots \\ l_n^T l_1 & \cdots & l_n^T l_n \end{pmatrix}.$$

This means that $K(x_i, x_j) = l_i^T l_j$. Thus, the changes in genome u will result in the

corresponding changes in the entries of the kernel matrix K. Now that we have this

interpretation, we can discuss how each genetic operator works to improve the kernel

matrix.
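A simplified illustration of this encoding (it ignores the coding/non-coding structure and assumes the genome u simply stacks the columns of L):

```python
import numpy as np

def genome_to_kernel(u, n):
    """Reshape a genome (the stacked columns of L) into L and return K = L^T L."""
    L = np.asarray(u, dtype=float).reshape(n, n, order='F')   # columns l_1, ..., l_n
    return L.T @ L                                            # K(x_i, x_j) = l_i^T l_j

n = 4
K = genome_to_kernel(np.random.randn(n * n), n)
# K is symmetric and positive semi-definite by construction, i.e., a valid kernel matrix.
print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) >= -1e-10))
```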

In crossover, two offspring are obtained by combining two existing solutions as in (5.1). For ease of discussion, suppose the crossover position r is a multiple of n, i.e., r = mn, where m is an integer. After crossover, one of the offspring, $u_i^{[t+1]}$, is reshaped to form a new matrix, $L_i^{[t+1]}$, with $L_i^{[t+1]} = (l_1^i, \ldots, l_m^i, l_{m+1}^j, \ldots, l_n^j)$, where $l_k^i$ is the kth column of $L_i^{[t]}$. Then the corresponding kernel matrix is reconstructed as $K_i^{[t+1]} = L_i^{[t+1]T} L_i^{[t+1]}$, which is

$$K_i^{[t+1]} = \begin{pmatrix}
l_1^{iT} l_1^i & \cdots & l_1^{iT} l_m^i & l_1^{iT} l_{m+1}^j & \cdots & l_1^{iT} l_n^j \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
l_m^{iT} l_1^i & \cdots & l_m^{iT} l_m^i & l_m^{iT} l_{m+1}^j & \cdots & l_m^{iT} l_n^j \\
l_{m+1}^{jT} l_1^i & \cdots & l_{m+1}^{jT} l_m^i & l_{m+1}^{jT} l_{m+1}^j & \cdots & l_{m+1}^{jT} l_n^j \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
l_n^{jT} l_1^i & \cdots & l_n^{jT} l_m^i & l_n^{jT} l_{m+1}^j & \cdots & l_n^{jT} l_n^j
\end{pmatrix}.$$
Comparing $K_i^{[t+1]}$ with $K_i^{[t]}$, we see that a submatrix of $K_i^{[t]}$, namely its entries in rows $m+1, \ldots, n$ and columns $1, \ldots, n$, has been replaced (note that since the kernel matrix is symmetric, only the lower off-diagonal elements are considered). This submatrix corresponds to the classification of samples in one particular region of the feature space. Given $K_i^{[t]}$ and $K_j^{[t]}$, if the corresponding submatrix in $K_j^{[t]}$ can do better classification than that in $K_i^{[t]}$, then after crossover, the offspring $K_i^{[t+1]}$ can improve the classification in the region represented by this submatrix in $K_j^{[t]}$.

In the insertion operator, a random sequence is inserted to a randomly selected

position of a genome. Note that our feature representation incorporates the non-

coding sequences into the genome, allowing a flexible length of the genome. Thus,

the insertion of a sequence corresponds to a local change of the kernel matrix. More

formally, suppose that the insertion of the sequence causes a change of a vector $l_q$ in $L_i^{[t]}$, then this will result in a change of the corresponding row and column in the kernel matrix $K_i^{[t+1]}$, i.e.,

$$K_i^{[t+1]} = \begin{pmatrix}
 &  & l_1^T l_q &  &  \\
 &  & \vdots &  &  \\
l_q^T l_1 & \cdots & l_q^T l_q & \cdots & l_q^T l_n \\
 &  & \vdots &  &  \\
 &  & l_n^T l_q &  &
\end{pmatrix}. \qquad (5.16)$$

This change will affect the similarity between the q th sample and all the other samples

in the data. As a result, the local classification function is changed by insertion, which

could help to resolve the misclassification in a local region of the feature space.

In the deletion operator, a random sequence is deleted from a random position

of the genome, which leads to a corresponding local change of the kernel matrix. If

we again suppose the deletion will cause a change of a vector $l_q$, similarly to the insertion

operator, this will result in a change of the corresponding row and column in the

kernel matrix, as in (5.16). By deletion, the local classification function is rearranged

such that the classification in a section of the feature space could be improved.

In the copy-and-paste transposition, a sequence of the genome is copied and inserted at a new position in the same genome. Suppose the transposon comes from $l_p$ and is copied to $l_q$. This will cause a change of the qth column and row of the kernel matrix. This

implies that a local classification function with good performance is now applied to

a new region in the feature space. If this improves classification, then the new kernel

matrix will be selected.

In the cut-and-paste transposition, a sequence of the genome is removed and inserted at a new position in the same genome. Again, suppose the transposon comes from $l_p$ and is moved to $l_q$. This will cause a change of both the pth and qth columns and rows of the kernel matrix. This implies that a local classification function that does not work well in one section of the feature space will now be applied to a new section of the feature space. If this improves classification, then the new kernel matrix will be selected over

the old one.

5.5.6 Application to regression

We select 7 data-sets from the UCI machine learning [7] and the DELVE collections

[29].

In the Boston housing data-set, the task is to predict the median value of a home

price. The auto mpg set details fuel consumption predicted in terms of 3 discrete

and 4 continuous attributes. In the Normtemp set, the goal is to predict the heart

rate based on gender and body temperature of 130 people. The Airport set requires

prediction of the enplaned revenue in tons of mail. The task in the Puma-8nm is

to predict angular accreditation from a simulation of the dynamics of a robot arm.

And, the Kin problem requires us to predict the distance of the end-effector from a

target in a simulation of the forward dynamics of an 8 link all-revolute robot arm.

Two cases with moderate and high amount of noise are considered, denoted Kin-8nm

and Kin-8nh.

For the first four data-sets, we randomly select 90% of the samples for training,

and use the rest for testing. This is repeated 10 times and the mean and standard

deviation of the errors are reported. The remaining databases have a larger number

of samples, allowing a random split into disjoint subsets. The first 1,024 samples

in each subset are used for training, while the others form the testing set. Again,

we report the mean and standard deviation of the errors of four splits. We use the

root mean squared error (RMSE) as our measure of the deviation between the true
Pn 1/2
response yi and the predicted response yi , i.e., RMSE = [n1 i=1 (yi yi )2 ] .
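For completeness, the error measure as a small illustrative helper:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between the true and predicted responses."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```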

We compare the proposed approach with two state-of-the-art regression methods,

KRR and Support Vector Regression (SVR). In KRR, the parameters are selected

by CV and GCV, denoted by KRRCV and KRRGCV , respectively. The parameters

in SVR are selected by CV. A recent work [87] introduces the use of multiple kernels

into SVR, allowing a multiple kernel learning approach for regression by using semi-

infinite linear programming. Later, [76] shows how regression with multiple kernel

learning is performed by quadratically constrained quadratic programming. Here

we compare to the approach in [76], and denote it MKL-SVR. Another work [13]

performs multiple kernel learning in the context of KRR. We denote it MKL-KRR.

We also provide comparative results with transductive learning, the traditional GA,

and uniform kernel combination, denoted KRRT , KRRT R and KRRU . The results of

KP for generalizing to new data are also reported, denoted KRRKP .

The regression performances of all algorithms are shown in Table 5.8. The pro-

posed kernel approach is generally superior to the other state-of-the-art algorithms.

We also show the training time in Table 5.9. We see that our algorithm takes com-

parable training time to the other two multiple kernel learning algorithms, i.e.,

Table 5.8: Mean and standard deviation of the RMSE.

Data-set KRRGA KRRT R KRRKP KRRT KRRCV KRRGCV SVRCV MKL-SVR KRRU MKL-KRR
Housing 2.75(0.77) 2.66(0.46) 5.73(2.54) 2.93(0.90) 3.27(0.79) 3.35(1.08) 3.35(1.30) 3.11(1.09) 2.52(0.77) 2.53(0.84)
Mpg 2.24(0.26) 2.73(0.60) 2.96(0.50) 2.50(0.45) 2.62(0.28) 2.96(0.60) 3.01(0.66) 2.82(0.73) 2.76(0.35) 2.70(0.35)
Normtemp 5.56(1.15) 6.79(1.08) 7.24(1.40) 6.45(1.15) 7.00(0.79) 7.44(0.85) 7.35(1.30) 7.58(1.60) 7.85(0.80) 8.32(1.42)
Puma-8nm 1.40(0.02) 1.44(0.02) 3.11(0.02) 1.51(0.03) 1.62(0.02) 1.60(0.03) 1.44(0.03) 2.27(0.42) 1.70(0.04) 1.77(0.04)
Puma-8nh 3.52(0.06) 3.52(0.11) 4.18(0.11) 3.61(0.07) 3.56(0.08) 3.54(0.09) 3.46(0.13) 3.68(0.08) 3.72(0.07) 3.66(0.07)
Kin-8nm 0.10(0.003) 0.11(0.003) 0.16(0.002) 0.12(0.001) 0.14(0.002) 0.13(0.004) 0.11(0.002) 0.12(0.01) 0.10(0.003) 0.11(0.002)
Kin-8nh 0.18(0.003) 0.18(0.003) 0.21(0.004) 0.20(0.003) 0.19(0.004) 0.19(0.003) 0.19(0.005) 0.19(0.009) 0.18(0.004) 0.18(0.004)

Table 5.9: Average training time (in seconds) of each algorithm.

Data-set   GAOurs      GAT         CV          GCV         MKL-SVR     MKL-KRR
Housing    6.6 × 10^3  9.7 × 10^4  1.4 × 10^3  3.6 × 10^2  5.5 × 10^3  5.8 × 10^3
Mpg        1.8 × 10^3  4.5 × 10^3  7.7 × 10^2  71.0        2.0 × 10^3  1.8 × 10^3
Normtemp   150.0       550.0       19.6        3.7         80.1        46.9
Puma-8nm   4.0 × 10^4  2.0 × 10^5  1.7 × 10^4  9.9 × 10^2  2.4 × 10^4  1.7 × 10^4
Puma-8nh   4.0 × 10^4  1.9 × 10^5  1.3 × 10^4  8.3 × 10^2  2.0 × 10^4  1.3 × 10^4
Kin-8nm    3.7 × 10^4  1.7 × 10^5  9.0 × 10^3  8.9 × 10^2  2.4 × 10^4  1.7 × 10^4
Kin-8nh    3.6 × 10^4  1.7 × 10^5  2.0 × 10^4  1.1 × 10^3  1.7 × 10^4  1.7 × 10^4

MKL-SVR and MKL-KRR, but has an advantage that better prediction is achieved.

To conclude, we apply our approach to age estimation from face images. The aging process can induce significant changes in human facial appearance, which are generally

detectable in images. We used the FG-NET aging database of [2] to model these

changes. This database includes 1,002 face images of 82 subjects at different ages.

The ages range from 0 to 69. Face images include changes in illumination, pose,

expression and occlusion (e.g., glasses and beards). All images are warped to a

Table 5.10: MAE of the proposed approach and the state-of-the-art in age estimation.

Data-set KRRGA KRRT R KRRKP KRRT KRRCV KRRGCV SVRCV MKL-SVR KRRU MKL-KRR
MAE 5.87(0.22) 5.95(0.31) 12.89(0.65) 6.31(0.30) 6.59(0.31) 13.83(0.79) 6.46(0.35) 7.18(0.46) 27.2(19.7) 8.05(0.40)

standard size of 60 × 60 pixels with all major facial features properly aligned, as in

[60]. We represent each image as a vector concatenating all the pixels of the image,

i.e., the appearance-based feature representation.

We generate five random partitions of the data, each with 800 images for training

and 202 for testing. The mean absolute errors (MAE) are in Table 5.10. Again, we

can see that the proposed approach outperforms the other algorithms in predicting

the age of individuals.

5.6 Conclusions

We have proposed a genetic-based optimization mechanism to find the kernel map

minimizing the classification error of complex, non-linearly separable problems. In

particular, we introduced a coding-non-coding representation and defined three novel

operators transposition, insertion and deletion. These include viral infections that

result in DNA changes and yields an efficient search strategy within the vast space

of all possible kernel matrices. Regression is then used to estimate the underlying

mapping function given by the resulting kernel matrix, resolving the complexity is-

sues of transductive learning. We also extend the proposed kernel matrix learning

framework to work in regression. Comparative results against classical kernel meth-

ods demonstrate the superiority of the proposed approach. We have also shown fast

convergence on the databases used.

CHAPTER 6

CONCLUSIONS AND FUTURE WORK

6.1 Conclusions

Kernel methods have been extensively used in machine learning and shown to have

good generalization ability in many applications. A key problem in kernel methods

is how to determine the mapping model that leads to better learning and improved

generalization performance. This dissertation gives a comprehensive study of the

model selection problems for kernel methods in pattern recognition and machine

learning. We focus on two typical scenarios in supervised learning: classification and

regression. In each scenario, we have proposed several novel approaches to learning.

This involves learning both the kernel mappings and parameters.

In Chapter 2, we derived two criteria to optimize the kernel parameters given a

parametrized kernel function in classification. Many approaches have been proposed

for kernel optimization in the literature, but these are not directly related to the

idea of the Bayes optimal classifier in the kernel space, which is the classifier with the

smallest possible classification error. Our approaches are inspired by Bayes optimality

and we fully exploit this idea. In the first approach, we want to achieve the original

goal of the kernel mapping: the class distributions in the kernel space can be linearly

separated. To do this, we first derive a homoscedastic criterion which measures the

degree of homoscedasticity of the class distributions. Then, the kernel parameters

can be optimized by simultaneously maximizing the homoscedasticity and separabil-

ity between the pairwise class distributions. This optimization enforces the linear

separability of the classes to the largest extent. To relax the single Gaussian distri-

bution assumption for each class, we use a mixture of Gaussians to define each class

and show that our criterion can be easily modified to adapt to this new modeling. We

also show how our approach can be efficiently employed using a quasi-Newton based

optimization technique.

In the second criterion, instead of exploring a linear classifier, we directly mini-

mize the Bayes classification error in the kernel space over all the kernel mappings

to optimize the kernel parameters. This is plausible because different kernel presen-

tations result in different Bayes error. We first derive an effective measure which

approximates the Bayes accuracy (defined by one minus Bayes error) in the kernel

space, and then maximize this measure to find the optimal kernel parameters. We

further show how to employ our criterion to discover the underlying subclass divi-

sions in each class. Extensive experiments using a number of well-known databases in

object categorization, face recognition, and handwritten digit classification demonstrate

both the effectiveness and efficiency of our methods over the state of the art.

In Chapter 3, we propose a framework to do model selection in kernel-based re-

gression approaches. Model selection in linear regression has been largely studied.

However, it is not adequately explored in nonlinear regression. The goal is to achieve

a good balance between the model fit and model complexity in a regression model.

From the well-known bias-variance trade-off, we know we cannot simultaneously re-

duce both of them. If one is reduced, the other increases, and vice versa. We

first derive measures for model fit and model complexity from a decomposition of the

generalization error of the learned function and show that balancing the two measures

is equivalent to minimizing the generalization error. Then, we adopt a multiobjec-

tive optimization approach to balance the two measures by exploring Pareto-optimal

solutions. A modified $\epsilon$-constraint method is presented to guarantee the solutions

to be Pareto-optimal. The proposed model selection approach is applied to kernel

ridge regression and kernel principal component regression, which are two popularly

used kernel-based regression methods. Experiments using many benchmark data-sets

show that the proposed approach performs generally better than other model selection

methods and state of the art regression approaches.

In kernel methods literature, the Gaussian RBF kernel is one of the most popularly

and successfully used kernels. In this kernel, the sample similarity is evaluated using

a fixed local window size. Thus, the estimation with over-fitting or under-fitting

problems may arise if the local data density changes. We introduce a new family of

kernels called Local Density Adaptive Kernels in Chapter 4. The window size of our

kernels can vary to adaptively fit the local data density, thus giving a better likelihood

evaluation. Although our kernels implicitly change their shape, we show that they are Mercer kernels, and hence can be directly used in any kernel method such as

Kernel Discriminant Analysis and Support Vector Machine. We then show that our

kernels outperform the fixed-shape kernels such as the RBF kernel and Mahalanobis

kernel in many applications.

Thus far we have only considered a single kernel function in kernel methods. In many

applications, the use of multiple kernel functions would be more appropriate since it

combines the characteristics of all kernels, leading to better learning. In the literature,

many approaches have been proposed to construct a linear or nonlinear combination

of multiple kernels, which requires a pre-specified formulation for the combination. Un-

fortunately, no prior knowledge is available to indicate which combination is better.

To resolve this, we introduced a new multiple kernel learning approach in Chap-

ter 5 by employing genetic algorithms. The main advantage of our method is that

there is no need to specify an explicit combination of multiple kernels, and the ker-

nel matrix can evolve during the generations using the genetic operators until the

classification/prediction error falls below a given threshold. We also introduce a new

genetic representation for each kernel matrix and present more advanced operators to

facilitate the optimization process. We then show how to learn a mapping function

represented by the learned kernel matrix to generalize to the test data. We applied

our kernel matrix learning algorithm to both classification and regression.

6.2 Future work

In this dissertation we have addressed one important problem in kernel meth-

ods, i.e., model selection. This problem directly determines the performance of ker-

nel methods. Another important problem is the computational cost of these kernel

methods. This involves both computational time and memory. For a data-set with n

samples, the complexity of a kernel algorithm is typically $O(n^3)$. If n is very large,

then it is computationally expensive. Also, a kernel algorithm usually requires at

least several $n \times n$ matrices to be stored in memory, which needs a large amount

of memory space when n is large. Since the size of the real world data is commonly

huge, if we want to apply the kernel methods to such data, we need to find some

way to reduce the computational cost in order to make them work efficiently in practice.

One possible solution is to define some sparse learning techniques. For example, the

learning model could be represented by a smaller portion of the data, i.e., by obtaining a rep-

resentative subset of the data during learning. This could be extremely useful when

high redundancy exists in the data. We can also explore how our model selection

approaches can be adapted to sparse learning techniques.

Another problem is model selection in other machine learning applications. Thus

far, we only consider classification and regression. There are many other useful ap-

plications such as data clustering, manifold learning, ranking, etc. Since the goals in these applications are generally different from those in classification or regression,

different model selection methods are needed for each specific application domain.

BIBLIOGRAPHY

[1] S. Abe. Training of support vector machines with mahalanobis kernels. In Proc.
International Conference on Artificial Neural Networks, pages 571576, 2005.

[2] FG-NET aging database. http://www.fgnet.rsunit.com/.

[3] E. E. Andersen and A. D. Andersen. The mosek interior point optimizer for
linear programming: An implementation of the homogeneous algorithm. High
Performance Optimization, pages 197232, 2002.

[4] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning,


conic duality, and the SMO algorithm. In Proc. International Conference on
Machine Learning, pages 4148, 2004.

[5] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel


approach. Neural Computation, 12(10):2385-2404, 2000.

[6] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University


Press, 1995.

[7] C. L. Blake and C. J. Merz. UCI repository of machine


learning databases. University of California, Irvine, http
://www.ics.uci.edu/mlearn/MLRepository.html, 1998.

[8] L. Bregman. The relaxation method of finding the common point of convex sets
and its application to the solution of problems in convex programming. USSR
Comp. Mathematics and Mathematical Physics, 7:200217, 1967.

[9] L. Breiman, W. Meisel, and E. Purcell. Variable kernel estimate of multivariate


densities. Technometrics, 19:135144, 1977.

[10] A. B. Chan and N. Vasconcelos. Probabilistic kernels for the classification of


auto-regressive visual processes. In Proc. IEEE Conference on Computer Vision
and Pattern Recognition, pages 846851, 2005.

[11] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple
parameters for support vector machines. Machine Learning, 46(1-3):131159,
2002.

[12] B. Chen, L. Yuan, H. Liu, and Z. Bao. Kernel subclass discriminant analysis.
Neurocomputing, 2007.

[13] C. Cortes, M. Mohri, and A. Rostamizadeh. L2-regularization for learning ker-


nels. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial
Intelligence, 2009.

[14] C. Cortes, M. Mohri, and A. Rostamizadeh. Learning non-linear combinations


of kernels. In Advances in Neural Information Processing Systems, 2009.

[15] K. Crammer, J. Keshet, and Y. Singer. Kernel design using boosting. In


Advances in Neural Information Processing Systems, pages 537-544, 2003.

[16] N. Cristianini, C. Campbell, and J. Shawe-Taylor. Dynamically adapting ker-


nels in support vector machines. In Advances in neural information processing
systems II, pages 204 210, 1998.

[17] N. Cristianini, A. Elisseeff, and J. Shawe-Taylor. On optimizing kernel align-


ment. In Advances in Neural Information Processing Systems 14, 2002.

[18] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On kernel target


alignment. In Proc. Advances in Neural Information Processing Systems, pages 367-373, 2001.

[19] F. De la Torre and O. Vinyals. Learning kernel expansions for image classifica-
tion. In Proc. IEEE Conference on Computer Vision and Pattern Recognition,
pages 17, 2007.

[20] J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal
Machine Learning Research, 7:130, 2006.

[21] J. Dennis and R. Schnabel. Numerical Methods for Unconstrained Optimization


and Nonlinear Equations. Englewood Cliffs, NJ: Prentice-Hall, 1983.

[22] A. Desai, H. singh, and V. Pudi. Gear: Generic, efficient, accurate knn-based
regression. In Intl Conf on Knowledge Discovery and Information Retrieval,
2010.

[23] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods


for the classification of tumors in gene expression data. Technical Report 576,
University of California Berkeley, Dept. of Statistics, 2000.

[24] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression.
Annals of Statistics, 32(2):407499, 2004.

[25] A. Elgammal, R. Duraiswami, and L. S. Davis. Efficient kernel density esti-


mation using the fast gauss transform with applications to color modeling and
tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence,
25(11):14991504, 2003.

[26] G. Fan and J. Gray. Regression tree analysis using target. Journal of Compu-
tational and Graphical Statistics, 14(1):113, 2005.

[27] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals


of Eugenics, 7, 1936.

[28] R. A. Fisher. The statistical utilization of multiple measurements. Annals of


Eugenics, 8:376386, 1938.

[29] Data for Evaluating Learning in Valid Experiments (DELVE).


http://www.cs.toronto.edn/ delve/. university of toronto, toronto, ontario,
canada.

[30] J. H. Friedman. Regularized discriminant analysis. Journal of the American


Statistical Association, 84:165175, 1989.

[31] K. Fukunaga and J.M. Mantock. Nonparametric discriminant analysis. IEEE


Trans. Pattern Analysis and Machine Intelligence, 5:671678, 1983.

[32] Keinosuke Fukunaga. Introduction to statistical pattern recognition (2nd ed.).


Academic Press, San Diego, CA, 1990.

[33] A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In Un-


certainty in Artificial Intelligence, pages 148155, 1998.

[34] T. Glasmachers and C. Igel. Maximum likelihood model selection for 1-norm
soft margin SVMs with multiple parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1522-1528, 2010.

[35] C. Gold and P. Sollich. Model selection for support vector machine classification.
Neurocomputing, 55:221249, 2003.

[36] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine


Learning. Kluwer Academic Publishers, Boston, MA, 1989.

[37] G. H. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a


method for choosing a good ridge parameter. Technometrics, 21(2):215223,
1979.

[38] M. Gonen and E. Alpaydin. Localized multiple kernel learning. In Proc. Inter-
national Conference on Machine Learning, 2008.
[39] Y. Y. Haimes, L. S. Lasdon, and D. A. Wismer. On a bicriterion formulation
of the problems of integrated system identification and system optimization.
IEEE Transactions on Systems, Man, and Cybernetics, pages 296297, 1971.
[40] O. C. Hamsici and A. M. Martinez. Bayes optimality in linear discriminant
analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence,
30:647657, 2008.
[41] O. C. Hamsici and A. M. Martinez. Rotation invariant kernels and their appli-
cation to shape analysis. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2009.
[42] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.
Springer-Verlag (2nd Edition), New York, NY, 2001.
[43] X. He and P. Niyogi. Locality preserving projections. In Proc. Advances in
Neural Information Processing Systems 16, 2004.
[44] L. Holmstrom and P. Koistinen. Using additive noise in back-propagation train-
ing. IEEE Transactions on Neural Networks, 3(1):2438, 1992.
[45] E. Hu, S. Chen, D. Zhang, and X. Yin. Semisupervised kernel matrix learning by kernel propagation. IEEE Transactions on Neural Networks, 21(11):1831-1841, 2010.
[46] T. Jaakkola, M. Diekhans, and D. Haussler. Using the fisher kernel method to
detect remote protein homologies. In Proc. Internation Conference on Intelli-
gent Systems for Molecular Biology, pages 149158, 1999.
[47] N. Karmarkar. A new polynomial time algorithm for linear programming. Com-
binatorica, 4(4):373395, 1984.
[48] V. Katkovnik and I. Shmulevich. Kernel density estimation with varying win-
dow size. Pattern Recognition Letters, 23:16411648, 2002.
[49] S.J. Kim, A. Magnani, and S. Boyd. Optimal kernel selection in kernel fisher
discriminant analysis. In Int. Conf. Machine Learning, pages 465472, 2006.
[50] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. lp norm multiple kernel
learning. Journal of Machine Learning Research, 12:953997, 2011.
[51] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan.
Learning the kernel matrix with semidefinite programming. Journal of Machine
Learning Research, 5:2772, 2004.

[52] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning
applied to document recognition. Proceedings of IEEE, 92(11):22782324, 1998.

[53] B. Leibe and B. Schiele. Analyzing appearance and contour based methods
for object categorization. In Proc. IEEE Conference on Computer Vision and
Pattern Recognition, 2003.

[54] J. Liu, J. Chen, S. Chen, and J. Ye. Learning the optimal neighborhood kernel
for classification. In International Joint Conference on Artificial Intelligence,
Pasadena, California, 2009.

[55] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text


classification using string kernels. Journal of Machine Learning Research, 2:419
444, 2002.

[56] D. Loftsgaarden and C. Quesenberry. A nonparametric estimate of a multi-


variate density function. Annals of Mathimatical Statistics, 36(3):10491051,
1965.

[57] M. Loog and R. P. W. Duin. Linear dimensionality reduction via a heteroscedas-


tic extension of lda: The chernoff criterion. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 26(6):732739, 2007.

[58] M. Loog, R. P. W. Duin, and R. Haeb-Umbach. Multiclass linear dimension


reduction by weighted pairwise fisher criteria. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 23(7):762766, 2001.

[59] J. R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications


in Statistics and Econometrics, 2nd Edition. John Wiley and Sons, 1999.

[60] A. M. Martinez. Recognizing imprecisely localized, partially occluded and ex-


pression variant faces from a single sample per class. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 24(6):748763, 2002.

[61] A. M. Martinez and R. Benavente. The AR Face Database. CVC Technical


Report No. 24, June, 1998.

[62] A. M. Martinez and O. C. Hamsici. Who is LB1? discriminant analysis for the
classification of specimens. Pattern Rec., 41:34363441, 2008.

[63] A. M. Martinez and M. Zhu. Where are linear feature extraction methods
applicable? IEEE Transactions on Pattern Analysis and Machine Intelligence,
27(12):19341944, 2005.

[64] B. McClintock. The origin and behavior of mutable loci in maize. In Proceedings
of the National Academy of Sciences of the USA, volume 36, pages 344355,
1950.
[65] G. McLachlan and K. Basford. Mixture Models: Inference and applications to
clustering. Marcel Dekker, 1988.
[66] K. Miettinen. Nonlinear Multiobjective Optimization, volume 12 of Interna-
tional Series in Operations Research and Management Science. Kluwer Aca-
demic Publishers, Dordrecht, 1999.
[67] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. Muller. Fisher discriminant
analysis with kernels. In Proc. IEEE Neural Networks for Signal Processing
Workshop, pages 4148, 1999.
[68] M. Mitchell. An Introduction to Genetic Algorithms. MIT Press, 1996.
[69] P. J. Moreno, P. P. Ho, and N. Vasconcelos. A kullback-leibler divergence based
kernel for svm classification in multimedia applications. In Advances in Neural
Information Processing Systems, 2003.
[70] E. A. Nadaraya. On estimating regression. Theory of Probability and its Appli-
cations, 9:141142, 1964.
[71] M. H. Nguyen and F. De la Torre. Robust kernel principal component analysis.
In Advances in Neural Information Processing Systems, 2008.
[72] F. Odone, A. Barla, and A. Verri. Building kernels from binary strings for image
matching. IEEE Transactions on Image Processing, 14(2):169180, 2005.
[73] E. Parzen. On estimation of a probability density function and mode. Annals
of Mathematical Statistics, 33:10651076, 1962.
[74] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for
predictivity in learning theory. Nature, 428:419422, 2004.
[75] O. Pujol and D. Masip. Geometry-based ensembles: Towards a structural char-
acterization of the classification boundary. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 31(6):11401146, 2009.
[76] S. Qiu and T. Lane. A framework for multiple kernel support vector regression
and its applications to sirna efficacy prediction. IEEE/ACM Transactions on
Computational Biology and Bioinformatics, 6(2):190199, 2009.
[77] Y. Radhika and M. Shashi. Atmospheric temperature prediction using support
vector machines. International Journal of Computer Theory and Engineering,
1(1):5558, 2009.

[78] C. R. Rao. The utilization of multiple measurements in problems of biological
classification. J. Royal Statistical Soc., B, 10:159203, 1948.

[79] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. the
MIT Press, 2006.

[80] C. E. Rasmussen and Z. Ghahramani. Occam's razor. In Advances in Neural


Information Processing Systems 13, 2001.

[81] P. Russell. Genetics. Addison- Wesley, 1998.

[82] B. Scholkopf. The kernel trick for distances. In Advances in Neural Information
Processing Systems, pages 301307, 2000.

[83] Bernhard Scholkopf, Alexander Smola, and Klaus-Robert Muller. Nonlinear


component analysis as a kernel eigenvalue problem. Neural Compututation,
10(5):12991319, 1998.

[84] Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support
Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[85] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cam-
bridge University Press, 2004.

[86] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression
(PIE) database. In Proceedings of the 5th IEEE International Conference on
Face and Gesture Recognition, 2002.

[87] S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf. Large scale multiple


kernel learning. Journal of Machine Learning Research, 7:15311565, 2006.

[88] M. Stone. Cross-validatory choice and assessment of statistical predictions (with


discussion). Journal of the Royal Statistical Society, Series B, 36:111147, 1974.

[89] G. Terrell and D. Scott. Variable kernel density estimation. The Annals of
Statistics, 20(3):12361265, 1992.

[90] C. M. Theobald. An inequality for the trace of the product of two symmetric
matrices. Proceedings of the Cambridge Philosophical Society, 77:256267, 1975.

[91] M. E. Tipping. Sparse bayesian learning and the relevance vector machine.
Journal of Machine Learning Research, (1):211244, 2001.

[92] V. Vapnik. The Nature of Statistical Learning Theory. New York: Springer,
1995.

[93] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.

[94] V. Vapnik, S. Golowich, and A. Smola. Support vector method for function
approximation, regression estimation, and signal processing. In M. Mozer, M.
Jordan, and T. Petsche, editors, Advances in Neural Information Processing
Systems 9, The MIT Press, Cambridge, MA, 1996.

[95] M. Varma and B. R. Babu. More generality in efficient multiple kernel learning.
In Proc. International Conference on Machine Learning, pages 465–472, 2009.

[96] G. Wahba. Spline Models for Observational Data. Society for Industrial and
Applied Mathematics, 1990.

[97] J. Wang, H. P. Lu, K. N. Plataniotis, and J. W. Lu. Gaussian kernel optimization
for pattern classification. Pattern Recognition, 42(7):1237–1247, 2009.

[98] L. Wang, K. L. Chan, P. Xue, and L. P. Zhou. A kernel-induced space selection
approach to model selection in KLDA. IEEE Transactions on Neural Networks,
19:2116–2131, 2008.

[99] S. Wang, W. Zhu, and Z. Liang. Shape deformation: SVM regression and
application to medical image segmentation. In Proceedings of the International
Conference on Computer Vision, 2001.

[100] Y. Wang. A New Approach to Fitting Linear Models in High Dimensional
Spaces. PhD dissertation, University of Waikato, 2000.

[101] Z. Wang, S. C. Chen, and T. K. Sun. MultiK-MHKS: A novel multiple kernel
learning algorithm. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 30(2):348–353, 2008.

[102] Cambridge weather database. http://www.cl.cam.ac.uk/research/dtg/weather/.
University of Cambridge.

[103] K. Q. Weinberger and G. Tesauro. Metric learning for kernel regression. In
Eleventh International Conference on Artificial Intelligence and Statistics,
Omnipress, 2007.

[104] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression.
In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in
Neural Information Processing Systems 8, pages 514–520, The MIT Press,
Cambridge, MA, 1996.

[105] L. Wolf and A. Shashua. Learning over sets using kernel principal angles.
Journal of Machine Learning Research, 4:913–931, 2003.

[106] G. Wu and E. Chang. Adaptive feature-space conformal transformation for
imbalanced-data learning. In Proc. International Conference on Machine
Learning, pages 816–823, 2003.

[107] S. Wu and S. Amari. Conformal transformation of kernel functions: A
data-dependent way to improve support vector machine classifiers. Neural
Processing Letters, 15:59–67, 2002.

[108] H. Xiong, M. N. S. Swamy, and M. O. Ahmad. Optimizing the kernel in the
empirical feature space. IEEE Transactions on Neural Networks, 16(2):460–474,
2005.

[109] J. Yang, A. F. Frangi, J.-Y. Yang, D. Zhang, and Z. Jin. KPCA plus LDA:
A complete kernel Fisher discriminant framework for feature extraction and
recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
27(2):230–244, 2005.

[110] M.-H. Yang. Kernel eigenfaces vs. kernel Fisherfaces: Face recognition using
kernel methods. In Proc. IEEE International Conference on Automatic Face
and Gesture Recognition, 2002.

[111] J. Ye, S. Ji, and J. Chen. Multi-class discriminant kernel learning via convex
programming. Journal of Machine Learning Research, 9:719–758, 2008.

[112] D. Yeung, H. Chang, and G. Dai. Learning the kernel matrix by maximizing a
KFD-based class separability criterion. Pattern Recognition, 40:2021–2028, 2007.

[113] D. You, O. C. Hamsici, and A. M. Martinez. Kernel optimization in discriminant
analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence,
33(3):631–638, 2011.

[114] D. You and A. M. Martinez. Bayes optimal kernel discriminant analysis. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 3533–3538, 2010.

[115] S. Zhou, B. Georgescu, X. Zhou, and D. Comaniciu. Image based regression
using boosting method. In Proceedings of the Tenth IEEE International
Conference on Computer Vision, 2005.

[116] M. Zhu and A. M. Martinez. Subclass discriminant analysis. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 28(8):1274–1286, 2006.

[117] M. Zhu and A. M. Martinez. Pruning noisy bases in discriminant analysis.
IEEE Transactions on Neural Networks, 19(1):148–157, 2008.

