ADVANCED TOPICS
IN SCIENCE AND TECHNOLOGY IN CHINA
Zhejiang University is one of the leading universities in China. In Advanced
Topics in Science and Technology in China, Zhejiang University Press and
Springer jointly publish monographs by Chinese scholars and professors, as
well as invited authors and editors from abroad who are outstanding experts
and scholars in their fields. This series will be of interest to researchers, lecturers, and graduate students alike.
Advanced Topics in Science and Technology in China aims to present the
latest and most cutting-edge theories, techniques, and methodologies in various research areas in China. It covers all disciplines in the fields of natural
science and technology, including but not limited to computer science, materials science, life sciences, engineering, environmental sciences, mathematics,
and physics.
Kaizhu Huang
Haiqin Yang
Irwin King
Michael Lyu
Machine Learning
Modeling Data Locally and Globally
With 53 figures
AUTHORS:
Dr. Kaizhu Huang,
Dept. of CSE,
Chinese Univ. of Hong Kong,
Shatin. N.T. HK,
China
Email: kzhuang@cse.cuhk.edu.hk
Preface
Kaizhu Huang
Haiqin Yang
Irwin King
Michael R. Lyu
Contents
1 Introduction
1.1 Learning and Global Modeling
1.2 Learning and Local Modeling
1.3 Hybrid Learning
1.4 Major Contributions
1.5 Scope
1.6 Book Organization
1
Introduction
the phenomena are usually nondeterministic. This motivates the use of probabilistic or statistical models to perform a global investigation on data sampled
from the phenomena. A common way of achieving this goal is to fit a density
on the observations of data. With the learned density, one can then incorporate prior knowledge, make predictions, and perform inference and
marginalization. One main category in the framework of global learning is
the so-called generative learning. By assuming a specific mathematical model
on the observations of data, e.g. a Gaussian distribution, the phenomena can
be described or re-generated. Fig. 1.1 illustrates such an example.
In this figure, two classes of data are plotted with a different marker for each
class. The data can thus be modeled as two different mixtures of Gaussian distributions, as illustrated in Fig. 1.2. Knowing only the
parameters of these distributions, one can then summarize the phenomena.
Furthermore, one can employ this information to distinguish one class
of data from the other, or simply to know how to separate the two classes. This
is well known as the Bayes optimal decision problem [12, 6].
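As a rough illustration of the generative approach, the sketch below fits one Gaussian per class (a simplification of the Gaussian mixtures mentioned above) and assigns a point to the class with the larger prior-weighted density. The synthetic data, the function names, and the equal priors are assumptions made for this example only.

```python
import numpy as np

def fit_gaussian(samples):
    """Maximum-likelihood mean and covariance of one class."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False, bias=True)
    return mean, cov

def log_gaussian(z, mean, cov):
    """Log-density of a multivariate Gaussian evaluated at z."""
    diff = z - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(mean) * np.log(2.0 * np.pi) + logdet + maha)

def classify(z, models, priors):
    """Return the index of the class with the largest log-prior plus log-likelihood."""
    scores = [np.log(p) + log_gaussian(z, m, c)
              for (m, c), p in zip(models, priors)]
    return int(np.argmax(scores))

# Hypothetical two-class data: well-separated Gaussian clouds.
rng = np.random.default_rng(0)
class0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
class1 = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(200, 2))
models = [fit_gaussian(class0), fit_gaussian(class1)]
priors = [0.5, 0.5]  # assumed equal class priors
```

Once the two means and covariances are stored, the training points are no longer needed: the parameters alone summarize each class, which is the sense in which the model is global.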
In the development of learning approaches within the machine learning community, there has been a migration from the early rule-based methods [11, 32], which require more involvement of domain experts, to widely used
probabilistic global models driven mainly by the data itself [5, 9, 14, 17, 22, 33].
However, one question for most probabilistic global models is what kind of
global model, or more specifically, which type of density, should be specified beforehand for summarizing the phenomena. For some tasks, this can be
prescribed by introducing a small amount of domain knowledge from experts. Unfortunately, due to both the increasing sophistication of real-world learning
tasks and the active interactions among different subjects of research, it is more
and more difficult to obtain fast and valuable suggestions from experts. A further question thus arises: what is the next step for the machine learning community
after the migration from rule-based models
to probabilistic global models? Recent progress in machine learning seems to
suggest local learning as a solution.
the classification purpose. Fig. 1.3 illustrates such a problem. In this figure,
the decision boundary is constructed based only on the filled points, while
the other points make no contribution to the classification plane (the decision
boundary is given by the Gabriel Graph method [1, 18, 34]).
However, although it shows promising performance, local learning appears to sit at the opposite extreme from global learning. Employing
only local information may lose the global view of the data. Consequently, it sometimes cannot grasp the data trend, which is critical for guaranteeing better
performance on future data. This can be seen in the example illustrated
in Fig. 1.4. In this figure, the decision boundary (also constructed by the
Gabriel Graph classification) is still determined by some local points, indicated as filled points. Clearly, this boundary does not grasp the data trend.
Fig. 1.4. An illustration that local learning cannot grasp the data trend.
The decision boundary (constructed by the Gabriel Graph classification)
is determined by some local points, indicated as filled points. It, however,
loses the data trend. The decision plane should obviously be closer to the
filled squares rather than lying midway between the filled points of the two classes
Fig. 1.5. The relationship among the developed models in this book
1.5 Scope
This book first takes learning to mean statistical learning, which
appears to be the current main trend among learning approaches. We then further
restrict learning to the framework of classification, one of the main problems in machine learning. The corresponding discussions of different models,
including the analysis of the computational and statistical aspects
of machine learning, are all set in the context of classification tasks. Nevertheless,
we will also extend the content of this book to regression problems, although
regression is not the focus of this book.
Chapter 6
A novel regression model called the Local Support Vector Regression (LSVR),
which can be regarded as an extension of the Maxi-Min Margin Machine, will be introduced in detail in this chapter. We will show that our
model can vary the tube (margin) systematically and automatically according to the local data trend. We will show that this novel regression
model is more robust with respect to noise in the data. Empirical evaluations on both synthetic data and real financial time series data will
be presented to demonstrate the merits of our model with respect to the
standard Support Vector Regression.
Chapter 7
In this chapter, we show how to adapt the margin settings locally for
the Support Vector Regression in a way different from the LSVR. We demonstrate how the local view of data can be widely used in various models
or even applied differently in the same model. Empirical evaluations are
also presented in comparison with other competitive models on financial
data.
Chapter 8
We will then summarize this book and discuss future
work.
We try to make each of these chapters self-contained. Therefore, in several
chapters, some critical contents, e.g. model definitions or illustrative figures,
that have appeared in previous chapters may be briefly reiterated.
References
1. Barber CB, Dobkin DP, Huhdanpaa H (1996) The quickhull algorithm for convex
hulls. ACM Transactions on Mathematical Software 22(4):469–483
2. Baum LE, Eagon JA (1967) An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for
ecology. Bull. Amer. Math. Soc. 73:360–363
3. Bozdogan H (2004) Statistical Data Mining and Knowledge Discovery. Boca
Raton, Fla.: Chapman & Hall/CRC
4. Burges CJC (1998) A tutorial on support vector machines for
pattern recognition. Data Mining and Knowledge Discovery 2(2):121–167
5. Chow CK, Liu CN (1968) Approximating discrete probability distributions
with dependence trees. IEEE Trans. on Information Theory 14:462–467
6. Duda R, Hart P (1973) Pattern Classification and Scene Analysis. New York,
NY: John Wiley & Sons
7. Forsyth DA, Ponce J (2003) Computer Vision: A Modern Approach. Upper
Saddle River, N.J.: Prentice Hall
8. Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers.
Machine Learning 29:131–161
9. Fukunaga K (1990) Introduction to Statistical Pattern Recognition. San Diego,
Academic Press, 2nd edition
34. Zhang W, King I (2002) A study of the relationship between support vector
machine and Gabriel Graph. In Proceedings of IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks
2
Global Learning vs. Local Learning
In this chapter, we conduct a more detailed and more formal review of two
different schools of learning approaches, namely, global learning and local
learning. We first provide a hierarchy graph, illustrated in Fig. 2.1, in which
we try to classify many statistical models into their proper categories, either
global learning or local learning. Our review will follow
this hierarchy structure. For clarity, we use filled shapes to highlight
our own work in the graph.
Global learning fits a distribution over data. If a specific mathematical
model, e.g. a Gaussian model, is assumed for the distribution, this is often
called generative learning, whose name implies that the mathematical formulation of the assumed model governs the generation of data in the learning
task. To learn the parameters of the specific model from the observations of data,
several schemes have been proposed. These include Maximum Likelihood (ML) learning, which is easy to conduct but less accurate; Conditional
Likelihood (CL) learning, whose optimization is usually hard to perform but which
is more effective; and Bayesian Average (BA) learning, which has a comparatively short history but is more promising. As generative learning pre-assigns
a specific model before learning, it often lacks generality and thus may
be invalid in many cases. This motivates non-parametric learning,
which still estimates a distribution on data but assumes no specific mathematical generative model. The common way in this type of learning is to
fit a simple density locally over each observation and then sum all the local
densities as the final distribution for the data. Although this approach is successful
in some circumstances, it is criticized for requiring a huge quantity of
training points and incurring a large space complexity. In contrast, in this
book, we will demonstrate a novel global learning method, named the Minimum
Error Minimax Probability Machine (MEMPM). Although still in the framework of global learning, it does not belong to non-parametric learning and therefore requires no extremely heavy storage. Moreover, it does not
assume any specific distribution on data, which hence distinguishes itself
resources and is widely argued to be less direct. This motivates
local learning, which makes no attempt to model the data globally, but focuses on
extracting only the information directly related to the task. This type of
learning is often referred to as discriminative learning in the context of classification. One famous model of this type is the Support Vector Machine (SVM).
Being task-oriented, robust, and computationally tractable, SVM
has achieved great success and is considered the current state-of-the-art classifier. Although local learning demonstrates superior performance to
traditional global learning, it appears to sit at the opposite extreme,
totally discarding useful global information, e.g. the structure
information of data.
Our suggestion is that we should combine these two different but complementary paradigms. Towards this end, we propose a new model called the
Maxi-Min Margin Machine (M4), which not only successfully employs the
global structure information from data but also holds the merits of local learning,
such as robustness and superior classification accuracy. As a critical contribution, M4, the hybrid learning model, represents a general model that is
shown to contain both local learning models and global learning models as
special cases. More specifically, it contains two significant and popular global
learning models, i.e. Fisher Discriminant Analysis (FDA) [13] and the Minimax
Probability Machine [28, 29, 30], as special cases. Meanwhile, SVM, the local
learning model, can also be considered as one of its branches. In addition,
M4 also demonstrates a strong connection with MEMPM, the novel general
global learning model.
In the following, we first present the problem definition, which will be used
throughout this book. Based on Fig. 2.1, we then provide introductions and
comments for each type of learning model in sequence. Finally, we summarize
the review and conclude with the proposition of the hybrid framework, the
objective of this book.
explicitly, and bold typeface will indicate a vector or matrix, while normal
typeface will refer to a scalar variable or a component of a vector.
ck F
By employing Bayes' theorem, one can transform the above joint probability
(the term inside the integral) into the following equivalent form:

$$p(c_k, \Theta \mid D, z) = \frac{p(c_k, z \mid D, \Theta)\, p(\Theta \mid D)}{\sum_{c_k \in F} \int p(c_k, z \mid D, \Theta)\, p(\Theta \mid D)\, d\Theta} . \tag{2.2}$$

Since the denominator in the above does not influence the decision in
practice, the decision rule of Eq.(2.1) can be written in a form that is easier to calculate:

$$c = \arg\max_{c_k \in F} \int p(c_k, z \mid D, \Theta)\, p(\Theta \mid D)\, d\Theta . \tag{2.3}$$
(2.4)
ck F
In the above, how the parameters $\Theta$ are estimated distinguishes MAP from ML.
In MAP, the parameters are estimated as:

$$\Theta_{\mathrm{MAP}} = \arg\max_{\Theta}\, p(\Theta \mid D) , \tag{2.5}$$
(2.6)
Observing Eq.(2.3), one can see that MAP actually approximates the
conditional distribution over the parameters by a delta function located
at the most prominent $\Theta$. Namely,

$$p(\Theta \mid D) = \begin{cases} 1, & \text{if } \Theta = \arg\max_{\Theta} p(\Theta \mid D) \\ 0, & \text{otherwise} \end{cases} . \tag{2.7}$$

For ML, it is even simpler. This can be observed by looking into the
relationship between MAP and ML:

$$\arg\max_{\Theta}\, p(\Theta \mid D) = \arg\max_{\Theta}\, p(D \mid \Theta)\, p(\Theta) . \tag{2.8}$$
Thus, compared to MAP, ML omits the term $p(\Theta)$, the prior probability
over the parameters. In practice, a model with a more complex structure
is more likely to cause over-fitting, which means that the model fits
the training data perfectly while predicting poorly on test
or future data. In this sense, by discarding the prior probability, ML lacks the
flexibility to favor simple models through the prior probability [5, 49].
On the other hand, MAP permits regularization through the prior probability
and thus has the potential to resist over-fitting.
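A small worked example may make the difference concrete. The sketch below estimates the success probability of a coin from a handful of tosses, once by ML and once by MAP. The Bernoulli model, the Beta(2, 2) prior, and the function names are assumptions chosen for illustration; they do not come from the text.

```python
def ml_estimate(successes, trials):
    """ML estimate of a Bernoulli parameter: the observed frequency."""
    return successes / trials

def map_estimate(successes, trials, a=2.0, b=2.0):
    """MAP estimate under an assumed Beta(a, b) prior (requires a, b > 1).

    The posterior is Beta(a + successes, b + trials - successes);
    its mode is returned.
    """
    return (successes + a - 1.0) / (trials + a + b - 2.0)

# Three successes in three trials: ML commits fully to the training data,
# while the prior pulls the MAP estimate back toward 0.5.
theta_ml = ml_estimate(3, 3)
theta_map = map_estimate(3, 3)
```

With only three observations, ML concludes that failure is impossible, which is the kind of over-fitting the prior term is meant to temper. As the number of trials grows, the two estimates converge.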
When applied in practice, under the condition of independent and identically distributed
(i.i.d.) data, rather than directly optimizing the original form, ML
estimation usually maximizes the log-likelihood, which
transforms the product into an additive form that is easier to solve:

$$\Theta_{\mathrm{ML}} = \arg\max_{\Theta}\, p(D \mid \Theta) = \arg\max_{\Theta}\, \log p(D \mid \Theta) = \arg\max_{\Theta} \sum_{j=1}^{N} \log p(z^j, c^j \mid \Theta)$$
(2.10)
(2.11)
Following this trend, many models have been proposed. Among them are the Bayes
Point Machine [18, 36, 44] and Maximum Entropy Estimation [22]. The Bayes
Point Machine restricts the averaging of the parameters to the version space,
which denotes the space where the training data can be perfectly classified.
This method is reported to have a better generalization ability
within the global learning framework. However, it has been criticized for lacking a systematic
way to extend its application to non-separable datasets, where the version
space may include no candidate solutions. Maximum Entropy Estimation, on
the other hand, seems to provide a more flexible and more systematic scheme
for averaging models. By trying to maximize an entropy-like
objective, Maximum Entropy Estimation demonstrates some characteristics
of both global learning and local learning. However, only two small datasets
are used to evaluate its performance. Moreover, the prior, usually unknown,
plays an important role in this model, but has to be assumed beforehand.
2.2.2 Non-parametric Learning
In contrast with the generative learning discussed above, non-parametric
learning does not assume any specific global model before learning. Therefore, it takes no risk of making wrong assumptions on the data. Consequently, non-parametric learning appears to set a more valid foundation
than generative learning models. Typical non-parametric learning models in
the context of classification include Parzen Window estimation [10] and
the widely used k-Nearest-Neighbor model [7, 43]. We will discuss these two
models in the following.
Parzen Window estimation also attempts to estimate a density over
the training data. However, it does so in a totally different way. Parzen Window
first defines an n-dimensional hypercube cell region $R_N$ over each observation.
By defining a window function:

$$w(u) = \begin{cases} 1, & |u_j| \le 1/2, \quad j = 1, 2, \ldots, n \\ 0, & \text{otherwise} \end{cases} , \tag{2.12}$$

the density is then estimated as:

$$p_N(z) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h_N}\, w\!\left(\frac{z - z_i}{h_N}\right) , \tag{2.13}$$
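The estimator can be sketched in a few lines. The version below is an illustration, not the book's implementation: it assumes the normalizing factor is the cell volume $h^n$ for an edge length $h$, which reduces to the $1/h_N$ factor of Eq.(2.13) in one dimension. The function names are hypothetical.

```python
import numpy as np

def window(u):
    """Hypercube window of Eq.(2.12): 1 if every |u_j| <= 1/2, else 0 (per row)."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_density(z, samples, h):
    """Parzen Window estimate at z from an (N, n) array of samples.

    h is the edge length of the hypercube cell. The cell volume h**n is an
    assumed normalizer; it equals h in the one-dimensional case.
    """
    samples = np.atleast_2d(samples)
    n_points, dim = samples.shape
    volume = h ** dim
    return window((z - samples) / h).sum() / (n_points * volume)

# Three one-dimensional observations; two of them fall in the unit cell around 0.
observations = np.array([[0.0], [0.1], [2.0]])
density_at_zero = parzen_density(np.array([0.0]), observations, h=1.0)
```

Note that every call touches all stored observations, which previews the storage cost of non-parametric estimates.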
density for data. More specifically, the cell volume $V_N$ is designed as follows:
let the cell volume be a function of the training data, by centering a cell
around each point $z^j$ and increasing the volume until $k_N$ samples are contained, where $k_N$ depends on $N$. The local density for each observation is
then defined as

$$p_N(z^j) = \frac{k_N / N}{V_N} . \tag{2.14}$$

When used for classification, the prediction is given by the class with the
maximum posterior probability, i.e.

$$c = \arg\max_{c_i \in F}\, p_N(c_i \mid z) . \tag{2.15}$$
(2.16)
iF
Therefore, the prediction result is just the class with the maximum fraction
of the samples in a cell.
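The resulting rule, predicting the class with the largest fraction of samples in the cell, is the familiar majority vote among the k nearest neighbors. A minimal sketch follows, assuming Euclidean distance and a small hypothetical dataset.

```python
from collections import Counter

import numpy as np

def knn_predict(z, samples, labels, k):
    """Predict the class holding the largest share of the k nearest samples."""
    distances = np.linalg.norm(samples - z, axis=1)  # assumed Euclidean metric
    nearest = np.argsort(distances)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: three points of class 0, two of class 1.
train_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                         [5.0, 5.0], [5.0, 6.0]])
train_labels = [0, 0, 0, 1, 1]
```

All training points must be kept in memory to answer a query, since the data themselves act as the model.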
These non-parametric methods make no underlying assumptions on the data
and appear to be more general in real cases. However, using no parameters
actually means using many parameters, so that no single parameter
dominates the others (in the discussed models, the data points can
in fact be considered as the parameters). In this way, if one parameter
fails to work, it will not influence the whole system globally and statistically.
However, using many parameters also results in serious problems. One of
the main problems is that the density is overwhelmingly dependent on the
training samples. Therefore, to generate an accurate density, the number of
samples needs to be very large (much larger than would be required if we performed the estimation by generative learning approaches). What is even worse
is that the number of samples required increases exponentially with the
dimension of the data. Another disadvantage is the severe storage requirement,
since all the samples need to be saved beforehand in order to
predict new data.
2.2.3 The Minimum Error Minimax Probability Machine
Within the context of global learning, there seems to be a dilemma: if we assume
a specific model as in generative learning, we lose generality; if we
instead use non-parametric learning, it is impractical for high-dimensional data.
A question then arises: can we have an approach that neither
requires a large number of training samples, thus reducing complexity, nor
assumes a specific model, thus maintaining generality? Towards this
end, we propose the Minimum Error Minimax Probability Machine (MEMPM)
in this book.
Unlike generative learning or non-parametric learning, the Minimum Error
Minimax Probability Machine does not try to estimate a distribution over
data. Instead, it attempts to extract reliable global information from data and
estimates parameters for maximizing the minimal probability that a future
data point will fall into the correct class. More precisely, rather than seeking to
find an accurate distribution, MEMPM focuses on studying the worst-case
probability (which is relatively robust) to predict data. In terms of its style
of making decisions, MEMPM is more like a local learning method due to
its direct optimization for classification and its task-oriented character.
However, because MEMPM only summarizes global information from data
(not a distribution), we still place it in the framework of global
learning.
The proposed MEMPM has many appealing features. Firstly, it represents a distribution-free Bayes optimal classifier in the worst-case scenario.
MEMPM achieves a balance in this way: no specific model is
assumed on the data, since it is distribution-free. At the same time, although in
the worst-case scenario, it is also the Bayes optimal classifier, which is
originally applicable only in cases with a known distribution. Another critical
feature of MEMPM is that, under a mild condition, it has an explicit
generalization bound. Furthermore, by exploring the bound, the recently proposed promising model, the Minimax Probability Machine, is clearly demonstrated to be its special case. Importantly, by specifying a bound for
one class of data, a Biased Minimax Probability Machine (BMPM) is branched out
from MEMPM, which will be shown to provide a rigorous and systematic
treatment of biased classification. We will detail the MEMPM model and
the BMPM model in the next chapter.
where $l(z, c, \Theta)$ is the loss function. A problem similar to that in global
learning occurs, since $p(z, c)$ is generally unknown. Therefore, in practice, the above
expected risk is often approximated by the so-called empirical risk:

$$R_{emp}(\Theta) = \frac{1}{N} \sum_{j=1}^{N} l(z^j, c^j, \Theta) . \tag{2.18}$$

The above loss function describes the extent to which the estimated
class disagrees with the real class for the training data. Various metrics can be
used for defining this loss function, including the 0-1 loss and the quadratic
loss [50].
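The empirical risk is simply an average of per-sample losses, as the following sketch shows. The threshold classifier and the four labeled points are hypothetical and serve only to exercise Eq.(2.18) with the two losses named above.

```python
def zero_one_loss(predicted, actual):
    """0-1 loss: 1 for a wrong prediction, 0 for a correct one."""
    return float(predicted != actual)

def quadratic_loss(predicted, actual):
    """Quadratic loss: squared difference between prediction and target."""
    return float((predicted - actual) ** 2)

def empirical_risk(predict, inputs, classes, loss):
    """Average loss of `predict` over the training pairs, as in Eq.(2.18)."""
    total = sum(loss(predict(z), c) for z, c in zip(inputs, classes))
    return total / len(inputs)

# Hypothetical one-dimensional classifier: the sign of the input.
def sign_classifier(z):
    return 1 if z > 0 else -1

train_inputs = [-2.0, -1.0, 1.0, 3.0]
train_classes = [-1, 1, 1, 1]  # the second point is misclassified
```

Here one of four points is misclassified, so the 0-1 empirical risk is 0.25, while the quadratic loss penalizes the same error by $(−1 − 1)^2 = 4$ and gives a risk of 1.0.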
However, considering only the training data may again lead to the over-fitting
problem. SVM takes one big step in dealing with over-fitting:
the margin between the two classes should be made as large as possible
in order to reduce the over-fitting risk. Fig. 2.3 illustrates the idea of SVM.
Two classes of data, depicted as circles and solid dots, are presented in this
figure. Intuitively, there are many decision hyperplanes that could be
adopted for separating these two classes of data. However, the one plotted in
this figure is selected as the favorable separating plane, because it has the
maximum margin between the two classes. Therefore, in the objective function
of SVM, a regularization term representing the margin shows up. Moreover,
as seen in this figure, only the filled points, called support vectors,
determine the separating plane, while the other points do not contribute to the
margin at all. In other words, only a few local points are critical for the
classification purpose in the framework of SVM and thus should be extracted.
Actually, a more formal explanation and theoretical foundation can be
obtained from the Structural Risk Minimization criterion [6, 52]. Therein,
maximizing the margin between different classes of data amounts to minimizing an
upper bound of the expected risk, i.e. the VC dimension bound [52]. However,
since the focus of this book does not lie in the theory of SVM, we will not
discuss the details further. Interested readers can refer to [51,
52].
References
1. Anand R, Mehrotra KG, Mohan CK, Ranka S (1993) An improved algorithm
for neural network classification of imbalanced training sets. IEEE Transactions
on Neural Networks 4(6):962–969
2. Bahl LR, Brown PF, de Souza PV, Mercer RL (1993) Estimating hidden
Markov model parameters so as to maximize speech recognition accuracy. IEEE
Transactions on Speech and Audio Processing 1:77–82
3. Barber CB, Dobkin DP, Huhdanpaa H (1996) The quickhull algorithm for convex
hulls. ACM Transactions on Mathematical Software 22(4):469–483
4. Beaufays F, Weintraub M, Konig Y (1999) Discriminative mixture weight estimation for large Gaussian mixture models. In Proceedings of the International
Conference on Acoustics, Speech and Signal Processing 337–340
5. Brand M (1998) Structure discovery via entropy minimization. In Neural
Information Processing Systems 11
6. Burges CJC (1998) A tutorial on support vector machines for
pattern recognition. Data Mining and Knowledge Discovery 2(2):121–167
7. Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE
Transactions on Information Theory IT-13(1):21–27
8. Cristianini N, Shawe-Taylor J (2000) An Introduction to Support Vector Machines (and Other Kernel-based Learning Methods). Cambridge, U.K.; New
York, NY: Cambridge University Press
9. Duda R, Hart P (1973) Pattern Classification and Scene Analysis. New York,
NY: John Wiley & Sons
10. Duda RO, Hart PE, Stork DG (2000) Pattern Classification. New York, NY:
John Wiley & Sons
11. Fausett L (1994) Fundamentals of Neural Networks. New York, NY: Prentice
Hall
12. Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers.
Machine Learning 29:131–161
13. Fukunaga K (1990) Introduction to Statistical Pattern Recognition. San Diego,
Academic Press, 2nd edition
14. Gilks WR, Richardson S, Spiegelhalter DJ (1996) Markov Chain Monte Carlo
in Practice. London: Chapman & Hall
15. Grzegorzewski P, Hryniewicz O, Gil M (2002) Soft Methods in Probability,
Statistics and Data Analysis. Heidelberg; New York: Physica-Verlag
36. Minka T (2001) A Family of Algorithms for Approximate Inference. PhD thesis,
Massachusetts Institute of Technology
37. Neal RM (1993) Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University
of Toronto
38. Neal RM (1998) Suppressing random walks in Markov chain Monte Carlo using
ordered overrelaxation. In M. I. Jordan (editor) Learning in Graphical Models,
Dordrecht: Kluwer Academic Publishers 205–225
39. Patterson D (1996) Artificial Neural Networks. Singapore: Prentice Hall
40. Pearl J (1988) Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference. San Francisco, CA: Morgan Kaufmann
41. Pinto RL, Neal RM (2001) Improving Markov chain Monte Carlo estimators
by coupling to an approximating chain. Technical Report No. 0101, Dept. of
Statistics, University of Toronto
42. Rathinavelu C, Deng L (1996) The trended HMM with discriminative training
for phonetic classification. In Proceedings of ICSLP
43. Ripley BD (1996) Pattern Recognition and Neural Networks. Press Syndicate
of the University of Cambridge
44. Ruján P (1997) Perceptron learning by playing billiards. Neural Computation
9:99–122
45. Schölkopf B, Burges C, Smola A (1999) Advances in Kernel Methods: Support
Vector Learning. Cambridge, MA: The MIT Press
46. Schölkopf B, Smola A (2002) Learning with Kernels: Support Vector Machines,
Regularization, Optimization and Beyond. Cambridge, MA: The MIT Press
47. Smola AJ, Bartlett PL, Schölkopf B, Schuurmans D (2000) Advances in Large
Margin Classifiers. Cambridge, MA: The MIT Press
48. Stolcke A, Omohundro S (1993) Hidden Markov model induction by Bayesian
model merging. In NIPS 5:11–18
49. Tipping M (1999) The relevance vector machine. In Advances in Neural Information Processing Systems 12 (NIPS)
50. Trivedi PK (1978) Estimation of a distributed lag model under quadratic loss.
Econometrica 46(5):1181–1192
51. Vapnik VN (1998) Statistical Learning Theory. New York, NY: John Wiley &
Sons
52. Vapnik VN (1999) The Nature of Statistical Learning Theory. New York, NY:
Springer, 2nd edition
53. Woodland P, Povey D (2000) Large scale discriminative training for speech
recognition. In Proceedings of ASR 2000
54. Zhang W, King I (2002) A study of the relationship between support vector
machine and Gabriel Graph. In Proceedings of IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks
3
A General Global Learning Model: MEMPM
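$$\sup_{y \sim (\bar{y}, \Sigma_y)} \Pr\{y \in S\} = \frac{1}{1 + d^2} , \quad \text{with} \quad d^2 = \inf_{y \in S}\, (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}) , \tag{3.1}$$

where the supremum is taken over all distributions for $y$ having mean
$\bar{y}$ and covariance matrix $\Sigma_y$.¹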
The theory makes it possible to assume no model, yet still bound
the probability of misclassifying a point, and consequently to develop a novel
classifier within the framework of global learning. More specifically, one can
design a linear separating plane by replacing $S$ with a half space associated
with this linear plane. Taking the supremum can then be considered as
bounding the misclassification rate for one class of data. In the following, we
first introduce the model definition and then show how this theory can be
applied to derive a distribution-free classifier.

¹ We assume $\Sigma_y$ to be positive definite for simplicity. Otherwise, we can always
add a small positive amount to its diagonal elements to force positive definiteness.
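$$\max_{\alpha,\, \beta,\, w \neq 0,\, b}\ \theta\alpha + (1 - \theta)\beta \quad \text{s.t.} \tag{3.2}$$

$$\inf_{x \sim (\bar{x}, \Sigma_x)} \Pr\{w^T x \ge b\} \ge \alpha , \tag{3.3}$$

$$\inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{w^T y \le b\} \ge \beta , \tag{3.4}$$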
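$$\max_{\alpha,\, w \neq 0,\, b}\ \alpha \quad \text{s.t.}$$

$$\inf_{x \sim (\bar{x}, \Sigma_x)} \Pr\{w^T x \ge b\} \ge \alpha ,$$

$$\inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{w^T y \le b\} \ge \beta_0 ,$$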
² Another interpretation of the difference between MEMPM and MPM can be
stated from the viewpoint of game theory. MPM can be regarded as a non-cooperative competitive game. In this game, each player (class) tries to maximize
its individual benefit, i.e. $\alpha$. The competition leads to each class obtaining the same
benefit when all classes reach a kind of equilibrium. However, in game theory,
many models, e.g. the prisoner's dilemma, the Cournot model, and the tragedy of the
commons [21], have shown that maximizing individual benefit does not lead to
the global optimum. Our model, on the contrary, can be considered as
a kind of cooperative game. It achieves the global optimum through cooperation.
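$$\inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{w^T y \le b\} \ge \beta$$

holds if and only if $b - w^T \bar{y} \ge \kappa(\beta) \sqrt{w^T \Sigma_y w}$ with $\kappa(\beta) = \sqrt{\frac{\beta}{1 - \beta}}$.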
The lemma can be proved according to the Marshall and Olkin Theory
and the Lagrangian Multiplier theory.
3
Although cross validations could be used to provide empirical connections, they
are problem-dependent and are usually slow procedures as well.
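$$\sup_{y \sim (\bar{y}, \Sigma_y)} \Pr\{w^T y \ge b\} = \frac{1}{1 + d^2} , \quad \text{with} \quad d^2 = \inf_{w^T y \ge b}\, (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}) .$$

$$\inf_{w^T y \ge b}\, (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}) = \frac{\max(b - w^T \bar{y},\, 0)^2}{w^T \Sigma_y w} .$$

multiplier:

$$\{g, \lambda\} = \arg\min_{g} \arg\max_{\lambda}\, \{ g^T g + \lambda (q - p^T g) \} ,$$

where the multiplier $\lambda \ge 0$. Therefore, one can get the following equalities:

$$g = \frac{\lambda p}{2} , \qquad q = p^T g .$$

$$\lambda = \frac{2q}{p^T p} , \qquad g = \frac{q\, p}{p^T p} .$$

$$\inf_{w^T y \ge b}\, (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}) = \frac{(b - w^T \bar{y})^2}{w^T \Sigma_y w} .$$

(2) If $w^T \bar{y} \ge b$.
In this case, we can only have $y = \bar{y}$. Therefore, $d = 0$.
By integrating the above, we thus complete the proof of this theorem.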
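The closed form obtained in the proof is straightforward to evaluate. The sketch below computes the worst-case probability that a point with given mean and covariance falls in the half space $w^T y \ge b$, using $d^2 = \max(b - w^T \bar{y}, 0)^2 / (w^T \Sigma_y w)$. The function name and the numerical values are assumptions for illustration.

```python
import numpy as np

def worst_case_halfspace_prob(w, b, mean, cov):
    """Supremum of Pr{w^T y >= b} over all distributions with this mean and covariance."""
    gap = max(b - w @ mean, 0.0)      # zero when the mean already lies in the half space
    d2 = gap ** 2 / (w @ cov @ w)     # squared distance from the mean to the half space
    return 1.0 / (1.0 + d2)

# Hypothetical example: zero mean, identity covariance, boundary two units away.
w = np.array([1.0, 0.0])
mean = np.zeros(2)
cov = np.eye(2)
bound_far = worst_case_halfspace_prob(w, 2.0, mean, cov)
bound_inside = worst_case_halfspace_prob(w, -1.0, mean, cov)
```

With the boundary two standard deviations from the mean, no distribution with these moments can put more than one fifth of its mass beyond it. When the mean lies inside the half space, the bound is trivially 1, matching case (2) of the proof.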
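By using Lemma 3.2, we can transform the BMPM optimization problem
as follows:

$$\max_{\alpha,\, w \neq 0,\, b}\ \alpha \quad \text{s.t.} \tag{3.5}$$

$$-b + w^T \bar{x} \ge \kappa(\alpha) \sqrt{w^T \Sigma_x w} , \tag{3.6}$$

$$b - w^T \bar{y} \ge \kappa(\beta_0) \sqrt{w^T \Sigma_y w} , \tag{3.7}$$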
where $\kappa(\alpha) = \sqrt{\frac{\alpha}{1 - \alpha}}$ and $\kappa(\beta_0) = \sqrt{\frac{\beta_0}{1 - \beta_0}}$.
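The function $\kappa$ and its inverse are used repeatedly in this chapter, so a small numerical sketch may help. The inverse $\alpha = \kappa^2 / (\kappa^2 + 1)$ follows by solving $\kappa = \sqrt{\alpha / (1 - \alpha)}$ for $\alpha$. The function names are hypothetical.

```python
import math

def kappa(alpha):
    """kappa(alpha) = sqrt(alpha / (1 - alpha)), defined for alpha in [0, 1)."""
    return math.sqrt(alpha / (1.0 - alpha))

def alpha_from_kappa(k):
    """Inverse map: the worst-case accuracy that corresponds to a given kappa."""
    return k * k / (k * k + 1.0)

# kappa grows without bound as alpha approaches 1, and the two maps are inverses.
half = kappa(0.5)
recovered = alpha_from_kappa(kappa(0.8))
```

Because $\kappa$ is strictly increasing on $[0, 1)$, maximizing a worst-case accuracy and maximizing its $\kappa$ value select the same solution.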
wT y + (0 ) wT y w b wT x () wT x w .
(3.8)
If we eliminate b from this inequality, we obtain:
wT (x y) () wT x w + (0 ) wT y w .
(3.9)
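\begin{align*}
\max_{\alpha,\,w\neq 0}\ & \alpha \tag{3.10}\\
\text{s.t.}\ & 1 \ge \kappa(\alpha)\sqrt{w^T\Sigma_x w} + \kappa(\beta_0)\sqrt{w^T\Sigma_y w}, \tag{3.11}\\
& w^T(\bar{x}-\bar{y}) = 1 . \tag{3.12}
\end{align*}
Eq.(3.11) is equivalent to
\[ \kappa(\alpha) \le \frac{1-\kappa(\beta_0)\sqrt{w^T\Sigma_y w}}{\sqrt{w^T\Sigma_x w}} . \tag{3.13} \]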
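Because $\kappa(\alpha)$ increases monotonically with $\alpha$, maximizing $\alpha$ is equivalent
to maximizing $\kappa(\alpha)$, which further leads to:
\[ \max_{w\neq 0}\ \frac{1-\kappa(\beta_0)\sqrt{w^T\Sigma_y w}}{\sqrt{w^T\Sigma_x w}}, \quad\text{s.t.}\quad w^T(\bar{x}-\bar{y}) = 1 . \]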
This kind of optimization is called a Fractional Programming (FP) problem [13, 19, 26]. To elaborate further, this optimization is equivalent to solving
the following fractional problem:
\[ \max_{w\neq 0}\ \frac{f(w)}{g(w)}, \tag{3.14} \]
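subject to $w \in A = \{w \mid w^T(\bar{x}-\bar{y}) = 1\}$, where $f(w) = 1-\kappa(\beta_0)\sqrt{w^T\Sigma_y w}$ and $g(w) = \sqrt{w^T\Sigma_x w}$.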
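\begin{align*}
\max_{\alpha,\,\beta,\,w\neq 0,\,b}\ & \theta\alpha + (1-\theta)\beta \tag{3.15}\\
\text{s.t.}\ & -b + w^T\bar{x} \ge \kappa(\alpha)\sqrt{w^T\Sigma_x w}, \tag{3.16}\\
& b - w^T\bar{y} \ge \kappa(\beta)\sqrt{w^T\Sigma_y w} . \tag{3.17}
\end{align*}
Eliminating b in the same way as for BMPM, this becomes:
\begin{align*}
\max_{\alpha,\,\beta,\,w\neq 0}\ & \theta\alpha + (1-\theta)\beta \tag{3.18}\\
\text{s.t.}\ & 1 \ge \kappa(\alpha)\sqrt{w^T\Sigma_x w} + \kappa(\beta)\sqrt{w^T\Sigma_y w}, \tag{3.19}\\
& w^T(\bar{x}-\bar{y}) = 1 . \tag{3.20}
\end{align*}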
\[ 1 > \kappa(\beta)\sqrt{w^T\Sigma_y w} + \kappa(\alpha)\sqrt{w^T\Sigma_x w} . \]
A new solution constructed by increasing $\alpha$ or $\kappa(\alpha)$ by a small positive
amount,⁴ and maintaining $\beta$, $w$ unchanged will satisfy the constraints and
will be a better solution.
By applying Lemma 3.4 we can transform the optimization problem
Eq.(3.18) under the constraints of Eqs.(3.19) and (3.20) as follows:
\begin{align*}
\max_{\beta,\,w\neq 0}\ & \theta\,\frac{\kappa^2(\alpha)}{\kappa^2(\alpha)+1} + (1-\theta)\beta, \tag{3.21}\\
\text{s.t.}\ & w^T(\bar{x}-\bar{y}) = 1, \tag{3.22}
\end{align*}
where
\[ \kappa(\alpha) = \frac{1-\kappa(\beta)\sqrt{w^T\Sigma_y w}}{\sqrt{w^T\Sigma_x w}} . \]
In the problem of Eqs.(3.21) and (3.22), if we fix $\beta$ to a specific value within [0, 1), the optimization
is equivalent to maximizing $\kappa^2(\alpha)/(\kappa^2(\alpha)+1)$ and further equivalent to maximizing $\kappa(\alpha)$, which is exactly the BMPM problem. We can then update $\beta$
according to some rules and repeat the whole process until an optimal $\beta$ is
found. This is also the so-called line search problem [2, 1]. More precisely,
if we denote the value of the optimization as a function $f(\beta)$, the above procedure corresponds to finding an optimal $\beta$ to maximize $f(\beta)$. Instead of using
an explicit function as in traditional line search problems, the value of the
function here is implicitly given by a BMPM optimization procedure.
Many methods can be used to solve the line search problem. In this
chapter, we use the Quadratic Interpolation (QI) method [2]. As illustrated
in Fig.3.2, QI finds the maximum point by updating a three-point pattern
$(\beta_1, \beta_2, \beta_3)$ repeatedly. The new $\beta$, denoted by $\beta_{new}$, is given by the quadratic
interpolation from the three-point pattern. Then a new three-point pattern
is constructed from $\beta_{new}$ and two of $\beta_1, \beta_2, \beta_3$. This method can be shown to
converge superlinearly to a local optimum point [2]. Moreover, as shown in
Section 3.7, although MEMPM generally cannot guarantee its concavity, empirically it is often a concave problem. Thus the local optimum will often be
the global optimum in practice.
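The three-point update can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the objective `f` below is a toy stand-in in which w is held fixed, so that the inner BMPM step reduces to a closed form; the values `s_x = 0.5`, `s_y = 0.25` and `theta = 0.5` are hypothetical, and the function names are ours.

```python
import math

def kappa(p):
    """Distribution-free kappa(p) = sqrt(p / (1 - p))."""
    return math.sqrt(p / (1.0 - p))

def f(beta, theta=0.5, s_x=0.5, s_y=0.25):
    """Toy objective theta*alpha + (1 - theta)*beta with w held fixed.

    s_x and s_y stand for sqrt(w^T Sigma_x w) and sqrt(w^T Sigma_y w);
    the numbers are illustrative assumptions, not values from the chapter.
    """
    k_alpha = max((1.0 - kappa(beta) * s_y) / s_x, 0.0)
    alpha = k_alpha ** 2 / (k_alpha ** 2 + 1.0)
    return theta * alpha + (1.0 - theta) * beta

def qi_maximize(func, b1, b2, b3, tol=1e-8, max_iter=100):
    """Quadratic interpolation; needs b1 < b2 < b3 with func(b2) the largest."""
    f1, f2, f3 = func(b1), func(b2), func(b3)
    for _ in range(max_iter):
        # Vertex of the parabola through the three-point pattern.
        num = (b2 - b1) ** 2 * (f2 - f3) - (b2 - b3) ** 2 * (f2 - f1)
        den = (b2 - b1) * (f2 - f3) - (b2 - b3) * (f2 - f1)
        if abs(den) < 1e-15:
            break
        b_new = b2 - 0.5 * num / den
        if not (b1 < b_new < b3) or abs(b_new - b2) < tol:
            break
        f_new = func(b_new)
        # Build the new pattern from b_new and two of the old points,
        # always keeping the best point in the middle.
        if f_new > f2:
            if b_new < b2:
                b3, f3 = b2, f2
            else:
                b1, f1 = b2, f2
            b2, f2 = b_new, f_new
        elif b_new < b2:
            b1, f1 = b_new, f_new
        else:
            b3, f3 = b_new, f_new
    return b2, f2

beta_star, value = qi_maximize(f, 0.3, 0.6, 0.9)
print(round(beta_star, 3), round(value, 4))
```

In the full method each call to `func` would hide one BMPM optimization, which is why the number of function evaluations is what matters for the running time.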
4
Since $\kappa(\alpha)$ increases monotonically with $\alpha$, increasing $\alpha$ by a small positive
amount corresponds to increasing $\kappa(\alpha)$ by a small positive amount.
(
)
w
(3.23)
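\begin{align*}
\inf_{x\sim N(\bar{x},\Sigma_x)}\Pr\{w^Tx\ge b\}
&= \Pr\left\{N(0,1)\ge\frac{b-w^T\bar{x}}{\sqrt{w^T\Sigma_x w}}\right\}
 = 1-\Phi\left(\frac{b-w^T\bar{x}}{\sqrt{w^T\Sigma_x w}}\right)\\
&= \Phi\left(\frac{-b+w^T\bar{x}}{\sqrt{w^T\Sigma_x w}}\right) \ge \alpha, \tag{3.24}
\end{align*}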
where $\Phi(z)$ is the cumulative distribution function of the standard normal
distribution, defined as:
\[ \Phi(z) = \Pr\{N(0,1)\le z\} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z}\exp\left(-\frac{s^2}{2}\right)ds . \]
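Due to the monotonic property of $\Phi(z)$, we can further write Eq.(3.24) as:
\[ -b + w^T\bar{x} \ge \Phi^{-1}(\alpha)\sqrt{w^T\Sigma_x w} . \]
Constraint Eq.(3.4) can be reformulated to a similar form. The optimization
Eq.(3.2) is thus changed as:
\begin{align*}
\max_{\alpha,\,\beta,\,w\neq 0,\,b}\ & \{\theta\alpha + (1-\theta)\beta\},\\
\text{s.t.}\ & -b + w^T\bar{x} \ge \Phi^{-1}(\alpha)\sqrt{w^T\Sigma_x w}, \tag{3.25}\\
& b - w^T\bar{y} \ge \Phi^{-1}(\beta)\sqrt{w^T\Sigma_y w} . \tag{3.26}
\end{align*}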
The above optimization is nearly the same as Eq.(3.2) subject to the constraints
of Eqs.(3.3) and (3.4), except that $\kappa(\alpha)$ is equal to $\Phi^{-1}(\alpha)$ instead
of $\sqrt{\alpha/(1-\alpha)}$. Thus, it can be similarly solved based on the Sequential Biased
Minimax Probability Machine method.
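To get a feeling for how much the Gaussian assumption relaxes the constraints, the two choices of $\kappa(\alpha)$ can be compared numerically. The short sketch below is only an illustration; the function names are ours, and it relies on `statistics.NormalDist` from the Python standard library for $\Phi^{-1}$.

```python
import math
from statistics import NormalDist

def kappa_distribution_free(alpha):
    """kappa(alpha) = sqrt(alpha / (1 - alpha)), valid for any distribution."""
    return math.sqrt(alpha / (1.0 - alpha))

def kappa_gaussian(alpha):
    """kappa(alpha) = Phi^{-1}(alpha), valid under the Gaussian assumption."""
    return NormalDist().inv_cdf(alpha)

for alpha in (0.6, 0.8, 0.9, 0.99):
    print(alpha,
          round(kappa_distribution_free(alpha), 4),
          round(kappa_gaussian(alpha), 4))
```

For every $\alpha$ in (0, 1) the distribution-free value is the larger of the two, so the same hyperplane certifies a higher accuracy once the Gaussian assumption is accepted.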
On the other hand, the Bayes optimal hyperplane corresponds to the one,
wT z = b, which minimizes the Bayes error:
min
w=0,b
,
wT x w
denoted as $NS$, is independent of $w$, as is the case for the Gaussian distribution,
a MEMPM version similar to the one under the Gaussian assumption can be
easily derived, except that $\Phi(z)$ is replaced by $\Pr\{NS(0,1)\le z\}$. In such a
case, minimizing the Bayes error bound will exactly minimize the true Bayes
error.
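Before presenting Proposition 3.7, we first introduce the Central Limit
Theorem under the Lyapunov condition [5].
Theorem 3.6. Let $x_n$ be a sequence of independent random variables defined
on the same probability space. Assume that $x_n$ has finite expected value $\mu_n$
and finite standard deviation $\sigma_n$. We define $s_n^2 = \sum_{i=1}^{n}\sigma_i^2$. Assume that the
third central moment $r_n^3 = \sum_{i=1}^{n}E(|x_i-\mu_i|^3)$ is finite for every n and that
$\lim_{n\to\infty} r_n/s_n = 0$ (the Lyapunov condition). Then the normalized sum
$\left(\sum_{i=1}^{n}x_i - \sum_{i=1}^{n}\mu_i\right)/s_n$ converges in distribution to the standard normal distribution $N(0,1)$.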
One interesting finding directly elicited from the above Central Limit
Theorem is that, if the component variable $x_i$ of a given n-dimensional random variable $x$ satisfies the Lyapunov condition, the sum of weighted component variables $x_i$, $1\le i\le n$, namely $w^Tx$, tends to a Gaussian distribution as n grows.⁵ This shows that, under the Lyapunov condition, when
the dimension n grows, the hyperplane derived by MEMPM with the Gaussian
assumption tends to be the true Bayes optimal hyperplane. In this case, the
MEMPM using $\Phi^{-1}(\alpha)$,
the inverse function of the normal cumulative distribution, instead of $\sqrt{\alpha/(1-\alpha)}$, will converge to the true Bayes optimal
decision hyperplane in the high-dimensional space. We summarize the analysis into Proposition 3.7.
Proposition 3.7. If the component variable $x_i$ of a given n-dimensional random variable $x$ satisfies the Lyapunov condition, the MEMPM hyperplane derived by using $\Phi^{-1}(\alpha)$, the inverse function of the normal cumulative distribution,
will converge to the true Bayes optimal one.
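One simple symptom of this Gaussian tendency can be computed exactly. The sketch below is an illustration of ours rather than part of the original analysis: it uses the fact that cumulants of independent variables add, so the excess kurtosis of $w^Tx$ (zero for a Gaussian) can be written down in closed form. The choice of uniform components on [-1, 1] with equal weights is an assumption made only for the example, and a vanishing excess kurtosis is a necessary sign of Gaussianity, not a proof of it.

```python
def excess_kurtosis_weighted_sum(weights, variances, fourth_cumulants):
    """Excess kurtosis of sum_i w_i x_i for independent components x_i.

    Cumulants of independent variables add, and the k-th cumulant of
    w * x is w**k times the k-th cumulant of x.
    """
    k2 = sum(w ** 2 * v for w, v in zip(weights, variances))
    k4 = sum(w ** 4 * c for w, c in zip(weights, fourth_cumulants))
    return k4 / k2 ** 2

def uniform_example(n):
    """n independent components, uniform on [-1, 1], all weights equal to 1."""
    variance = 1.0 / 3.0           # E[x^2]
    fourth_cumulant = -2.0 / 15.0  # E[x^4] - 3 * variance^2 = 1/5 - 1/3
    return excess_kurtosis_weighted_sum([1.0] * n, [variance] * n,
                                        [fourth_cumulant] * n)

for n in (1, 10, 100):
    print(n, uniform_example(n))
```

The value shrinks like 1/n, in line with the claim that the projection $w^Tx$ looks increasingly Gaussian as the dimension grows.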
The underlying justifications of the above two propositions are rooted in the
fact that the generalized MPM is exclusively determined by the first and second moments. These two propositions actually emphasize the dominance of
the first and second moments in representing data. More specifically, Proposition 3.5 hints that the distribution is only decided by up to the second
moment. The Lyapunov condition in Proposition 3.7 also implies that the
second order moment dominates the third order moment in the long run. It
also deserves attention that, with the mean and covariance fixed, the distribution of Maximum Entropy Estimation is the Gaussian distribution [14]. This
would once again suggest the usage of $\Phi^{-1}(\alpha)$ in the high-dimensional space.
5
Some techniques such as Independent Component Analysis [8] can be applied to reduce the dependence among random variables beforehand.
3.2.6 Geometrical Interpretation
In this section, we first provide a parametric solving method for BMPM, then
demonstrate that this parametric method actually enables a nice geometrical
interpretation for both BMPM and MEMPM.
3.2.6.1 A Parametric Method for BMPM
According to the parametric method, the fractional function can be iteratively optimized in two steps [26]:
Step 1. Find $w$ by maximizing $f(w) - \lambda g(w)$ in the domain A, where $\lambda \in R$
is the newly introduced parameter.
Step 2. Update $\lambda$ by $f(w)/g(w)$.
Iterating these two steps is guaranteed to converge to a local
maximum, which is also the global maximum in our problem. In the following,
we adopt a method to solve the maximization problem in Step 1. Replacing
$f(w)$ and $g(w)$, we expand the optimization problem as:
(3.29)
where , R. This optimization form is very similar to the one in Minimax
Probability Machine [15] and can also be solved by using an iterative least-squares approach.
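The two-step iteration can be sketched for a two-dimensional toy problem. The sketch is not the iterative least-squares solver mentioned above: as a simplifying assumption, Step 1 is solved with a plain ternary search along the single direction that keeps $w^T(\bar{x}-\bar{y}) = 1$, which is valid here because $f(w) - \lambda g(w)$ is concave along that line whenever $\lambda \ge 0$. The function names, the search interval and the data (means, covariances, $\beta_0 = 0.5$) are chosen by us for illustration.

```python
import math

def quad_form(w, S):
    """Return w^T S w for a 2-vector w and a 2x2 matrix S."""
    return (w[0] * (S[0][0] * w[0] + S[0][1] * w[1])
            + w[1] * (S[1][0] * w[0] + S[1][1] * w[1]))

def bmpm_parametric(mean_x, mean_y, cov_x, cov_y, beta0, tol=1e-10, max_iter=50):
    """Parametric (two-step) method for the BMPM fractional problem in 2-D."""
    k_beta0 = math.sqrt(beta0 / (1.0 - beta0))
    dx, dy = mean_x[0] - mean_y[0], mean_x[1] - mean_y[1]
    norm2 = dx * dx + dy * dy
    w0 = (dx / norm2, dy / norm2)   # satisfies w0^T (mean_x - mean_y) = 1
    perp = (-dy, dx)                # moving along perp keeps the constraint

    def w_of(t):
        return (w0[0] + t * perp[0], w0[1] + t * perp[1])

    def f(w):
        return 1.0 - k_beta0 * math.sqrt(quad_form(w, cov_y))

    def g(w):
        return math.sqrt(quad_form(w, cov_x))

    lam, w = 0.0, w0
    for _ in range(max_iter):
        # Step 1: maximize f(w) - lam * g(w) over the feasible line.
        lo, hi = -10.0, 10.0
        for _ in range(200):
            m1 = lo + (hi - lo) / 3.0
            m2 = hi - (hi - lo) / 3.0
            if f(w_of(m1)) - lam * g(w_of(m1)) < f(w_of(m2)) - lam * g(w_of(m2)):
                lo = m1
            else:
                hi = m2
        w = w_of(0.5 * (lo + hi))
        # Step 2: update lam by f(w) / g(w).
        new_lam = f(w) / g(w)
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    alpha = lam * lam / (1.0 + lam * lam)   # invert kappa(alpha) = lam
    return w, lam, alpha

w, lam, alpha = bmpm_parametric((3.0, 0.0), (-1.0, 0.0),
                                [[4.0, 0.0], [0.0, 1.0]],
                                [[1.0, 0.0], [0.0, 5.0]], beta0=0.5)
print(w, lam, alpha)
```

At convergence $\lambda$ equals the optimal $\kappa(\alpha)$, so the worst-case accuracy for class x is recovered as $\alpha = \lambda^2/(1+\lambda^2)$.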
3.2.6.2 A Geometrical Interpretation for BMPM and MEMPM
The parametric method actually enables a nice geometrical interpretation of
BMPM and MEMPM in a fashion similar to that of MPM in [16]. Similarly,
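\[ \text{s.t.}\quad \|u\|_2\le 1,\ \|v\|_2\le 1 . \]
We change the order of the min and max operators and consider the min:
\[ \min_{w\neq 0}\{\lambda u^T\Sigma_x^{1/2}w + \kappa(\beta_0)v^T\Sigma_y^{1/2}w + \tau(1-w^T(\bar{x}-\bar{y}))\} = \begin{cases}\tau, & \text{if } \tau\bar{x}-\lambda\Sigma_x^{1/2}u = \tau\bar{y}+\kappa(\beta_0)\Sigma_y^{1/2}v;\\ -\infty, & \text{otherwise.}\end{cases} \]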
,u,v
,u,v
(3.32)
(3.33)
\[ H_y(\ell) = \{y = \bar{y} + \kappa(\beta_0)\Sigma_y^{1/2}v : \|v\|_2\le\ell\} . \tag{3.34} \]
The above optimization involves finding a minimum $\ell$ for which two ellipsoids intersect. For the optimum $\ell$, these two ellipsoids would be tangent to
each other. We further note that, according to Lemma 3.4, at the optimum,
$\lambda^*$, which is maximized via a series of the above procedures, would satisfy
\[ 1 = \lambda^*\|\Sigma_x^{1/2}w^*\|_2 + \kappa(\beta_0)\|\Sigma_y^{1/2}w^*\|_2 = \tau^* = 1/\ell^*, \tag{3.35} \]
\[ \Rightarrow\ \ell^* = 1 . \tag{3.36} \]
This means that the ellipsoid for the class y finally changes to the one
centered at $\bar{y}$ whose Mahalanobis distance to $\bar{y}$ is exactly equal to $\kappa(\beta_0)$.
Moreover, the ellipsoid for the class x would be the one centered at $\bar{x}$ and
tangent to the ellipsoid for the class y. In comparison, for MPM, two ellipsoids grow with the same speed (with the same $\kappa(\alpha)$ and $\kappa(\beta)$). On the
other hand, since MEMPM corresponds to solving a sequence of BMPMs,
it similarly leads to a hyperplane tangent to two ellipsoids, which
minimizes the maximum of the worst-case Bayes error. Moreover, it is not
necessarily attained in a balanced way as in MPM, i.e. two ellipsoids do not
necessarily grow with the same speed and hence probably have unequal
Mahalanobis distances from their corresponding centers. This is illustrated in
Fig. 3.3.
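\begin{align*}
\max_{\alpha,\,\beta,\,w\neq 0,\,b}\ & \theta\alpha + (1-\theta)\beta \tag{3.37}\\
\text{s.t.}\ & \inf_{x\sim(\bar{x},\Sigma_x)}\Pr\{w^Tx\ge b\}\ge\alpha,\quad \forall(\bar{x},\Sigma_x)\in X, \tag{3.38}\\
& \inf_{y\sim(\bar{y},\Sigma_y)}\Pr\{w^Ty\le b\}\ge\beta,\quad \forall(\bar{y},\Sigma_y)\in Y, \tag{3.39}
\end{align*}
where X and Y are the sets of means and covariance matrices and are
subsets of $R^n\times P_n^+$, where $P_n^+$ is the set of $n\times n$ symmetric positive semidefinite
matrices.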
Motivated by the tractability of the problem and from the statistical view,
a specific setting of X and Y is proposed in [16]. However, they consider the
same variations of the means for two classes, which is easy to handle but less
general. Now, considering the unequal treatment of each class, we propose
the following setting which is in a more general and complete form:
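\[ X = \{(\bar{x},\Sigma_x) \mid (\bar{x}-\bar{x}^0)^T\Sigma_x^{-1}(\bar{x}-\bar{x}^0)\le\nu_x^2,\ \|\Sigma_x-\Sigma_x^0\|_F\le\rho_x\}, \]
\[ Y = \{(\bar{y},\Sigma_y) \mid (\bar{y}-\bar{y}^0)^T\Sigma_y^{-1}(\bar{y}-\bar{y}^0)\le\nu_y^2,\ \|\Sigma_y-\Sigma_y^0\|_F\le\rho_y\}, \]
where $\bar{x}^0$, $\Sigma_x^0$ are the nominal means and covariance matrices obtained
through estimating. Parameters $\nu_x$, $\nu_y$, $\rho_x$, and $\rho_y$ are positive constants.
The matrix norm is defined as the Frobenius norm: $\|M\|_F^2 = \mathrm{Tr}(M^TM)$.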
With the assumption that variations of the means for two classes are the
same, the parameters $\nu_x$ and $\nu_y$ are required to be equal in [16]. This may enable
the direct usage of the MPM optimization in its robust version. However,
the assumption may not be true in real cases. Moreover, in MEMPM, this
requirement is both unnecessary and inappropriate. This will be later demonstrated in the experiment.
By applying the results from [16], we obtain the robust MEMPM as:
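\begin{align*}
\max_{\alpha,\,\beta,\,w\neq 0,\,b}\ & \{\theta\alpha + (1-\theta)\beta\},\\
\text{s.t.}\ & -b + w^T\bar{x}^0 \ge (\kappa(\alpha)+\nu_x)\sqrt{w^T(\Sigma_x^0+\rho_xI_n)w},\\
& b - w^T\bar{y}^0 \ge (\kappa(\beta)+\nu_y)\sqrt{w^T(\Sigma_y^0+\rho_yI_n)w} .
\end{align*}
Analogously, this can be transformed into:
\begin{align*}
\max_{\alpha,\,\beta,\,w\neq 0}\ & \theta\,\frac{\kappa_r^2(\alpha)}{1+\kappa_r^2(\alpha)} + (1-\theta)\beta, \tag{3.40}\\
\text{s.t.}\ & w^T(\bar{x}^0-\bar{y}^0) = 1, \tag{3.41}
\end{align*}
where
\[ \kappa_r(\alpha) = \max\left(\frac{1-(\kappa(\beta)+\nu_y)\sqrt{w^T(\Sigma_y^0+\rho_yI_n)w}}{\sqrt{w^T(\Sigma_x^0+\rho_xI_n)w}} - \nu_x,\ 0\right), \]
and thus can be solved by the same sequential BMPM procedure as before.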
3.4 Kernelization
We note that, in the above, the classifier derived from MEMPM is given in
a linear configuration. In order to handle nonlinear classification problems,
in this section, we seek to use the kernelization trick [22] to map the n-dimensional data points into a high-dimensional feature space $R^f$, where a
linear classifier corresponds to a nonlinear hyperplane in the original space.
Since the optimization of MEMPM corresponds to a sequence of BMPM
optimization problems, this model naturally inherits the kernelization ability of BMPM. In the following we thus mainly address the kernelization of
BMPM.
Assuming training data points are represented by $\{x_i\}_{i=1}^{N_x}$ and $\{y_j\}_{j=1}^{N_y}$
for the class x and y, respectively, the kernel mapping can be formulated as:
\[ x \to \varphi(x)\sim(\overline{\varphi(x)},\Sigma_{\varphi(x)}), \qquad y \to \varphi(y)\sim(\overline{\varphi(y)},\Sigma_{\varphi(y)}), \]
where $\varphi: R^n\to R^f$ is a mapping function. The corresponding linear classifier in $R^f$ is $w^T\varphi(z) = b$, where $w, \varphi(z)\in R^f$, and $b\in R$. Similarly, the
transformed FP optimization in BMPM can be written as:
\[ \max_{w\neq 0}\ \frac{1-\kappa(\beta_0)\sqrt{w^T\Sigma_{\varphi(y)}w}}{\sqrt{w^T\Sigma_{\varphi(x)}w}}, \quad\text{s.t.}\quad w^T(\overline{\varphi(x)}-\overline{\varphi(y)}) = 1 . \tag{3.42} \]
However, to make the kernel work, we need to represent the final decision
hyperplane and the optimization in a kernel form, $K(z_1,z_2) = \varphi(z_1)^T\varphi(z_2)$,
namely an inner product form of the mapped data points.
Nx
i (xi ) ,
i=1
(x) = x I n +
(y) =
Ny
j (y j ) ,
j=1
Nx
i=1
(y) = y I n +
Ny
j=1
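\[ \overline{\varphi(x)} = \frac{1}{N_x}\sum_{i=1}^{N_x}\varphi(x_i), \qquad \overline{\varphi(y)} = \frac{1}{N_y}\sum_{j=1}^{N_y}\varphi(y_j), \]
\[ \Sigma_{\varphi(x)} = \frac{1}{N_x}\sum_{i=1}^{N_x}(\varphi(x_i)-\overline{\varphi(x)})(\varphi(x_i)-\overline{\varphi(x)})^T, \]
\[ \Sigma_{\varphi(y)} = \frac{1}{N_y}\sum_{j=1}^{N_y}(\varphi(y_j)-\overline{\varphi(y)})(\varphi(y_j)-\overline{\varphi(y)})^T, \]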
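\[ w = \sum_{i=1}^{N_x}\mu_i\varphi(x_i) + \sum_{j=1}^{N_y}\upsilon_j\varphi(y_j), \tag{3.43} \]
\[ z_i := x_i,\quad i = 1, 2, \ldots, N_x, \qquad z_i := y_{i-N_x},\quad i = N_x+1, N_x+2, \ldots, N . \]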
\[ [\tilde{k}_x]_i := \frac{1}{N_x}\sum_{j=1}^{N_x}K(x_j,z_i), \qquad [\tilde{k}_y]_i := \frac{1}{N_y}\sum_{j=1}^{N_y}K(y_j,z_i) . \]
In the above, $1_{N_x}\in R^{N_x}$ and $1_{N_y}\in R^{N_y}$ are defined as:
\[ 1_i = 1,\quad i = 1, 2, \ldots, N_x, \qquad 1_j = 1,\quad j = 1, 2, \ldots, N_y . \]
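\[ \kappa(\alpha^*) = \max_{w\neq 0}\ \frac{1-\kappa(\beta_0)\sqrt{\frac{1}{N_y}w^T\tilde{K}_y^T\tilde{K}_yw}}{\sqrt{\frac{1}{N_x}w^T\tilde{K}_x^T\tilde{K}_xw}}, \quad\text{s.t.}\quad w^T(\tilde{k}_x-\tilde{k}_y) = 1 . \tag{3.44} \]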
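\[ b^* = w^{*T}\tilde{k}_x - \kappa(\alpha^*)\sqrt{\frac{1}{N_x}w^{*T}\tilde{K}_x^T\tilde{K}_xw^*} = w^{*T}\tilde{k}_y + \kappa(\beta_0)\sqrt{\frac{1}{N_y}w^{*T}\tilde{K}_y^T\tilde{K}_yw^*}, \]
\[ \sum_{i=1}^{N_x}w_i^*K(z,x_i) + \sum_{i=1}^{N_y}w_{N_x+i}^*K(z,y_i) - b^* . \]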
3.5 Experiments
In this section, we first evaluate our model on a synthetic dataset. Then we
compare the performance of MEMPM with that of MPM on six real-world
benchmark datasets (since MPM is reported to be comparable to SVM, we do
not perform comparisons with SVM). To demonstrate that BMPM is ideal
for imposing a specified bias in classification, we also implement it on the
Heart-disease dataset. The means and covariance matrices for two classes are
obtained directly from the training datasets by plug-in estimations. The prior
probability $\theta$ is given by the proportion of x data in the training dataset.
3.5.1 Model Illustration on a Synthetic Dataset
To verify that the MEMPM model achieves the minimum Bayes error rate
under the Gaussian distribution, we synthetically generate two classes of two-dimensional Gaussian data. As plotted in Fig. 3.4(a), data associated with the
class x are generated with the mean $\bar{x}$ as $[3, 0]^T$ and the covariance matrix $\Sigma_x$
as [4, 0; 0, 1], while data associated with the class y are generated with the
mean $\bar{y}$ as $[-1, 0]^T$ and the covariance matrix $\Sigma_y$ as [1, 0; 0, 5]. The solved
decision hyperplane $z_1 = 0.333$ given by MPM is plotted as the solid line
and the solved decision hyperplane $z_1 = 0.660$ given by MEMPM is plotted
as the dashed line. From the geometrical interpretation, both hyperplanes
should be perpendicular to the $z_1$ axis.
As shown in Fig. 3.4(b), the MEMPM hyperplane exactly represents the
optimal thresholding under the distributions of the first dimension for two
classes of data, i.e. the intersection point of two density functions. On the
other hand, we find that the MPM hyperplane exactly corresponds to the
thresholding point with the same error rate for two classes of data, since the
cumulative probabilities $P_x(z_1 < 0.333)$ and $P_y(z_1 > 0.333)$ are exactly the
same.
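Both thresholds can be checked with a few lines of Python on the first dimension alone. The sketch is ours and rests on two assumptions: the class-y mean on $z_1$ is taken as -1 (the value consistent with the reported thresholds), and the two classes are given equal priors, which is what reproduces the reported 0.660.

```python
from statistics import NormalDist

x_dim1 = NormalDist(mu=3.0, sigma=2.0)    # class x on z1: mean 3, variance 4
y_dim1 = NormalDist(mu=-1.0, sigma=1.0)   # class y on z1: mean -1, variance 1

# MPM: both means lie at the same normalised distance from the threshold,
# (mean_x - b) / sigma_x = (b - mean_y) / sigma_y.
b_mpm = ((x_dim1.mean * y_dim1.stdev + y_dim1.mean * x_dim1.stdev)
         / (x_dim1.stdev + y_dim1.stdev))

def density_gap(z):
    """Difference of the two class densities (equal priors assumed)."""
    return x_dim1.pdf(z) - y_dim1.pdf(z)

# MEMPM under the Gaussian assumption: the densities intersect between
# the two means; locate the crossing by bisection.
lo, hi = y_dim1.mean, x_dim1.mean   # gap is negative at lo, positive at hi
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if density_gap(mid) < 0.0:
        lo = mid
    else:
        hi = mid
b_mempm = 0.5 * (lo + hi)

err_x = x_dim1.cdf(b_mpm)          # P_x(z1 < b_mpm)
err_y = 1.0 - y_dim1.cdf(b_mpm)    # P_y(z1 > b_mpm)
print(round(b_mpm, 3), round(b_mempm, 3), err_x, err_y)
```

The MPM threshold equalizes the two error rates, while the MEMPM threshold sits where the weighted densities cross, which is the Bayes-optimal cut for this one-dimensional problem.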
3.5.2 Evaluations on Benchmark Datasets
We next evaluate our algorithm on six benchmark datasets. Data for the
Twonorm problem were generated according to [4]. The remaining five datasets,
namely the Breast, Ionosphere, Pima, Heart-disease, and Vote data, were
obtained from the UCI machine learning repository [3]. Since handling
missing attribute values is out of the scope of this chapter, we simply remove
instances with missing attribute values in these datasets.
We randomly partition data into 90% training and 10% test sets. The
final results are averaged over 50 random partitions of data. We compare the
performance of MEMPM and MPM in both the linear setting and the Gaussian
kernel setting. The width parameter ($\sigma$) for the Gaussian kernel is obtained
via cross validations over 50 random partitions of the training set. The experimental results are summarized in Tables 3.1 and 3.2 for the linear kernel
and Gaussian kernel respectively.
From the results we can see that our MEMPM demonstrates better performance than MPM in both the linear and Gaussian kernel settings. Moreover,
as observed in these benchmark datasets, the MEMPM hyperplanes are obtained with significantly unequal $\alpha$ and $\beta$ except in the Twonorm set. This
further confirms the validity of our proposition, i.e. the optimal minimax machine is not certain to achieve the same worst-case accuracies for two classes.
Twonorm is not really an exception either. The two classes of data in this
set are generated under multivariate normal distributions with the same
covariance matrices. In this special case, the intersection point of two density
functions will exactly represent the optimal thresholding point and the one
with the same error rate for each class as well. Another important finding is
that the accuracy bounds, namely $\theta\alpha + (1-\theta)\beta$ in MEMPM and $\alpha$ in MPM,
are all increased in the Gaussian kernel setting when compared with those
in the linear setting. This shows the advantage of the kernelized probability
machine over the linear probability machine.

Table 3.1. Lower bound $\alpha$, $\beta$, and test accuracy compared to MPM in the
linear setting

Dataset          θα + (1 − θ)β (%)    Accuracy (%)
Twonorm          80.1 ± 0.1           97.9 ± 0.1
Breast           86.7 ± 0.5           97.0 ± 0.2
Ionosphere       74.5 ± 0.8           84.8 ± 0.8
Pima             41.3 ± 0.8           76.1 ± 0.6
Heart-disease    56.3 ± 1.4           83.2 ± 0.8
Vote             83.9 ± 0.9           94.8 ± 0.4

Table 3.2. Lower bound $\alpha$, $\beta$, and test accuracy compared to MPM in the
Gaussian kernel

Dataset          θα + (1 − θ)β (%)    Accuracy (%)
Twonorm          91.7 ± 0.2           97.9 ± 0.1
Breast           89.9 ± 0.4           96.9 ± 0.3
Ionosphere       89.4 ± 0.8           92.2 ± 0.4
Pima             41.4 ± 1.1           76.2 ± 0.6
Heart-disease    58.0 ± 1.5           83.1 ± 1.0
Vote             84.7 ± 0.8           94.6 ± 0.4
In addition, to clearly see the relationship between the bounds and the
test set accuracies (TSA), we plot them in Fig. 3.5. As observed, the test
set accuracies including $TSA_x$ (for the class x), $TSA_y$ (for the class y), and
the overall accuracies TSA are all greater than their corresponding accuracy
bounds both in MPM and MEMPM. This demonstrates how the accuracy
bound can serve as the performance indicator on future data.
Fig. 3.5. Empirical evaluations on bounds and test set accuracies of MEMPM. The
test accuracies including $TSA_x$ (for the class x), $TSA_y$ (for the class y), and the
overall accuracies TSA are all greater than their corresponding accuracy bounds
both in MPM and MEMPM. This demonstrates how the accuracy bound can serve
as the performance indicator on future data
Since the lower bounds agree well with the test accuracies in the above
experimental results, we do not perform the robust version of both models for
the real-world datasets. To see how the robust version works we generate two
classes of Gaussian data. As illustrated in Fig. 3.6, the x data are sampled
from the Gaussian distribution with the mean as $[3, 0]^T$ and the covariance
as [1 0; 0 3], while the y data are sampled from another Gaussian distribution
with the mean as $[-3, 0]^T$ and the covariance as [3 0; 0 1]. We randomly select
10 points of each class for training and leave the remaining points of the
above synthetic dataset for testing. We present the result in the following.
First, we calculate the corresponding means $\bar{x}^0$ and $\bar{y}^0$ and covariance matrices $\Sigma_x^0$ and $\Sigma_y^0$ and plug them into the linear MPM and the linear MEMPM.
We obtain the MPM decision line (dotted line) with a lower bound (assuming
the Gaussian distribution) of 99.1% and the MEMPM decision line (dash-dot line) with a lower bound of 99.7% respectively. However, for the test set
we only obtain the accuracies 93.0% for MPM and 97.0% for MEMPM (see
Fig. 3.6(a)). This obviously violates the lower bound.
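Based on our knowledge of the real means and covariance matrices in this
example, we set the parameters as
\[ \nu_x = \sqrt{(\bar{x}-\bar{x}^0)^T\Sigma_x^{-1}(\bar{x}-\bar{x}^0)} = 0.046, \]
\[ \nu_y = \sqrt{(\bar{y}-\bar{y}^0)^T\Sigma_y^{-1}(\bar{y}-\bar{y}^0)} = 0.496, \]
\[ \rho_x = \|\Sigma_x-\Sigma_x^0\|_F = 1.561, \]
\[ \rho_y = \|\Sigma_y-\Sigma_y^0\|_F = 0.972, \]
\[ \nu = \max(\nu_x, \nu_y) . \]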
We then train the robust linear MPM and the robust linear MEMPM with
these parameters and obtain the robust MPM decision line (dashed line) and the
robust MEMPM decision line (solid line), as seen in Fig. 3.6(a). The lower
bounds decrease to 87.3% for MPM and 93.2% for MEMPM respectively,
but the test accuracies increase to 98.0% for MPM and 100.0% for MEMPM.
The lower bounds now accord with the test accuracies.
Note that in the above, the robust MEMPM also achieves a better performance than the robust MPM. Moreover, $\nu_x$ and $\nu_y$ are not necessarily
the same. To see the result of MEMPM when $\nu_x$ and $\nu_y$ are forced to be
the same, we train the robust MEMPM again by setting the parameters as
$\nu_x = \nu_y = \nu$ as used in MPM. We obtain the corresponding decision line
(dash-dot line) as seen in Fig. 3.6(b). The lower bound decreases to 91.0%
and the test accuracy decreases to 98.0%. The above example indicates how
the robust MEMPM clearly improves over the standard MEMPM when a
bias is introduced by inaccurate plug-in estimates, and also validates
that $\nu_x$ need not be equal to $\nu_y$.
3.5.3 Evaluations of BMPM on Heart-disease Dataset
To demonstrate the advantages of the BMPM model in dealing with biased
classifications, we implement BMPM on the Heart-disease dataset, where
different treatments for different classes are necessary. The x class is associated with data with heart diseases, whereas the y class corresponds to data
without heart diseases. Obviously, a bias should be considered for x, since
misclassification of an x case into the opposite class would delay the therapy
and is more risky than the other way round. Similarly, we randomly partition data into 90% training and 10% test sets. Also, the width parameter
($\sigma$) for the Gaussian kernel is obtained via cross validations over 50 random
partitions of the training set. We repeat the above procedures 50 times and
report the average results.
By intentionally varying $\beta_0$ from 0 to 1, we obtain a series of test accuracies, including the x accuracy $TSA_x$ and the y accuracy $TSA_y$ for both the
linear and Gaussian kernels. For simplicity, we denote the x accuracy in the
linear setting as $TSA_x(L)$, while others are similarly defined.
The results are summarized in Fig. 3.7. Four observations are worth highlighting. First, in both linear and Gaussian kernel settings, the smaller $\beta_0$,
the higher the test accuracy for x. This indicates that a bias can indeed be embedded in the classification boundary for the important class x by specifying a
relatively smaller $\beta_0$. In comparison, MPM forces an equal treatment on each
class and thus is not suitable for biased classification. Second, the test accuracies for y and x are strictly lower bounded by $\beta_0$ and $\alpha$. This shows how a bias
can be quantitatively, directly and rigorously imposed towards the important
class x. Note again that, for other weight-adapting-based biased classifiers,
the weights themselves lack accurate interpretations and thus cannot rigorously impose a specified bias, i.e. one would have to try different weights for a
specified bias. Third, when given a prescribed $\beta_0$, the test accuracy for x and
its worst-case accuracy $\alpha$ in the Gaussian kernel setting are both increased
compared to the corresponding accuracies in the linear setting. Once again,
this demonstrates the power of the kernelization. Fourth, we note that $\beta_0$
actually has an upper bound, which is around 90% for the linear BMPM
in this dataset. This is reasonable. Observed from Eq.(3.11), the maximum
$\beta_0$, denoted as $\beta_{0\max}$, is decided by setting $\alpha = 0$, i.e.
\[ \kappa(\beta_{0\max}) = \max_{w\neq 0}\ \frac{1}{\sqrt{w^T\Sigma_yw}} \quad\text{s.t.}\quad w^T(\bar{x}-\bar{y}) = 1 . \tag{3.45} \]
It is interesting to note that when $\beta_0$ is set to zero, the test accuracies for
y in the linear and Gaussian settings are both around 50% (see Fig. 3.7(b)).
This seeming irrationality is actually reasonable. We will discuss this in
the next section.
Fig. 3.7. Bounds and real accuracies for BMPM in Heart-disease dataset.
With $\beta_0$ varying from 0 to 1, the real accuracies are lower bounded by the
worst-case accuracies. In addition, $\alpha(G)$ is above $\alpha(L)$, which shows the
power of the kernelization
\[ \sup_{y\sim(\bar{y},\Sigma_y)}\Pr\{w^Ty\ge b\} = \frac{1}{1+d^2}, \quad\text{with}\quad d^2 = \inf_{w^Ty\ge b}(y-\bar{y})^T\Sigma_y^{-1}(y-\bar{y}) . \]
Looking into the above equation and Eq.(3.4), for a given hyperplane
$\{w, b\}$ we can easily obtain:
\[ \beta = \frac{d^2}{1+d^2} . \tag{3.46} \]
Moreover, in [16], a simple closed-form expression for the minimum distance d is derived:
\[ d^2 = \inf_{w^Ty\ge b}(y-\bar{y})^T\Sigma_y^{-1}(y-\bar{y}) = \frac{\max((b-w^T\bar{y}),\,0)^2}{w^T\Sigma_yw} . \tag{3.47} \]
It is easy to see that when the decision hyperplane $(w, b)$ passes the center
$\bar{y}$, d would be equal to 0 and the worst-case accuracy $\beta$ would be 0 according
to Eq.(3.46).
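Eqs.(3.46) and (3.47) translate directly into a small helper. The sketch below is illustrative only: the function name is ours, and the numbers reuse the class-y parameters of the synthetic example of Section 3.5.1 (with the class-y mean on the first axis taken as -1, an assumption of this example) together with a vertical decision line.

```python
from statistics import NormalDist

def worst_case_accuracy(w, b, mean, cov):
    """beta = d^2 / (1 + d^2), with d^2 = max(b - w^T mean, 0)^2 / (w^T cov w)."""
    n = len(w)
    w_mean = sum(w[i] * mean[i] for i in range(n))
    w_cov_w = sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))
    d2 = max(b - w_mean, 0.0) ** 2 / w_cov_w
    return d2 / (1.0 + d2)

w = (1.0, 0.0)                       # vertical decision line z1 = b
mean_y = (-1.0, 0.0)
cov_y = [[1.0, 0.0], [0.0, 5.0]]

beta_mpm = worst_case_accuracy(w, 1.0 / 3.0, mean_y, cov_y)   # line z1 = 0.333
beta_center = worst_case_accuracy(w, -1.0, mean_y, cov_y)     # line through the mean

# For comparison: the accuracy on class y if the data really are Gaussian.
gauss_mpm = NormalDist(mu=-1.0, sigma=1.0).cdf(1.0 / 3.0)
gauss_center = NormalDist(mu=-1.0, sigma=1.0).cdf(-1.0)
print(beta_mpm, gauss_mpm, beta_center, gauss_center)
```

A line through the class mean has a worst-case accuracy of 0, yet Gaussian data would still be classified correctly half of the time, which is the gap discussed in the text that follows.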
However, if we consider the Gaussian data (which we assume as y data)
in Fig. 3.8, a vertical line approximating $\bar{y}$ would achieve about 50% test
accuracy. The large gap between the worst-case accuracy and the real test
accuracy seems strange. In the following, we construct an example of one-dimensional data to show the inner rationality of this observation. We attempt to provide the worst-case distribution with the given mean and
covariance, for which a hyperplane passing its mean achieves a real test accuracy
of zero.
Fig. 3.8. Theoretical comparison between the worst-case accuracy and the
real test accuracy for the Gaussian data in Fig. 3.10(a)
y =m+ ,
N
N 1 2
.
y =
N
When N goes to infinity, the above one-dimensional data have the mean $m$
and the covariance $\sigma^2$. In this extreme case, a hyperplane passing the mean
will achieve a zero test accuracy, which is exactly the worst-case accuracy
given the fixed mean and covariance as $m$ and $\sigma^2$ respectively. This example
demonstrates the inner rationality of the minimax probability machines.
To further examine the tightness of the worst-case bound in Fig. 3.9(a),
we vary $\beta$ from 0 to 1 and plot it against the real test accuracy with which a vertical
line classifies the y data, by using Eq.(3.46). Note that the real accuracy can
be calculated as $\Phi(z\le d)$. This curve is plotted in Fig. 3.10.
Fig. 3.9. Three two-dimensional data with the same means and covariances but
with different skewness. The worst-case accuracy bound of (a) is tighter than that
of (b) and looser than that of (c)
Fig. 3.10. Three two-dimensional data with the same means and covariances but with different skewness. The worst-case accuracy bound of (a) is
tighter than that of (b) and looser than that of (c)
Observed from Fig. 3.9, the smaller the worst-case accuracy, the looser it
is. On the other hand, if we skew the y data towards the left side, while simultaneously maintaining the mean and covariance unchanged (see Fig. 3.9(b)),
an even bigger gap will be generated when $\beta$ is small; analogously, if we skew
the data towards the right side (see Fig. 3.9(c)), a tighter accuracy bound will
be expected. This finding would mean that only adopting up to the second
order moments may not achieve a satisfactory bound. In other words, for a
tighter bound, higher order moments such as skewness need to be considered. This problem of estimating a probability bound based on moments is
presented as the $(n, k, \Omega)$-bound problem, which means finding the tightest
bound for an n-dimensional variable in the set $\Omega$ based on up to the k-th moments. Unfortunately, as proved in [24], it is NP-hard for $(n, k, R^n)$-bound
problems with $k\ge 3$. Thus tightening the bound by simply scaling up the
moment order may be intractable in this sense. We may have to exploit other
statistical techniques to achieve this goal. Certainly, this deserves a closer
examination in the future.
1 (0 1 ) wT
1 (0 2 ) wT
1 y w1
2 y w2
,(3.48)
T
T
w1 x w1
w2 x w2
where $w_1$ and $w_2$ are the corresponding optimal solutions which maximize
$\kappa(\alpha_1)$ and $\kappa(\alpha_2)$ respectively, when $\beta_{01}$ and $\beta_{02}$ are specified.
From $\beta_{01} > \beta_{02}$ and Eq.(3.48), we have
1 (0 1 ) w1 T y w1
1 (0 2 ) wT
1 y w1
>
(3.49)
wT
w1 T x w1
1 x w1
1 (0 2 ) w2 T y w2
.
(3.50)
w2 T x w2
61
1 (0 2 ) wT y w
,
max
w=0
wT x w
we have
1 (0 2 ) wT
1 (0 2 ) wT
2 y w2
1 y w1
.
T w
wT
w
w
x 2
x 1
2
1
Fig. 3.11. The curves of $\alpha$ against $\beta$ ($f_1(\beta)$) are all concave-like in the datasets
used in this chapter. Panels: (a) Twonorm, (b) Breast, (c) Ionosphere, (d) Pima, (e) Heart-disease, (f) Vote
k1 (dx )dx
.
k2 (dy )dy
k1 (dx )
.
k2 (dy )
derivative
of $d^2/(1+d^2)$. It is easily verified that $(d^2/(1+d^2))''\le 0$ when
$d\ge 1/\sqrt{3}$. This is also illustrated in Fig. 3.12. According to the definition of
the second derivative, we immediately obtain the lemma. Note that $d\ge 1/\sqrt{3}$
corresponds to $d^2/(1+d^2)\ge 0.25$. Thus the condition can also be replaced by requiring the worst-case accuracy to be at least 0.25.
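The threshold can be checked numerically. The sketch below is ours: it uses the closed form $h''(d) = (2-6d^2)/(1+d^2)^3$, obtained by differentiating $h(d) = d^2/(1+d^2)$ twice, and compares it with a central finite difference as a sanity check.

```python
import math

def h(d):
    """h(d) = d^2 / (1 + d^2), the worst-case accuracy as a function of d."""
    return d * d / (1.0 + d * d)

def h_second(d):
    """Closed-form second derivative: (2 - 6 d^2) / (1 + d^2)^3."""
    return (2.0 - 6.0 * d * d) / (1.0 + d * d) ** 3

def h_second_numeric(d, eps=1e-4):
    """Central finite-difference approximation of the second derivative."""
    return (h(d + eps) - 2.0 * h(d) + h(d - eps)) / (eps * eps)

d0 = 1.0 / math.sqrt(3.0)
print(h(d0), h_second(d0))                 # accuracy at the threshold, curvature there
print(h_second(0.5), h_second(1.0))        # convex below d0, concave above
print(h_second_numeric(1.0), h_second(1.0))
```

The sign of the curvature changes exactly at $d = 1/\sqrt{3}$, where the worst-case accuracy equals 0.25.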
In the above procedure, $d_y$, $\beta$ increase and $d_x$, $\alpha$ decrease as the hyperplane moves towards x. Therefore, according to Lemma 3.11, $k_1(d_x)$ increases
while $k_2(d_y)$ decreases when $\alpha, \beta\in[0.25, 1)$. This shows that $f_1'$
is getting
smaller as the hyperplane moves towards x. In other words, $f_1''$
would be
less than 0 and thus $f_1$ is concave when $\alpha, \beta\in[0.25, 1)$. It should be noted
that in many well-separated real world datasets, the optimal $\alpha$ and $\beta$ will be
greater than 0.25 with high probability, since to achieve good performance,
the worst-case accuracies are naturally required to be greater than a small
amount, e.g. 0.25. This is observed in the datasets used in the chapter. All
the datasets except Pima attain their optimums satisfying this constraint.
For Pima, the overall accuracy is relatively lower, which implies that two
classes of data in this dataset appear to largely overlap each other.⁷
An illustration can be also seen in Fig. 3.13. We generate two classes of
Gaussian data with $\bar{x} = [0, 0]^T$, $\bar{y} = [L, 0]^T$, and $\Sigma_x = \Sigma_y = [1, 0; 0, 1]$.
The prior probability for each class is set to the equal value 0.5. We plot
the curves of $f_1(\beta)$ and $f_1(\beta)+\beta$ when L is set to different values. It is
observed that when two classes of data largely overlap each other, for example
in Fig. 3.12(a) with L = 1, the optimal solution of MEMPM lies in the
small-value range of $\alpha$ and $\beta$, which is usually not concave. On the other
hand, Fig. 3.12(b), (c), and (d) show that when two classes of data are well-separated, the optimal solutions lie in the region with $\alpha, \beta\in[0.25, 1)$, which
is often concave.
7
It is observed that, even for Pima, the proposed solving algorithm is still successful,
since $\alpha$ is approximately linear as shown in Fig. 3.11. Moreover, due to the fact
that the slope of $\alpha$ is slightly greater than $-1$, the final optimum naturally leads
$\beta$ to achieve its maximum.
Note that, in the above, we make an assumption that as the decision hyperplane moves, $d_x$ and $d_y$ change at an approximately fixed proportional
Fig. 3.13. The curve of $d^2/(1+d^2)$. This function is concave when $d\ge 1/\sqrt{3}$
more training time than MPM. In our experiments, MEMPM needs to solve
5 to 15 BMPM optimizations on average. Supposing that BMPM is solved
based on Conjugate Gradient Methods (with a worst-case time complexity
of the same order as MPM), MEMPM would be 5 to 15 times as expensive as
MPM. Although in pattern recognition tasks, especially in off-line classifications, effectiveness is often more important than efficiency, the expensive time cost is one of the main limitations of the MEMPM model, in particular
for large scale datasets with millions of samples. To solve this problem, one
possible direction is to remove those redundant points which actually make
little contribution to the classification. In this way, the problem dimension
(in the kernelization) would be greatly decreased, which may help in
reducing the computational time required. Another possible direction is to
exploit some techniques to decompose the Gram matrix (as is done in SVM)
and to develop some specialized optimization procedures for MEMPM. Undoubtedly, speeding up the algorithm will be a highly worthy topic in the
future.
Second, as a generalized model, MEMPM actually incorporates some
other variations. For example, when the prior probability ($\theta$) cannot be estimated reliably (e.g. in sparse data), maximizing $\alpha+\beta$, namely the sum of the
accuracies or the difference between true positive and false positive, would
be considered. This type of approach is widely used in the pattern recognition
field, e.g. in medical diagnosis [10] and in graph detection, especially line
detection and arc detection, where it is called Vector Recovery Index [9, 17].
Moreover, when there are domain experts at hand, a variation of MEMPM,
namely, the maximization of $C_x\alpha + C_y\beta$ may be used, where $C_x$ ($C_y$) is the
cost of a misclassification of x (y) obtained from experts. Exploring these
variations in some specific domains is thus a valuable direction in the future
(we actually will discuss these variations as criteria for biased or imbalanced
learning in Chapter 5).
Third, [16] has built up a connection between MPM and SVM from the
perspective of the margin definition, i.e. MPM corresponds to finding the
hyperplane with the maximal margin from the class center. Nevertheless,
some deeper connections need to be investigated, e.g. how is the bound of
MEMPM related to the generalization bound of SVM? More recently, [11] and
also the next chapter have disclosed the relationship between them from
either a local or global viewpoint of data. It is particularly useful to look into
these links and explore their further connections in the future.
3.9 Summary
In this chapter, we have proposed a novel global learning model named Minimum Error Minimax Probability Machine. By minimizing the upper bound of
the Bayes error of future data points, our model derives the distribution-free
Bayes optimal hyperplane in the worst-case setting. This thus distinguishes
References
1. Bazaraa MS (1993) Nonlinear Programming: Theory and Algorithms. New York, NY: John Wiley & Sons, 2nd edition
2. Bertsekas DP (1999) Nonlinear Programming. Athena Scientific, Belmont, Massachusetts, 2nd edition
3. Blake CL, Merz CJ (1998) Repository of machine learning databases, University of California, Irvine, http://www.ics.uci.edu/mlearn/MLRepository.html
4. Breiman L (1997) Arcing Classifiers. Technical Report 460, Statistics Department, University of California
5. Chow YS, Teicher H (1997) Probability Theory: Independence, Interchangeability, Martingales. New York, NY: Springer-Verlag, 3rd edition
6. Craven BD (1978) Mathematical Programming and Control Theory. London, UK: Chapman & Hall
7. Craven BD (1988) Fractional Programming, Sigma Series in Applied Mathematics 4. Berlin: Heldermann Verlag
8. Deco G, Obradovic D (1996) An Information-theoretic Approach to Neural Computing. Heidelberg; New York: Springer-Verlag
9. Dori D, Liu W (1999) Sparse pixel vectorization: An algorithm and its performance evaluation. IEEE Trans. Pattern Analysis and Machine Intelligence 21:202–215
10. Grzymala-Busse JW, Goodwin LK, Zhang X (2003) Increasing sensitivity of preterm birth by changing rule strengths. Pattern Recognition Letters 24:903–910
11. Huang K, Yang H, King I, Lyu MR (2004) Learning large margin classifiers locally and globally. In The 21st International Conference on Machine Learning (ICML-2004)
12. Huang K, Yang H, King I, Lyu MR, Chan L (2003) Biased minimax probability machine for medical diagnosis. In the Eighth International Symposium on Artificial Intelligence and Mathematics
13. Ibaraki T (1981) Solving mathematical programming problems with fractional objective functions. In S. Schaible and W. T. Ziemba, editors, Generalized Concavity in Optimization and Economics. New York, NY: Academic Press 441–472
14. Keysers D, Och FJ, Ney H (2002) Maximum entropy and Gaussian models for image object recognition. In Proceedings of the 24th DAGM Symposium, Lecture Notes in Computer Science. Heidelberg: Springer-Verlag, LNCS 2449:498–506
15. Lanckriet GRG, Ghaoui LE, Bhattacharyya C, Jordan MI (2001) Minimax probability machine. In Advances in Neural Information Processing Systems (NIPS)
16. Lanckriet GRG, Ghaoui LE, Bhattacharyya C, Jordan MI (2002) A robust minimax approach to classification. Journal of Machine Learning Research 3:555–582
17. Liu W, Dori D (1997) A protocol for performance evaluation of line detection algorithms. Machine Vision and Application 9:240–250
18. Maloof MA, Langley P, Binford TO, Nevatia R, Sage S (2003) Improved rooftop detection in aerial images with machine learning. Machine Learning 53:157–191
19. Mangasarian Olvi L (1994) Nonlinear Programming. Philadelphia: Society for Industrial and Applied Mathematics
20. Marshall AW, Olkin I (1960) Multivariate Chebyshev inequalities. Annals of Mathematical Statistics 31(4):1001–1014
21. Moulin Hervé (1995) Cooperative Microeconomics: a game-theoretic introduction. Princeton, NJ: Princeton University Press
22. Müller KR, Mika S, Rätsch G, Tsuda K, Schölkopf B (2001) An introduction to Kernel-based Learning Algorithms. IEEE Transactions on Neural Networks 12:181–201
23. Osuna E, Freund R, Girosi F (1997) Support Vector Machines: Training and Applications. Technical Report AIM-1602, Cambridge, MA: The MIT Press
24. Popescu I, Bertsimas D (2001) Optimal inequalities in probability theory: A convex optimization approach. Technical Report TM62, INSEAD
25. Schaible S (1977) Fractional programming. Zeitschrift für Operational Research, Serie A 27(1):39–54
26. Schaible S (1995) Fractional programming. In R. Horst and P. M. Pardalos, editors, Handbook of Global Optimization, Nonconvex Optimization and its Applications. Dordrecht, Boston, London: Kluwer Academic Publishers 495–608
27. Schölkopf B, Smola A (2002) Learning with Kernels. Cambridge, MA: The MIT Press
4 Learning Locally and Globally: Maxi-Min Margin Machine
The proposed MEMPM model obtains the decision hyperplane by using only
global information, e.g. the mean and covariance matrices. However, although
these moments can be more reliably obtained than estimating the distribution, they may still be inaccurate in many cases, e.g. when the data are very
sparse.
Recently, local learning methods, especially large margin classifiers [19],
have attracted much interest in the machine learning and pattern recognition
community. Support Vector Machine (SVM) [25], the most famous of them,
represents a state-of-the-art classifier. The essential point of SVM
is to find a linear separating hyperplane which achieves the maximal margin
among different classes of data. Furthermore, one can extend SVM to
build nonlinear separating decision hyperplanes by exploiting kernelization
techniques.
These methods do not try to summarize any global information beforehand,
but focus on obtaining the decision hyperplane in a local way. For
example, in SVM the decision boundary is exclusively determined by some
critical points which are called support vectors, whereas all other points are
totally irrelevant to this hyperplane. Although this scheme is both theoretically
and empirically demonstrated to be powerful, it actually discards the
global information of data.
An illustrative example can be seen in Fig. 4.1. In this figure, the classification
boundary is intuitively observed to be mainly determined by the
dotted axis, i.e. the long axis of the y data or the short axis of the x data
(the two classes are drawn with different markers). Moreover, along this axis,
the y data are more likely to scatter than the x data, since y contains a relatively
larger variance in this direction. Noting this global fact, a good decision
hyperplane should reasonably lie closer to the x side (see the dash-dot line).
However, SVM ignores this kind of global information, i.e. the statistical
trend of data occurrence: the derived SVM decision hyperplane (the solid
line) lies unbiasedly right in the middle of two local points (the support
vectors).
[...] geometrical interpretation, connections with other models, and the associated solving
methods. In Section 4.2, we derive a generalization bound for the M4 model. In
Section 4.3, we develop a reduction method to remove redundant points for
decreasing the computational time. In Section 4.4, we exploit the kernelization
trick to extend M4 to nonlinear classification tasks. In Section 4.5, we
evaluate this novel model on both synthetic datasets and real world benchmark
datasets. In Section 4.6, we discuss the M4 model and also
present future work. Finally, we conclude this chapter in Section 4.7. A short
version of this work can also be found in [5, 7].
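$$\max_{\rho,\, w \neq 0,\, b} \; \rho \tag{4.1}$$
$$\text{s.t.} \quad \frac{w^T x_i + b}{\sqrt{w^T \Sigma_x w}} \geq \rho, \quad i = 1, 2, \ldots, N_x, \tag{4.2}$$
$$\frac{-(w^T y_j + b)}{\sqrt{w^T \Sigma_y w}} \geq \rho, \quad j = 1, 2, \ldots, N_y, \tag{4.3}$$
where $\Sigma_x$ and $\Sigma_y$ refer to the covariance matrices of the x and the y data,
respectively.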
This model tries to maximize the margin defined as the minimum Mahalanobis
distance for all training samples, while simultaneously classifying all
the data correctly. Compared to SVM, M4 incorporates the data information
in a global way; namely, the covariance information of data or the statistical
trend of data occurrence is considered, while SVMs, including $l_1$-SVM [27]
and $l_2$-SVM [24] ($l_p$-SVM means the p-norm distance-based SVM) [19],
simply discard this information or consider the same covariance for each
class.
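As a rough illustration of the margin definition above, the following sketch evaluates, for a given hyperplane, the smallest covariance-normalized signed distance over all training points. The function names and the toy data are assumptions made for this example only; the covariance matrices are passed in as plain nested lists.

```python
import math

def quad_form(w, cov):
    """Return w^T * cov * w for a vector w and a square matrix cov (nested lists)."""
    n = len(w)
    return sum(w[r] * cov[r][c] * w[c] for r in range(n) for c in range(n))

def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * c for a, c in zip(u, v))

def m4_margin(w, b, xs, ys, cov_x, cov_y):
    """Smallest normalized signed distance attained by the hyperplane (w, b).

    x points are expected on the positive side and y points on the negative
    side; a positive return value means all points are classified correctly.
    """
    sx = math.sqrt(quad_form(w, cov_x))
    sy = math.sqrt(quad_form(w, cov_y))
    x_terms = [(dot(w, x) + b) / sx for x in xs]
    y_terms = [-(dot(w, y) + b) / sy for y in ys]
    return min(x_terms + y_terms)

# Toy data (hypothetical): x is spread more widely along the first axis than y.
xs = [[2.0, 0.0], [3.0, 1.0]]
ys = [[-2.0, 0.0], [-4.0, 1.0]]
cov_x = [[4.0, 0.0], [0.0, 1.0]]
cov_y = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

margin_with_covariance = m4_margin([1.0, 0.0], 0.0, xs, ys, cov_x, cov_y)
margin_with_identity = m4_margin([1.0, 0.0], 0.0, xs, ys, identity, identity)
```

With the identity matrix in place of both covariances the quantity reduces to the usual Euclidean margin of this hyperplane, which is the sense in which SVM treats both classes as equally scattered.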
4.1.1.2 Geometrical Interpretation
A geometrical interpretation of M4 can be seen in Fig. 4.2. In this figure, the
x data are represented by the inner ellipsoid on the left side with its center
as $x_0$, while the y data are represented by the inner ellipsoid on the right
side with its center as $y_0$. It is observed that these two ellipsoids contain
unequal covariances or risks of data occurrence. However, SVM does not
consider this global information: its decision hyperplane (the dotted line) is
located unbiasedly in the middle of two support vectors (filled points). In
comparison, M4 defines the margin as a Maxi-Min Mahalanobis distance,
which thus constructs a decision plane (the solid line) with considerations
of both the local and global information: the M4 hyperplane corresponds to
the tangent line of two dashed ellipsoids centered at the support vectors (the
local information) and shaped by the corresponding covariances (the global
information).

Fig. 4.2. A geometric interpretation of M4. The M4 hyperplane corresponds to the tangent line (the solid line) of two small dashed ellipsoids
centered at the support vectors (the local information) and shaped by the
corresponding covariances (the global information). It is thus more reasonable than SVM (the dotted line)
4.1.1.3 Optimization Method
In the following, we propose the optimization method for the M4 model. We
will demonstrate that the above problem can be cast as a sequential Conic
Programming problem, or more specifically, a sequential SOCP problem.
Our strategy is based on the "Divide and Conquer" technique. One may
note that in the optimization problem of M4, if $\rho$ is fixed to a constant $\rho_n$, the
problem is exactly changed to "conquer" the problem of checking whether
the constraints of Eqs.(4.2) and (4.3) can be satisfied. Moreover, as will be
demonstrated shortly, this checking procedure can be stated as an SOCP
problem. Thus the problem now becomes how $\rho$ is set, which we can
use "divide" to handle: if the constraints are satisfied, we can increase $\rho_n$
accordingly; otherwise, we decrease $\rho_n$.
We detail this solving technique in the following two steps:
(1) Divide: Set $\rho_n = (\rho_0 + \rho_m)/2$, where $\rho_0$ is a feasible $\rho$, $\rho_m$ is an infeasible
$\rho$, and $\rho_0 \leq \rho_m$.
(2) Conquer: Call the Modified Second Order Cone Programming (MSOCP)
procedure elaborated in the following to check whether $\rho_n$ is a feasible $\rho$.
If yes, set $\rho_0 = \rho_n$; otherwise, set $\rho_m = \rho_n$.
In the above, if a $\rho$ satisfies the constraints of Eqs.(4.2) and (4.3), we call it
a feasible $\rho$; otherwise, we call it an infeasible $\rho$. These two steps are iterated
until $|\rho_0 - \rho_m|$ is less than a small positive value.
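The two steps above amount to a bisection search over the margin. The sketch below shows only that outer loop; the feasibility check is passed in as a callable, and the toy oracle used in the usage lines is a stand-in assumed for illustration, not the MSOCP procedure itself.

```python
def bisect_margin(is_feasible, rho_lo, rho_hi, tol=1e-6):
    """Return the largest feasible rho found by bisection.

    is_feasible(rho) -> bool plays the role of the feasibility check.
    rho_lo must be feasible and rho_hi infeasible, with rho_lo <= rho_hi.
    """
    while rho_hi - rho_lo > tol:
        rho_mid = 0.5 * (rho_lo + rho_hi)  # Divide: pick the midpoint
        if is_feasible(rho_mid):           # Conquer: check feasibility
            rho_lo = rho_mid               # midpoint becomes the feasible end
        else:
            rho_hi = rho_mid               # midpoint becomes the infeasible end
    return rho_lo

# Hypothetical oracle: pretend every margin up to 1.25 is achievable.
def toy_oracle(rho):
    return rho <= 1.25

best_rho = bisect_margin(toy_oracle, rho_lo=0.0, rho_hi=4.0)
```

Each call to the oracle halves the search interval, so the number of feasibility checks grows only logarithmically with the requested precision.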
We propose the following Theorem 4.1 showing that the MSOCP procedure,
namely, the checking problem with $\rho$ fixed to a constant $\rho_n$, is solvable
by casting it as an SOCP problem.
Theorem 4.1. The problem of checking whether there exist a w and a b
satisfying the following two sets of constraints Eqs.(4.4) and (4.5) can be
transformed as an SOCP problem which can be solved in polynomial time,
$$(w^T x_i + b) \geq \rho_n \sqrt{w^T \Sigma_x w}, \quad i = 1, \ldots, N_x, \tag{4.4}$$
$$-(w^T y_j + b) \geq \rho_n \sqrt{w^T \Sigma_y w}, \quad j = 1, \ldots, N_y. \tag{4.5}$$
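Proof. Introducing dummy variables $\tau$, we rewrite the above checking problem
as an equivalent optimization problem:
$$\max_{w \neq 0,\, b,\, \tau} \; \Bigl\{ \min_{k=1,\ldots,N_x+N_y} \tau_k \Bigr\}$$
$$\text{s.t.} \quad (w^T x_i + b) \geq \rho_n \sqrt{w^T \Sigma_x w} + \tau_i,$$
$$-(w^T y_j + b) \geq \rho_n \sqrt{w^T \Sigma_y w} + \tau_{j+N_x},$$
where $i = 1, \ldots, N_x$ and $j = 1, \ldots, N_y$.
By checking whether the minimum $\tau_k$ at the optimum point is positive,
we can know whether the constraints of Eqs.(4.2) and (4.3) can be satisfied.
If we go further, we can introduce another dummy variable $\eta$ and transform
the above problem into an SOCP problem:
$$\max_{w \neq 0,\, b,\, \tau,\, \eta} \; \eta$$
$$\text{s.t.} \quad (w^T x_i + b) \geq \rho_n \sqrt{w^T \Sigma_x w} + \tau_i,$$
$$-(w^T y_j + b) \geq \rho_n \sqrt{w^T \Sigma_y w} + \tau_{j+N_x},$$
$$\tau_k \geq \eta,$$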
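$$w^T \sum_{i=1}^{N_x} x_i + N_x b \geq N_x \rho \sqrt{w^T \Sigma_x w} \;\Rightarrow\; (w^T \bar{x} + b) \geq \rho \sqrt{w^T \Sigma_x w}, \tag{4.6}$$
$$-\Bigl(w^T \sum_{j=1}^{N_y} y_j + N_y b\Bigr) \geq N_y \rho \sqrt{w^T \Sigma_y w} \;\Rightarrow\; -(w^T \bar{y} + b) \geq \rho \sqrt{w^T \Sigma_y w}, \tag{4.7}$$
$$\max_{\rho,\, w \neq 0} \; \rho \quad \text{s.t.} \quad w^T(\bar{x} - \bar{y}) \geq \rho \Bigl(\sqrt{w^T \Sigma_x w} + \sqrt{w^T \Sigma_y w}\Bigr). \tag{4.8}$$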
The above optimization is exactly the MPM optimization [11]. Note, however, that the above procedure cannot be reversed. This means that MPM is
a special case of M4 .
Remarks. In MPM, since the decision is completely determined by the global
information, namely, the mean and covariance matrices [11], the estimates of
the mean and covariance matrices need to be reliable to assure an accurate
performance. However, this cannot always be the case in real world tasks. On
the other hand, M4 seems to solve this problem in a natural way, because
the impact caused by inaccurately estimated mean and covariance matrices
can be neutralized by utilizing the local information, namely by satisfying
those constraints of Eqs.(4.2) and (4.3) for each local data point. This is also
demonstrated in the later experiment.
4.1.2.2 Connection with Support Vector Machine
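If one assumes $\Sigma_x = \Sigma_y = \Sigma$, the optimization of M4 can be changed as:
$$\max_{\rho,\, w \neq 0,\, b} \; \rho$$
$$\text{s.t.} \quad (w^T x_i + b) \geq \rho \sqrt{w^T \Sigma w},$$
$$-(w^T y_j + b) \geq \rho \sqrt{w^T \Sigma w},$$
where $i = 1, \ldots, N_x$ and $j = 1, \ldots, N_y$.
Observing that the magnitude of w will not influence the optimization,
one can fix the scale of w by setting $\rho \sqrt{w^T \Sigma w} = 1$, so that maximizing
$\rho$ amounts to minimizing $w^T \Sigma w$:
$$\min_{w \neq 0,\, b} \; w^T \Sigma w, \tag{4.9}$$
$$\text{s.t.} \quad (w^T x_i + b) \geq 1, \tag{4.10}$$
$$-(w^T y_j + b) \geq 1, \tag{4.11}$$
where $i = 1, \ldots, N_x$ and $j = 1, \ldots, N_y$.
A special case of the above with $\Sigma = I$ is precisely the optimization of
SVM, where I is the identity matrix.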
Remarks. In the above, two assumptions are implicitly made by SVM: One
is the assumption on data "orientation" or data shape, i.e. $\Sigma_x = \Sigma_y = \Sigma$,
and the other is the assumption on data "scattering magnitude" or data
compactness, i.e. $\Sigma = I$. However, these two assumptions are inappropriate.
We demonstrate this in Figs. 4.3 and 4.4. We assume the orientation and
the magnitude of each ellipsoid represent the data shape and compactness,
respectively, in these figures.
Fig. 4.3 plots two types of data with the same data orientations but different
data scattering magnitudes. By ignoring data scattering, SVM improperly
locates itself unbiasedly in the middle of the support vectors (filled points),
since x is more likely to scatter along the horizontal axis.
Instead, M4 is more reasonable (see the solid line in this figure). Furthermore,
Fig. 4.4 plots the case with the same data scattering magnitudes but different
data orientations. Similarly, SVM does not capture the orientation information.
In comparison, M4 grasps this information and demonstrates a more
suitable decision plane: M4 represents the tangent line between two small
dashed ellipsoids centered at the support vectors (filled points). Note that
SVM and M4 do not need to achieve the same support vectors. In Fig. 4.4,
M4 contains the above two filled points as support vectors, whereas SVM has
all the three filled points as support vectors.

Fig. 4.4. An illustration that SVM discards the data orientation information
4.1.2.3 Link with Fisher Discriminant Analysis
FDA, an important and popular method, is used widely in constructing decision
hyperplanes and reducing the feature dimensionality. In the following
discussion, we mainly consider its application as a classifier. FDA involves
solving the following optimization problem:
$$\max_{w \neq 0} \; \frac{|w^T(\bar{x} - \bar{y})|}{\sqrt{w^T \Sigma_x w + w^T \Sigma_y w}}.$$
Similar to MPM, FDA also focuses on using the global information rather
than considering data both locally and globally. We now show that FDA can
be modified to consider data both locally and globally.
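$$\max_{\rho,\, w \neq 0,\, b} \; \rho, \tag{4.12}$$
$$\text{s.t.} \quad \frac{w^T x_i + b}{\sqrt{w^T \Sigma_x w + w^T \Sigma_y w}} \geq \rho, \tag{4.13}$$
$$\frac{-(w^T y_j + b)}{\sqrt{w^T \Sigma_x w + w^T \Sigma_y w}} \geq \rho, \tag{4.14}$$

$$\max_{\rho,\, w \neq 0,\, b} \; \rho \quad \text{s.t.} \quad w^T(\bar{x} - \bar{y}) \geq \rho \sqrt{w^T \Sigma_x w + w^T \Sigma_y w}. \tag{4.15}$$
One can change Eq.(4.15) as:
$$\rho \leq \frac{|w^T(\bar{x} - \bar{y})|}{\sqrt{w^T \Sigma_x w + w^T \Sigma_y w}},$$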
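$$\max_{\rho,\, w \neq 0,\, b,\, \xi} \; \rho - C \sum_{k=1}^{N_x+N_y} \xi_k, \tag{4.16}$$
$$\text{s.t.} \quad (w^T x_i + b) \geq \rho \sqrt{w^T \Sigma_x w} - \xi_i, \tag{4.17}$$
$$-(w^T y_j + b) \geq \rho \sqrt{w^T \Sigma_y w} - \xi_{j+N_x}, \tag{4.18}$$
$$\xi_k \geq 0,$$

$$\max_{\rho,\, w \neq 0,\, b,\, \xi} \; \rho - C \sum_{k=1}^{N_x+N_y} \xi_k, \tag{4.19}$$
$$\text{s.t.} \quad \frac{w^T x_i + b}{\sqrt{w^T \Sigma_x w}} \geq \rho - \xi_i, \tag{4.20}$$
$$\frac{-(w^T y_j + b)}{\sqrt{w^T \Sigma_y w}} \geq \rho - \xi_{j+N_x}, \tag{4.21}$$
$$\xi_k \geq 0, \tag{4.22}$$
where $i = 1, \ldots, N_x$, $j = 1, \ldots, N_y$, and $k = 1, \ldots, N_x + N_y$.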
Maximizing the objective in Eq.(4.19) contains a similar meaning as minimizing
$B \sum_{k=1}^{N_x+N_y} \xi_k + 1/\rho^2$ (B is a positive parameter) in the sense that they both
attempt to maximize the margin and minimize the error rate. If we consider
$\sum_{k=1}^{N_x+N_y} \xi_k$ as the residue and regard $1/\rho^2$ as the regularization term, the [...]

[Footnote 4] A trick can be made by treating $1/\rho^2$ as a new variable, so that the condition
that the regularization is convex can be satisfied.
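$$\min_{\rho,\, w \neq 0,\, b,\, \xi} \; \sum_{k=1}^{N_x+N_y} \xi_k, \tag{4.23}$$
$$\text{s.t.} \quad \frac{w^T x_i + b}{\sqrt{w^T \Sigma_x w}} \geq \rho - \xi_i, \tag{4.24}$$
$$\frac{-(w^T y_j + b)}{\sqrt{w^T \Sigma_y w}} \geq \rho - \xi_{j+N_x}, \tag{4.25}$$
$$\rho \geq A, \quad \xi_k \geq 0, \tag{4.26}$$
Expanding Eq.(4.24) for each i and adding the results together, we obtain:
$$\sum_{i=1}^{N_x} \frac{w^T x_i + b}{\sqrt{w^T \Sigma_x w}} \geq N_x \rho - \sum_{i=1}^{N_x} \xi_i. \tag{4.27}$$
This can be rewritten as:
$$\sum_{i=1}^{N_x} \xi_i \geq N_x \rho - N_x \frac{w^T \bar{x} + b}{\sqrt{w^T \Sigma_x w}}. \tag{4.28}$$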
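Similarly, if we expand Eq.(4.25) for each j and add them all together, we
obtain:
$$\sum_{j=1}^{N_y} \xi_{j+N_x} \geq N_y \rho + N_y \frac{w^T \bar{y} + b}{\sqrt{w^T \Sigma_y w}}. \tag{4.29}$$
Adding Eqs.(4.28) and (4.29), with $N = N_x + N_y$, gives:
$$\sum_{k=1}^{N_x+N_y} \xi_k \geq N \rho - \Bigl( N_x \frac{w^T \bar{x} + b}{\sqrt{w^T \Sigma_x w}} - N_y \frac{w^T \bar{y} + b}{\sqrt{w^T \Sigma_y w}} \Bigr). \tag{4.30}$$
To achieve minimum training error, namely, $\min_{\rho,\, w \neq 0,\, b,\, \xi} \sum_{k=1}^{N_x+N_y} \xi_k$, we
may consider minimizing its lower bound as specified by the right hand side
of Eq.(4.30). Hence in this case $\rho$ should attain its lower bound A, while the
second part should be as large as possible, i.e.
$$\max_{w \neq 0,\, b} \; \theta \frac{w^T \bar{x} + b}{\sqrt{w^T \Sigma_x w}} - (1 - \theta) \frac{w^T \bar{y} + b}{\sqrt{w^T \Sigma_y w}}, \tag{4.31}$$
where $\theta$ is defined as $N_x/N$ and thus $1 - \theta$ denotes $N_y/N$. If one further
transforms the above to: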
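$$\max_{w \neq 0,\, b} \; \{\theta t + (1 - \theta) s\}, \tag{4.32}$$
$$\text{s.t.} \quad \frac{w^T \bar{x} + b}{\sqrt{w^T \Sigma_x w}} \geq t, \tag{4.33}$$
$$\frac{-(w^T \bar{y} + b)}{\sqrt{w^T \Sigma_y w}} \geq s, \tag{4.34}$$
one can see that the above optimizes a very similar form as the MEMPM
model except that the objective Eq.(4.32) changes to [6]
$$\max_{w \neq 0,\, b} \; \Bigl\{ \theta \frac{t^2}{1 + t^2} + (1 - \theta) \frac{s^2}{1 + s^2} \Bigr\}.$$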
In MEMPM, $t^2/(1 + t^2)$ ($s^2/(1 + s^2)$), denoted as $\alpha$ ($\beta$), represents the worst-case
accuracy for the classification of future x (y) data. Thus MEMPM maximizes
the weighted accuracy on the future data. In M4, s and t represent the
corresponding margin, which is defined as the distance from the hyperplane
to the class center. Therefore, it represents the weighted maximum margin
machine in this sense. Moreover, since the function $g(u) = u^2/(1 + u^2)$
increases monotonically with u, maximizing the above formulae contains a
physical meaning similar to the optimization of MEMPM in some sense.
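The monotonicity of g used in this argument can be checked directly. The short derivation below assumes $u \geq 0$, which is the relevant range here because t and s act as margins:

```latex
g(u) = \frac{u^2}{1+u^2} = 1 - \frac{1}{1+u^2},
\qquad
g'(u) = \frac{2u}{(1+u^2)^2} \;\geq\; 0 \quad \text{for } u \geq 0 .
```

Hence a larger t (or s) always gives a larger $g(t)$ (or $g(s)$), with $g(0) = 0$ and $g(u) \to 1$ as $u \to \infty$.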
Remarks. Implicit constraints are contained in the optimization of the
above derived special case of M4. Empirically, Eq.(4.27) cannot achieve the
equality in the normal case, since Eqs.(4.24) and (4.25) can only achieve
equalities for support vectors. Moreover, the slack variables $\xi$ are usually far
smaller than $\rho$. This implies we can consider
$$\frac{w^T \bar{x} + b}{\sqrt{w^T \Sigma_x w}} > \rho = A.$$
Analogously, for y, a similar statement can be obtained. The presence of
these two constraints is essential, since with the constraints the parameter $\rho$
is involved in the optimization. Moreover, these two constraints also prevent
the circumstance that the decision hyperplane is extremely far away from one
class center, while being very close to the other class center.
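[...]
$$\theta \frac{1}{1 + d_x^2} + (1 - \theta) \frac{1}{1 + d_y^2},$$
where m is the number of support vectors, $d_x$ and $d_y$ are the corresponding
Mahalanobis distances from the class centers $\bar{x}$ and $\bar{y}$ to the decision
hyperplane, and $\theta$ is the prior probability of the x data. Namely,
$$E[P_{error}] \leq E\Bigl[\min\Bigl(\frac{m}{N},\; \theta \frac{1}{1 + d_x^2} + (1 - \theta) \frac{1}{1 + d_y^2}\Bigr)\Bigr]. \tag{4.35}$$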
Proof. According to Lemma 4.2, to prove $E[P_{error}] \leq E[\frac{m}{N}]$, we only need
to show that the number of errors by the leave-one-out method does not
exceed the number of support vectors. Actually, this is the case. If we leave a
non-support vector out and then perform training on the remaining data,
the decision hyperplane will not change, since the decision hyperplane is
decided only by the support vectors and the covariance matrices (statistically, one
point will not influence the covariance of data). Therefore, this non-support
vector will be recognized correctly. Thus the leave-one-out method classifies
correctly all the samples that are not support vectors, i.e. the number of the
leave-one-out errors does not exceed the number of the support vectors.
We next prove $E[P_{error}] \leq E\bigl[\min\bigl(\frac{m}{N},\; \theta \frac{1}{1+d_x^2} + (1 - \theta) \frac{1}{1+d_y^2}\bigr)\bigr]$.
According to [11, 6, 14], if the means and covariances are reliably estimated,
$d_x^2/(1 + d_x^2)$ and $d_y^2/(1 + d_y^2)$ represent the worst-case rates in recognizing
correctly the x data and y data respectively. Therefore,
$$\theta \frac{1}{1 + d_x^2} + (1 - \theta) \frac{1}{1 + d_y^2}$$
is the worst-case error rate, since $1 - d^2/(1 + d^2) = 1/(1 + d^2)$. Combining this
with the first part yields
$$E[P_{error}] \leq E\Bigl[\min\Bigl(\frac{m}{N},\; \theta \frac{1}{1 + d_x^2} + (1 - \theta) \frac{1}{1 + d_y^2}\Bigr)\Bigr].$$
Remarks. Note that the above two items actually represent two meanings
of the M4 model, i.e. minimizing the leave-one-out error presents the contribution
by considering the local information from data; on the other hand,
the second item describes the effect by considering the global information
from data. Moreover, if we further examine the second item, $d_x$ ($d_y$) is actually
determined by two parts: the Mahalanobis distance from the support
vectors to the corresponding class center $\bar{x}$ ($\bar{y}$) and the margin $\rho$. This can
be observed in Fig. 4.2. Intuitively, the larger the margin is, the larger $d_x$
and $d_y$ are, which leads to a smaller expected test error in the future. This
motivates the margin maximization in the large margin machines.
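To make the two items concrete, the sketch below evaluates the quantity inside the expectation of Eq.(4.35) for a single training run. The function name and the sample numbers are assumptions chosen for illustration; the expectation over training sets is not computed.

```python
def m4_error_bound(num_support, num_train, theta, d_x, d_y):
    """Evaluate min(m/N, theta/(1+d_x^2) + (1-theta)/(1+d_y^2)) for one run.

    num_support: number of support vectors m
    num_train:   number of training points N
    theta:       prior probability of the x class
    d_x, d_y:    Mahalanobis distances from the class centers to the hyperplane
    """
    local_item = num_support / num_train
    global_item = theta / (1.0 + d_x ** 2) + (1.0 - theta) / (1.0 + d_y ** 2)
    return min(local_item, global_item)

# Hypothetical numbers: 6 support vectors out of 40 points, balanced classes.
bound_near = m4_error_bound(6, 40, theta=0.5, d_x=1.0, d_y=1.0)
bound_far = m4_error_bound(6, 40, theta=0.5, d_x=3.0, d_y=3.0)
```

In the first call the local item (6/40) is the smaller one; in the second the hyperplane lies farther from both class centers, so the global item takes over and the bound drops.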
4.3 Reduction
The variables in previous sections are $[w, b, \xi_1, \ldots, \xi_{N_x}, \ldots, \xi_{N_x+N_y}]$, whose
dimension is $n + 1 + N_x + N_y$. The number of the second order conic constraints
is easily verified to be $N_x + N_y$. The size of the generated constraint
matrix will thus be large and may cause problems in solving
large scale classification tasks. Therefore, we should reduce both the number
of constraints and the number of variables.
Since this problem is caused by the number of the data points, we consider
removing some redundant points to reduce both the space and time
complexity. The reduction rule is introduced as follows.
Reduction Rule: Set a threshold $\nu \in [0, 1)$. In each class, calculate the
Mahalanobis distance $d_i$ of each point to its corresponding class center. If
$d_i^2/(1 + d_i^2)$, denoted as $\nu_i$, is greater than $\nu$, namely, $\nu_i \geq \nu$, keep this point;
otherwise, remove this point.
The intuition behind this rule is that, in general, the more discriminant
information a point contains, the further it is from its center (unless it is a
noise point). The inner justification for this rule is from [11]: $d^2/(1 + d^2)$ is
the worst-case classification accuracy for future data, where d is the minimax
Mahalanobis distance from the class center to the decision hyperplane. Thus
removing those points with small $\nu_i$'s, namely, $d_i^2/(1 + d_i^2) \leq \nu$, will not affect
the worst-case classification accuracy and will not greatly reduce the overall
performance.
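A minimal sketch of the reduction rule for one class follows. It assumes the class center and the inverse of the class covariance matrix have already been computed and are passed in as plain lists; the function names and the toy points are hypothetical.

```python
def mahalanobis_sq(point, center, cov_inv):
    """Squared Mahalanobis distance (p - c)^T * cov_inv * (p - c)."""
    diff = [p - c for p, c in zip(point, center)]
    n = len(diff)
    return sum(diff[r] * cov_inv[r][c] * diff[c] for r in range(n) for c in range(n))

def reduce_class(points, center, cov_inv, threshold):
    """Keep the points whose score d^2 / (1 + d^2) reaches the threshold."""
    kept = []
    for p in points:
        d2 = mahalanobis_sq(p, center, cov_inv)
        score = d2 / (1.0 + d2)
        if score >= threshold:
            kept.append(p)
    return kept

# Toy class centered at the origin with identity covariance (assumed values).
points = [[0.5, 0.0], [2.0, 0.0], [0.0, 3.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
remaining = reduce_class(points, [0.0, 0.0], identity, threshold=0.7)
```

Here the three points have scores 0.2, 0.8 and 0.9, so only the first one, which lies close to the class center, is removed.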
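Nevertheless, to cancel the negative impact caused by removing those
points, we add the following global constraint:
$$w^T(\bar{x} - \bar{y}) \geq \rho \Bigl(\sqrt{w^T \Sigma_x w} + \sqrt{w^T \Sigma_y w}\Bigr). \tag{4.36}$$
Integrating the above, we formulate the modified model as follows:
$$\max_{\rho,\, w \neq 0,\, b,\, \xi} \; \rho - C \Bigl( \sum_{k=1}^{r_x+r_y} \xi_k + (N_x + N_y - r_x - r_y)\, \xi_m \Bigr)$$
$$\text{s.t.} \quad (w^T x_i + b) \geq \rho \sqrt{w^T \Sigma_x w} - \xi_i, \quad i = 1, \ldots, r_x,$$
$$-(w^T y_j + b) \geq \rho \sqrt{w^T \Sigma_y w} - \xi_{j+r_x}, \quad j = 1, \ldots, r_y,$$
$$w^T(\bar{x} - \bar{y}) \geq \rho \Bigl(\sqrt{w^T \Sigma_x w} + \sqrt{w^T \Sigma_y w}\Bigr) - \xi_m,$$
$$\xi_m \geq 0,$$
$$\xi_k \geq 0, \quad k = 1, \ldots, r_x + r_y,$$
where $\xi_m$ is the slack variable for the global constraint Eq.(4.36), $\xi_k$ are
modified slack variables for the remaining data points, $r_x$ is the number of
the remaining points for x, and $r_y$ is the number of the remaining points
for y.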
Remarks. An interesting observation from the above is that when we set the
reduction threshold $\nu$ to a larger value, or simply to the maximum value 1, the
M4 optimization degrades to the standard MPM optimization. This would
imply that the above modified M4 model contains a worst-case performance
of MPM, if the incorporated local information is useful.
4.4 Kernelization
One may note that in the above, the classifier derived from M4 is provided in
a linear configuration. In order to handle nonlinear classification problems,
in this section, we seek to use the kernelization trick [18] to map the
n-dimensional data points into a high-dimensional feature space $R^f$, where a
linear classifier corresponds to a nonlinear hyperplane in the original space.
The kernel mapping can be formulated as: $x_i \to \varphi(x_i)$, $y_j \to \varphi(y_j)$,
where $i = 1, \ldots, N_x$, $j = 1, \ldots, N_y$, and $\varphi: R^n \to R^f$ is a mapping function.
The corresponding linear classifier in $R^f$ is $\gamma^T \varphi(z) = b$, where $\gamma, \varphi(z) \in R^f$,
and $b \in R$.
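The optimization of M4 in the feature space can be written as:
$$\max_{\rho,\, \gamma \neq 0,\, b} \; \rho \tag{4.37}$$
$$\text{s.t.} \quad \frac{\gamma^T \varphi(x_i) + b}{\sqrt{\gamma^T \Sigma_{\varphi(x)} \gamma}} \geq \rho, \quad i = 1, 2, \ldots, N_x, \tag{4.38}$$
$$\frac{-(\gamma^T \varphi(y_j) + b)}{\sqrt{\gamma^T \Sigma_{\varphi(y)} \gamma}} \geq \rho, \quad j = 1, 2, \ldots, N_y. \tag{4.39}$$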
However, to make the kernel work we need to represent the optimization and
the final decision hyperplane in a kernel form, $K(z_1, z_2) = \varphi(z_1)^T \varphi(z_2)$,
namely, an inner product form of the mapping data points.
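As a small illustration of the kernel form, the sketch below builds the Gram matrix over the stacked training points with a Gaussian kernel. The parametrization exp(-||z1 - z2||^2 / width) and the function names are assumptions of this example; the text does not fix them at this point.

```python
import math

def gaussian_kernel(z1, z2, width):
    """Gaussian kernel exp(-||z1 - z2||^2 / width); the parametrization is assumed."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return math.exp(-sq_dist / width)

def gram_matrix(xs, ys, kernel):
    """Gram matrix over the stacked points z = [x_1..x_Nx, y_1..y_Ny]."""
    zs = list(xs) + list(ys)
    return [[kernel(zi, zj) for zj in zs] for zi in zs]

# Two toy points, one per class, with a hypothetical kernel width of 2.
K = gram_matrix([[0.0, 0.0]], [[1.0, 0.0]],
                lambda a, b: gaussian_kernel(a, b, width=2.0))
```

Only kernel evaluations between pairs of training points are needed, so the mapping into the feature space never has to be computed explicitly.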
4.4.1 Foundation of Kernelization for M4
In the following, we demonstrate that the kernelization trick indeed works in
M4 , provided suitable estimates of means and covariance matrices are applied
therein.
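Corollary 4.4. If the estimates of means and covariance matrices are given
in M4 as the following estimates:
$$\overline{\varphi(x)} = \sum_{i=1}^{N_x} \lambda_i \varphi(x_i), \qquad \overline{\varphi(y)} = \sum_{j=1}^{N_y} \omega_j \varphi(y_j),$$
$$\Sigma_{\varphi(x)} = \rho_x I_n + \sum_{i=1}^{N_x} \Lambda_i \bigl(\varphi(x_i) - \overline{\varphi(x)}\bigr)\bigl(\varphi(x_i) - \overline{\varphi(x)}\bigr)^T,$$
$$\Sigma_{\varphi(y)} = \rho_y I_n + \sum_{j=1}^{N_y} \Omega_j \bigl(\varphi(y_j) - \overline{\varphi(y)}\bigr)\bigl(\varphi(y_j) - \overline{\varphi(y)}\bigr)^T,$$
[...]
$$\max_{\rho,\, \{\gamma_p, \gamma_d\} \neq 0,\, b} \; \rho$$
$$\text{s.t.} \quad \frac{\gamma_p^T \varphi(x_i) + b}{\sqrt{\gamma_p^T \sum_{l=1}^{N_x} \Lambda_l \bigl(\varphi(x_l) - \overline{\varphi(x)}\bigr)\bigl(\varphi(x_l) - \overline{\varphi(x)}\bigr)^T \gamma_p + \rho_x (\gamma_p^T \gamma_p + \gamma_d^T \gamma_d)}} \geq \rho,$$
$$\frac{-(\gamma_p^T \varphi(y_j) + b)}{\sqrt{\gamma_p^T \sum_{l=1}^{N_y} \Omega_l \bigl(\varphi(y_l) - \overline{\varphi(y)}\bigr)\bigl(\varphi(y_l) - \overline{\varphi(y)}\bigr)^T \gamma_p + \rho_y (\gamma_p^T \gamma_p + \gamma_d^T \gamma_d)}} \geq \rho,$$
$$\gamma = \sum_{i=1}^{N_x} \mu_i \varphi(x_i) + \sum_{j=1}^{N_y} \upsilon_j \varphi(y_j), \tag{4.40}$$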
Theorem 4.5. [Kernelization Theorem of M4] The optimal decision hyperplane
for M4 involves solving the following optimization problem:
$$\max_{\rho,\, \eta \neq 0,\, b} \; \rho$$
$$\text{s.t.} \quad \frac{\eta^T K_i + b}{\sqrt{\frac{1}{N_x} \eta^T \tilde{K}_x^T \tilde{K}_x \eta}} \geq \rho, \quad i = 1, 2, \ldots, N_x,$$
$$\frac{-(\eta^T K_{j+N_x} + b)}{\sqrt{\frac{1}{N_y} \eta^T \tilde{K}_y^T \tilde{K}_y \eta}} \geq \rho, \quad j = 1, 2, \ldots, N_y.$$
Proof. The theorem can easily be proved by simply substituting the plug-in
estimations of means and covariance matrices and Eq.(4.40) into Eqs.(4.38)
and (4.39).
The optimal decision hyperplane can be represented as a linear form in
the kernel space:
$$f(z) = \sum_{i=1}^{N_x} \eta_i K(z, x_i) + \sum_{i=1}^{N_y} \eta_{N_x+i} K(z, y_i) + b,$$
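where the notation is defined as follows:
$$z_i := x_i, \quad i = 1, 2, \ldots, N_x; \qquad z_i := y_{i-N_x}, \quad i = N_x + 1, N_x + 2, \ldots, N_x + N_y,$$
$$\eta := [\mu_1, \ldots, \mu_{N_x}, \upsilon_1, \ldots, \upsilon_{N_y}]^T,$$
$$K_{i,j} := \varphi(z_i)^T \varphi(z_j),$$
$$K_x := \begin{pmatrix} K_{1,1} & K_{1,2} & \cdots & K_{1,N_x+N_y} \\ K_{2,1} & K_{2,2} & \cdots & K_{2,N_x+N_y} \\ \vdots & \vdots & \ddots & \vdots \\ K_{N_x,1} & K_{N_x,2} & \cdots & K_{N_x,N_x+N_y} \end{pmatrix},$$
$$K_y := \begin{pmatrix} K_{N_x+1,1} & K_{N_x+1,2} & \cdots & K_{N_x+1,N_x+N_y} \\ K_{N_x+2,1} & K_{N_x+2,2} & \cdots & K_{N_x+2,N_x+N_y} \\ \vdots & \vdots & \ddots & \vdots \\ K_{N_x+N_y,1} & K_{N_x+N_y,2} & \cdots & K_{N_x+N_y,N_x+N_y} \end{pmatrix},$$
$$\tilde{k}_x, \tilde{k}_y \in R^{N_x+N_y}: \quad [\tilde{k}_x]_i := \frac{1}{N_x} \sum_{j=1}^{N_x} K(x_j, z_i), \qquad [\tilde{k}_y]_i := \frac{1}{N_y} \sum_{j=1}^{N_y} K(y_j, z_i),$$
$$1_{N_x} \in R^{N_x}: \; 1_i := 1, \; i = 1, 2, \ldots, N_x; \qquad 1_{N_y} \in R^{N_y}: \; 1_i := 1, \; i = 1, 2, \ldots, N_y,$$
$$\tilde{K} := \begin{pmatrix} \tilde{K}_x \\ \tilde{K}_y \end{pmatrix} := \begin{pmatrix} K_x - 1_{N_x} \tilde{k}_x^T \\ K_y - 1_{N_y} \tilde{k}_y^T \end{pmatrix}.$$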
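The centered blocks in the notation above can be computed directly from the rows of the Gram matrix that belong to one class. The sketch below assumes that reading: the vector of class means over kernel values is the vector of column means of the block, and each row has it subtracted. The function name is hypothetical.

```python
def center_block(block):
    """Subtract from every row of a class block its vector of column means.

    block: the rows of the Gram matrix that belong to one class, given as a
    list of equal-length rows. Returns the centered block as a new list of rows.
    """
    n_rows = len(block)
    n_cols = len(block[0])
    col_means = [sum(row[c] for row in block) / n_rows for c in range(n_cols)]
    return [[row[c] - col_means[c] for c in range(n_cols)] for row in block]

# Toy block with two rows (two points of one class) and two columns.
centered = center_block([[1.0, 2.0], [3.0, 4.0]])
```

After centering, every column of the block sums to zero, which mirrors the subtraction of the class mean in the feature space.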
4.5 Experiments
In this section, we present the evaluation results of M4 in comparison with
SVM and MPM on both synthetic toy datasets and real world benchmark
datasets. SOCP problems are solved based on the general software named
Sedumi [20, 21]. The covariance matrices are given by the plug-in estimates.
4.5.1 Evaluations on Three Synthetic Toy Datasets
We demonstrate the advantages of our approach in comparison with SVM
and MPM on the following synthetic toy datasets first.
As illustrated in Fig. 4.6, we generate two types of data with the same
data orientations but different data magnitudes in Fig. 4.6 (a), while we generate
two types of data with the same data magnitudes but different data
orientations in Fig. 4.6 (b). In (a), the x data are randomly sampled from
the Gaussian distribution with the mean as $[-3.5, 0]^T$ and the covariance as
[3, 0; 0, 4.5], while the y data are randomly sampled from another Gaussian
distribution with the mean and the covariance as $[3.5, 0]^T$ and [1, 0; 0, 1.5]
respectively. In (b), the x data are randomly sampled from the Gaussian
distribution with the mean as $[-4, 0]^T$ and the covariance as [1, 0; 0, 5], while
the y data are randomly sampled from another distribution with the mean
and the covariance as $[4, 0]^T$ and [1, 0; 0, 5] respectively. Moreover, to generate
different data orientations, in Fig. 4.6 (b) the y data are rotated anti-clockwise
by an angle of 78°. In both (a) and (b), training (test) data consisting of 120
(250) data points for each class are plotted as o's (+'s) for x and with a
second pair of markers for y. As observed from Fig. 4.6, M4 demonstrates its advantages
over SVM. More specifically, in Fig. 4.6 (a), SVM discards the information of
the data magnitudes, and its decision hyperplane lies basically in the middle
of the boundary points of the two types of data, while M4 successfully utilizes this
information, i.e. its decision hyperplane lies closer to the compact class (y
data), which is more reasonable. Similarly, in Fig. 4.6 (b), M4 takes advantage
of the information of the data orientation, while SVM simply overlooks
this information, which results in a lot of points being incorrectly classified.
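A dataset of the kind described for Fig. 4.6 (b) can be sampled along the following lines. This is only a sketch: the negative sign on the x mean, the choice to rotate about the class center, and the seeds are assumptions of the example, and the function names are hypothetical.

```python
import math
import random

def rotate(point, angle_deg):
    """Rotate a 2-D point anti-clockwise about the origin."""
    a = math.radians(angle_deg)
    x, y = point
    return (x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a))

def sample_class(mean, variances, n, angle_deg=0.0, seed=None):
    """Draw n points from a 2-D Gaussian with diagonal covariance.

    The zero-mean sample is rotated first and then shifted to `mean`
    (rotating about the class center is an assumption of this sketch).
    """
    rng = random.Random(seed)
    sx, sy = math.sqrt(variances[0]), math.sqrt(variances[1])
    points = []
    for _ in range(n):
        dx, dy = rotate((rng.gauss(0.0, sx), rng.gauss(0.0, sy)), angle_deg)
        points.append((mean[0] + dx, mean[1] + dy))
    return points

# Equal covariances, y rotated by 78 degrees, 120 training points per class.
x_train = sample_class((-4.0, 0.0), (1.0, 5.0), 120, seed=0)
y_train = sample_class((4.0, 0.0), (1.0, 5.0), 120, angle_deg=78.0, seed=1)
```

Fixing the seeds makes the draw repeatable, which helps when comparing classifiers on exactly the same toy data.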
Comparing MPM with M4, since in the above two datasets the global
information, i.e. the mean and the covariance, can be reliably estimated from
data, they achieve similar performance. To see the difference between M4 and
MPM, we generate another dataset as illustrated in Fig. 4.7, where we intentionally
generate a very small number of training data, i.e. only 20 training
points. Similarly, the data are generated under two Gaussian distributions:
the x data are randomly sampled from the Gaussian distribution with the
mean as $[-3, 0]^T$ and the covariance as [0.5, 0; 0, 8], while the y data are
randomly sampled from another distribution with the mean and the covariance
as $[4, 0]^T$ and [6, 0; 0, 1] respectively. Training data and test data
are represented using similar symbols to Fig. 4.6. From Fig. 4.7, once again
M4 achieves an ideal decision boundary which considers data both locally and
globally; whereas SVM obtains a local boundary just in the middle of the support
vectors, which discards the global information, namely the statistical
trend of data occurrence. For MPM, its decision hyperplane is exclusively
dependent on the mean and covariance matrices. Thus we can see that this
hyperplane coincides with the data shape, i.e. the long axis of the training data of
x is nearly in the same direction as the MPM decision hyperplane. However,
the estimated mean and covariance are inaccurate due to the small number
of data points. This results in a relatively lower test accuracy as illustrated
in Fig. 4.7 (b). In comparison, M4 incorporates the information of the local
points to neutralize the effect caused by inaccurate estimations. The test
accuracies for the above three toy datasets listed in Table 4.2 also demonstrate
the advantages of M4.

Fig. 4.6. The first two synthetic toy examples to illustrate M4. Training
(test) data consisting of 120 (250) data points for each class are plotted as
o's (+'s) for x and with a second pair of markers for y. Subfigure (a) demonstrates
that SVM omits the data compactness information and (b) demonstrates
that SVM discards the data orientation information, while M4 achieves an
ideal decision boundary which considers data both locally and globally

Fig. 4.7. The third synthetic toy example to illustrate M4. Training (test)
data consisting of 20 (60) data points for each class are plotted as o's
(+'s) for x and with a second pair of markers for y. Subfigure (a) demonstrates
the decision boundaries derived from training data, while (b) illustrates
the performance of these hyperplanes on the test set. M4 achieves an ideal
decision boundary which considers data both locally and globally
Table 4.2. Test accuracies (%) on the three synthetic toy datasets

Dataset    M4      SVM     MPM
I          98.8    96.8    98.8
II         98.8    97.2    98.8
III        98.3    97.5    95.8

4.5.2 Evaluations on Benchmark Datasets
We perform evaluations on seven standard datasets. Data for the Twonorm problem
are synthetically generated according to [3]. The remaining six datasets
are real world data obtained from the UCI machine learning repository [2].
We compared M4 with SVM and MPM using both the linear and
Gaussian kernels. The parameter C for both M4 and SVM was tuned via
cross validations [9], as was the width parameter in the Gaussian kernel for
all three models. The final performance results were obtained via 10-fold
cross validation. Table 4.3 summarizes the evaluation results.
Table 4.3. Comparisons of classification accuracies among M4, SVM, and MPM
Dataset
Twonorm
SVM
MPM
M4
SVM
MPM
96.5 0.6
95.1 0.7
97.6 0.5
96.5 0.7
96.1 0.4
97.6 0.5
Breast
96.9 0.8
96.9 0.8
Ionosphere
84.8 0.8
92.3 0.6
Pima
76.1 1.2
86.5 1.1
76.2 1.2
Sonar
75.5 1.1
87.3 0.8
Vote
94.8 0.4
94.6 0.4
83.2 0.8
83.1 1.0
From the results we observe that M4 achieves the best overall performance.
In comparison with SVM and MPM, M4 wins five cases with the linear
kernel and four with the Gaussian kernel. The evaluations on these standard
benchmark datasets demonstrate that it is worth considering data both locally
and globally, which is emphasized in M4. Inspecting the differences
between M4 and SVM, the kernelized M4 appears marginally better than
the kernelized SVM, while the linear M4 demonstrates a distinctive advantage
over the linear SVM. This phenomenon may be explained in two ways.
On the one hand, it can be explained by the fact that the data points are
very sparse in the kernelized space or feature space (compared with the huge
dimensionality in the Gaussian kernel). Thus the plug-in estimates of the
covariance matrices may not accurately represent the data information in
this case. On the other hand, it is well-known that the kernelization will not
keep the structure information in the feature space. One direct consequence
is that maximizing the margin in the feature space does not necessarily
maximize the margin in the original space [23]. Therefore, without building some
connections between the original space and the feature space, utilizing the
structure information, e.g. covariance matrices in the feature space, does not
seem to help much in this sense. Considering these two points, one interesting
topic for the future is to impose constraints on the mapping function
so as to maintain the data topology in the kernelization process.
In the above, we did not perform the reduction on these datasets. To illustrate how the reduction algorithm decreases the computation time while maintaining the test accuracy, we apply it to the Heart-disease dataset. We perform the reduction on the training sets and keep the test sets unchanged. We repeat this process for different values of the threshold. We then plot the cross validation accuracy against the threshold, as well as the computation time against the threshold. Both curves can be seen in Fig. 4.8. From this figure, we can see that both the computation time and the test accuracy are insensitive to the threshold when it is set to small values, e.g. 0.7. Looking into the Heart-disease dataset, we find that most data points are far away from their corresponding class center in terms of the Mahalanobis distance. Thus setting the threshold to small values does not actually remove many data points. This produces relatively flat curves for both the test accuracy and the computation time in this range. As the threshold grows larger, the computation time decreases quickly as more and more data points are removed, while the test accuracy goes down slowly. When the threshold is set to 1, M4 degrades to the MPM model, so that the test accuracy of M4 equals that of MPM. This demonstrates how the proposed reduction algorithm can decrease the computation time while maintaining good performance. In practice, the threshold can be set according to the required response time.
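The discussion above relies on the Mahalanobis distance between a training point and its class center. The removal rule itself is not restated in this passage, so the following is only a minimal sketch of the distance computation, in plain Python. The function names are illustrative assumptions, the sample mean and sample covariance serve as plug-in estimates, and the covariance matrix is assumed to be non-singular.

```python
import math


def class_mean(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def class_covariance(points, mean):
    """Sample covariance matrix (divides by n - 1)."""
    n, dim = len(points), len(mean)
    cov = [[0.0] * dim for _ in range(dim)]
    for p in points:
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += (p[i] - mean[i]) * (p[j] - mean[j])
    return [[c / (n - 1) for c in row] for row in cov]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - partial) / m[i][i]
    return x


def mahalanobis(point, mean, cov):
    """sqrt((z - mean)^T cov^{-1} (z - mean)), without forming the inverse."""
    diff = [p - mu for p, mu in zip(point, mean)]
    z = solve(cov, diff)
    return math.sqrt(sum(d * v for d, v in zip(diff, z)))
```

Sorting the points of one class by `mahalanobis(p, mean, cov)` gives a quick view of how far the bulk of the class lies from its center, which is the property the Heart-disease observation above refers to.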
4.7 Summary
Local learning approaches, e.g. large margin machines, have demonstrated their advantages in machine learning and pattern recognition. However, they derive the decision boundary only in a local way. For example, the most popular large margin classifier, the Support Vector Machine, obtains the decision hyperplane by focusing on some critical local points called support vectors, while discarding all other points. On the other hand, global learning models (e.g. the Minimax Probability Machine) obtain the classifier based only on global information, i.e. the mean and covariance information in MPM, while ignoring all individual local points. In contrast, our proposed model is constructed based on both local and global views of the data. This new model is theoretically important in the sense that SVM and MPM can both be considered as its special cases. Furthermore, the optimization of M4 can be cast as a sequential Conic Programming problem, which can be solved in polynomial time.
We have provided a clear geometrical interpretation, and established detailed connections between our model and other models such as the Support Vector Machine, the Minimax Probability Machine, Fisher Discriminant Analysis, and the Minimum Error Minimax Probability Machine. We have also shown how to exploit Mercer kernels to extend our model to build nonlinear decision boundaries. In addition, we have proposed a reduction method to decrease the computation time. Experimental results on both synthetic datasets and real world benchmark datasets have demonstrated the advantages of M4 over the Support Vector Machine and the Minimax Probability Machine.
References
1. Bertsekas DP (1999) Nonlinear Programming. Belmont, MA: Athena Scientific, 2nd edition
2. Blake CL, Merz CJ (1998) Repository of machine learning databases, University of California, Irvine, http://www.ics.uci.edu/mlearn/MLRepository.html
3. Breiman L (1997) Arcing classifiers. Technical Report 460, Statistics Department, University of California
4. Fukunaga K (1990) Introduction to Statistical Pattern Recognition. San Diego, CA: Academic Press, 2nd edition
5. Huang K, Yang H, King I, Lyu MR (2004) Learning large margin classifiers locally and globally. In the 21st International Conference on Machine Learning (ICML-2004)
6. Huang K, Yang H, King I, Lyu MR, Chan L (2004) The minimum error minimax probability machine. Journal of Machine Learning Research 5:1253–1286
7. Huang K, Yang H, King I, Lyu MR, Chan L (2007) Maxi-Min Margin Machine: Learning large margin classifiers globally and locally. To appear in IEEE Trans. Neural Networks
8. Ivannov VV (1962) On linear problems which are not well-posed. Soviet Math. Docl. 3(4):981–983
9. Kohavi R (1995) A study of cross validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-1995). San Francisco, CA: Morgan Kaufmann 338–345
10. Kruk S, Wolkowicz H (2000) General nonlinear programming. In H. Wolkowicz, R. Saigal, and L. Vandenberghe, editors, Handbook of Semidefinite Programming: Theory, Algorithms, and Applications. Boston, MA: Kluwer Academic Publishers 563–575
11. Lanckriet GRG, Ghaoui LE, Bhattacharyya C, Jordan MI (2002) A robust minimax approach to classification. Journal of Machine Learning Research 3:555–582
12. Lobo M, Vandenberghe L, Boyd S, Lebret H (1998) Applications of second order cone programming. Linear Algebra and its Applications 284:193–228
13. Luntz A, Brailovsky V (1969) On estimation of characters obtained in statistical procedure of recognition (in Russian). Technicheskaya Kibernetica 3(6)
14. Marshall AW, Olkin I (1960) Multivariate Chebyshev inequalities. Annals of Mathematical Statistics 31(4):1001–1014
15. Nesterov Y, Nemirovsky A (1994) Interior point polynomial methods in convex programming: Theory and applications. Philadelphia, PA: SIAM
16. Platt J (1998) Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14
5
Extension I: BMPM for Imbalanced Learning
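$$
\begin{aligned}
\max_{\alpha,\,\beta,\,b,\,w \neq 0} \quad & \alpha \\
\text{s.t.} \quad & \inf_{x \sim (\bar{x},\, \Sigma_x)} \Pr\{w^T x \ge b\} \ge \alpha, && (5.1) \\
& \inf_{y \sim (\bar{y},\, \Sigma_y)} \Pr\{w^T y \le b\} \ge \beta, && (5.2) \\
& \beta \ge \beta_0. && (5.3)
\end{aligned}
$$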
Here $\alpha$ means the lower bound of the probability (accuracy) for the classification of future cases of the class x with respect to all distributions with the mean and covariance $(\bar{x}, \Sigma_x)$; in other words, $\alpha$ is the worst-case accuracy for the class x. Similarly, $\beta$ is the lower bound of the accuracy of the class y. This optimization maximizes the accuracy (the probability $\alpha$) for the biased class x while simultaneously maintaining the class y's accuracy at an acceptable level $\beta_0$ by setting a lower bound as in Eq.(5.3). In comparison, the Minimax Probability Machine (MPM) in [16, 17] considers the balanced dataset; therefore, it makes $\alpha$ equal to $\beta$.
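To make the role of the worst-case accuracy concrete, the sketch below evaluates it for a fixed hyperplane. It uses the closed form known from the MPM literature [16, 17]: when $w^T \bar{x} > b$, the infimum of $\Pr\{w^T x \ge b\}$ over all distributions with mean $\bar{x}$ and covariance $\Sigma_x$ equals $d^2/(1+d^2)$ with $d = (w^T \bar{x} - b)/\sqrt{w^T \Sigma_x w}$, and it is 0 when the mean lies on the wrong side. The function name and argument layout are assumptions made for this illustration; the code only scores a given hyperplane and does not solve the BMPM optimization.

```python
def worst_case_accuracy(w, b, mean, cov):
    """Lower bound on Pr{w^T x >= b} over all distributions of x that share
    the given mean vector and covariance matrix (lists of floats)."""
    margin = sum(wi * mi for wi, mi in zip(w, mean)) - b
    if margin <= 0.0:
        # The class mean is not strictly on the positive side: bound is 0.
        return 0.0
    dim = len(w)
    variance = sum(w[i] * cov[i][j] * w[j]
                   for i in range(dim) for j in range(dim))
    if variance == 0.0:
        # No spread along w: every admissible distribution is classified correctly.
        return 1.0
    d_squared = margin * margin / variance
    return d_squared / (1.0 + d_squared)
```

For instance, with `w = [1.0, 0.0]`, `b = 0.0`, mean `[2.0, 0.0]` and identity covariance, the margin is 2 and the projected variance is 1, so the bound is 4/5. The bound for the class y in Eq.(5.2) follows from the same function applied to `-w` and `-b`.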
This optimization setting seems to be more useful for incorporating a bias into classification for imbalanced learning problems. A typical example is the epidemic disease diagnosis problem, which is usually an imbalanced classification problem as well: the ill cases are usually much fewer than the healthy cases. However, misclassification of the ill class results in more serious consequences than misclassification of the healthy class. Thus an unequal treatment of different classes is obviously necessary.
We summarize the advantages of our biased model in the following. First, this method provides a different treatment of different classes, i.e. the hyperplane $w^T z = b$ given by the solution of this optimization favors the classification of the important class x over the less important class y. Second, given reliable mean and covariance matrices, the derived decision hyperplane is directly associated with two real accuracy indicators, i.e. $\alpha$ and $\beta$, for each class. Thus, by varying the lower bound of $\beta$, i.e. $\beta_0$, and deriving the corresponding classifier, we can quantitatively incorporate a bias into the classification. Third, this model is distribution-free. With no distribution assumption on the data, the derived hyperplane seems to be more general and valid than a large family of classifiers, namely the generative classifiers [10, 12], including the Naive Bayesian classifier [18], which have to make specific distribution assumptions. Fourth, as shown shortly in Section 5.3, either we can simply modify this BMPM optimization to automatically search for the best $\beta_0$ in terms of some standard criteria, or, slightly differently from the current setting, we can quantitatively generate the trade-off curve between the accuracies on different classes and leave the task of choosing the best $\beta_0$ to the users. Finally, BMPM does not trade the above advantages for efficiency. It is shortly shown that the optimization of BMPM can be cast as a Fractional Programming (FP) problem and thus can be solved efficiently. In short, with these important features, BMPM appears to offer a more direct and rigorous scheme to handle biased classification tasks, especially imbalanced classification, where the importance or cost of each class is unequal.

1 Note that, for ease of explanation, the model description is in a slightly different but essentially the same form as the one introduced in Chapter 3.
$$ \text{Cost} = F_p \cdot C_{F_p} + F_n \cdot C_{F_n}, \tag{5.4} $$
where $F_p$ is the number of false positives, $C_{F_p}$ is the cost of a false positive, $F_n$ is the number of false negatives, and $C_{F_n}$ is the cost of a false negative. However, because the cost of misclassification is generally unknown in real cases, the usage of this measure is somewhat restricted. Considering this point, some researchers introduced ROC analysis [25, 26, 34]. This criterion plots a so-called ROC curve to visualize the tradeoff between the false positive rate and the true positive rate and leaves the selection of a specific tradeoff to the practitioners. Fig. 5.1 illustrates an artificially generated ROC curve. It has been suggested that the area beneath an ROC curve can be used as a measure of accuracy in many applications [30, 33]. Thus, a good classifier for imbalanced learning should have a larger area.
Based on the above review, in this chapter we will focus on using the criterion of MS and the ROC curve analysis to evaluate the classifiers.
5.3.2 BMPM for Maximizing the Sum of the Accuracies
In the following, we first modify the formulation of BMPM to maximize the sum of the accuracies for two classes. Next, we analyze the solvability of the modified version. Finally, we present the optimization method.
5.3.2.1 Model Modification
When using BMPM for the criterion of MS, we can modify the formulation of BMPM as follows:
$$
\begin{aligned}
\max_{\alpha,\,\beta,\,b,\,w \neq 0} \quad & (\alpha + \beta), && (5.5) \\
\text{s.t.} \quad & \inf_{x \sim (\bar{x},\, \Sigma_x)} \Pr\{w^T x \ge b\} \ge \alpha, && (5.6) \\
& \inf_{y \sim (\bar{y},\, \Sigma_y)} \Pr\{w^T y \le b\} \ge \beta. && (5.7)
\end{aligned}
$$
The above formulation directly maximizes the sum of the lower bounds of the accuracies so as to maximize the sum of the accuracies. In comparison, to achieve the maximum sum of the accuracies, some other approaches, e.g. sampling methods or weight-adapting methods, have to search for the best sampling proportion or the best weights by trial, which is in general very time-consuming. Since the above optimization is in fact nearly the same as the Minimum Error Minimax Probability Machine, it can be similarly solved by the Sequential Biased Minimax Probability Machine optimization method introduced in Chapter 3. We thus do not elaborate on it here.
5.3.3 BMPM for ROC Analysis
It is straightforward to apply the BMPM model to plot the ROC curve, since the lower bounds $\alpha$ and $\beta$ directly and quantitatively control the accuracies for the two classes. We only need to vary the acceptable level for $\beta$, namely $\beta_0$, from 0 to 1 to obtain a sequence of trade-offs between the accuracies of the important class and the negative class. We emphasize again that, since $\beta_0$ represents the lower bound of the accuracy of the less important class, varying $\beta_0$ provides a direct and quantitative way to move the decision plane with different trade-offs. Directly associating accuracies with the movement of the hyperplane while assuming no distribution is one of the advantages of BMPM over other methods that adapt weights or thresholds.
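The effect of sweeping the acceptable level for the less important class can be illustrated with a deliberately simplified sketch. It assumes the direction $w$ is held fixed, so that only the projected means and standard deviations of the two classes matter (called `m_x`, `s_x`, `m_y`, `s_y` here, all hypothetical names), whereas BMPM itself also optimizes $w$. It uses the standard MPM equivalence from [16, 17]: a worst-case accuracy of at least $p$ for the class y holds exactly when $b - m_y \ge \kappa(p)\, s_y$ with $\kappa(p) = \sqrt{p/(1-p)}$.

```python
import math


def kappa(p):
    """kappa(p) = sqrt(p / (1 - p)) for a probability level 0 <= p < 1."""
    return math.sqrt(p / (1.0 - p))


def tradeoff_curve(m_x, s_x, m_y, s_y, levels):
    """For each acceptable level for class y, place the threshold b as low as
    the class-y constraint allows and report the resulting worst-case
    accuracy for class x.

    m_x, s_x: projected mean and standard deviation of class x (w fixed).
    m_y, s_y: projected mean and standard deviation of class y (w fixed).
    Returns a list of (level, b, worst_case_accuracy_x) tuples.
    """
    curve = []
    for level in levels:
        b = m_y + kappa(level) * s_y          # smallest b meeting the y constraint
        margin = m_x - b
        d = margin / s_x if margin > 0.0 else 0.0
        curve.append((level, b, d * d / (1.0 + d * d)))
    return curve
```

Raising the level for the class y pushes the threshold towards the class x and lowers its worst-case accuracy, which is the trade-off that the ROC curve visualizes.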
Fig. 5.2. A toy example to illustrate BMPM. Data of the class x is plotted
as +s, and data of class y as s. The gray area represents the classication
region of the class x, while the area outside the gray region is classied as
the class y
as the class y plotted as s. It is clear to observe that the lower bound $\beta_0$ directly controls the accuracy of the class y. More specifically, when $\beta_0$ is set to smaller values such as 10.00%, 60.00% and 95.00%, the boundary is biased towards the class x. When $\beta_0$ is set to larger values such as 99.00%, the classification is biased towards the class y. Moreover, Table 5.1 demonstrates that the lower bounds $\beta_0$ and $\alpha$ can serve as accuracy indicators. It is observed that these lower bounds hold well, i.e. the corresponding accuracies are slightly higher than the lower bounds, except in the case when $\beta_0 = 0.95$. The exception, i.e. that the value of $\alpha$, 99.16%, is greater than the real accuracy 93.33%, is understandable given the relatively small number of training samples: one single misclassification will influence the classification results significantly. This toy example demonstrates that by changing $\beta_0$, BMPM provides an elegant and direct way to incorporate the bias into the classification.

Table 5.1. Lower bounds and accuracies (%) for the toy example
$\beta_0$   Accuracy of class y   $\alpha$   Accuracy of class x
10.00       13.85                 100.00     100.00
60.00       63.08                 100.00     100.00
95.00       95.38                 99.16      93.33
99.00       100.00                81.94      86.67
5.4.2 Evaluations on Real World Imbalanced Datasets
In this section, we evaluate our novel BMPM model in comparison with three competitive classification methods, namely the Naive Bayesian classifier, the k-Nearest Neighbor method and the decision tree C4.5, on two real world imbalanced datasets, the recidivism dataset and the rooftop dataset. Before going into the experimental details, we first introduce these three techniques and adapt them to learn from imbalanced datasets according to previous research results [20, 26].
5.4.2.1 Modifying Three Learning Techniques
We investigate and modify three learning techniques, the Naive Bayesian classifier, the k-Nearest Neighbor method, and the decision tree C4.5, in the following.
The Naive Bayesian classifier [11, 18] is based on a very simple assumption, i.e. the attributes are conditionally independent of each other given the class variable. The decision in a two-category prediction task is made according to the posterior probability p(C|z), where C is the class variable and z represents the observation. When p(C1|z) ≥ 0.5, or equivalently when the more convenient rule p(C1)p(z|C1) ≥ p(C2)p(z|C2) is satisfied, z is classified into C1; otherwise, it is judged as C2. Even with the strong conditional independence assumption, the Naive Bayesian classifier demonstrates surprisingly good performance when compared with state-of-the-art classifiers [8, 19] such as Support Vector Machines [35] and C4.5 in many domains. By simply introducing a parameter $\lambda$ into the decision rule, i.e. p(C1)p(z|C1) ≥ $\lambda$ p(C2)p(z|C2), Naive Bayesian classifiers can be adapted to imbalanced learning. For example, specifying $\lambda$ < 1 imposes a bias towards the C1 class, whereas specifying $\lambda$ > 1 imposes a bias towards the C2 class.
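The biased decision rule for the Naive Bayesian classifier can be sketched in a few lines. This is an illustration only: the function names and the argument `bias` (the parameter multiplying the C2 side of the rule) are assumptions, and the class-conditional likelihoods are taken as given rather than estimated from data.

```python
def naive_likelihood(attribute_probs):
    """p(z | C) under the conditional independence assumption: the product of
    the per-attribute probabilities p(z_k | C)."""
    product = 1.0
    for p in attribute_probs:
        product *= p
    return product


def biased_bayes_decision(prior1, lik1, prior2, lik2, bias=1.0):
    """Return 1 if p(C1) p(z|C1) >= bias * p(C2) p(z|C2), otherwise 2.

    bias = 1 gives the ordinary rule; bias < 1 favors C1, bias > 1 favors C2.
    """
    return 1 if prior1 * lik1 >= bias * prior2 * lik2 else 2
```

With priors 0.3 and 0.7 and likelihoods 0.2 and 0.1, the unbiased rule compares 0.06 with 0.07 and chooses C2, whereas `bias=0.5` compares 0.06 with 0.035 and chooses C1. Sweeping `bias` over a range of values yields the sequence of operating points used for a ROC curve.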
In the k-Nearest Neighbor classification [1], the k data points closest to the query point are selected according to some distance measure, e.g. the Euclidean distance. The query point is then labeled as the most frequent class among the chosen k points. Although this method is very simple and may suffer from difficulties in high dimensions, it achieves satisfactory performance in many real domains. Following [26], we alter the distance measure $\delta_j$ for the class $C_j$ to handle imbalanced learning tasks according to Eq.(5.8):
$$ \delta_j = d_E(z, z_j) - \tau_j\, d_E(z, z_j), \tag{5.8} $$
where $z_j$ is the closest point from class $C_j$ to the query point, and $d_E(z, z_j)$ represents the Euclidean distance measure. Similar to the Naive Bayesian classifier, by modifying $\tau_j$ the Nearest Neighbor method can build biased classifiers.
C4.5 is an algorithm introduced by Quinlan for inducing classification models, also called decision trees, from data [31]. By selecting attributes according to the gain ratio criterion, an information measure of homogeneity, C4.5 builds up a decision tree where each path from the root to a leaf represents a specific classification rule. We adapt C4.5 to learn from imbalanced datasets by a method similar to [26], i.e. by changing the prior probability to bias the classification.
5.4.2.2 Evaluations on the Recidivism Dataset
The recidivism dataset was obtained from a cohort of releases of the North Carolina prison system during the period from July 1, 1977 to June 30, 1978. There are 4,618 individuals in total in this dataset, including a training set with 1,540 individuals and a test set with 3,078 individuals. In the training set, 570 (37.0%) individuals were recidivists and 970 (63.0%) were not. In the test set, 1,151 individuals were recidivists and 1,927 were not. Although this dataset is not skewed as severely as other reported datasets, for example, the fog dataset [28] and the rooftop dataset used in the next subsection, it is sufficient for evaluating the performance of imbalanced learning [26].
We use the same processing method as [32] to select and scale the nine attributes that appear in Table 5.2, while six other attributes are dropped because they are not significant at the 5% level.
We compare the performance of our proposed Biased Minimax Probability Machine model, in both the linear (BMPML) and the Gaussian kernel setting (BMPMG), with the Naive Bayesian classifier, C4.5 and the k-Nearest Neighbor method. These methods are adapted to imbalanced learning according to the methods introduced in the previous section. We run the k-NN method for k = 1, 3, 5, . . . , 21, but present only the best three results for brevity. The width parameter of the Gaussian kernel is tuned via cross validation [13].
Table 5.2. Attributes of the recidivism dataset: TSERVED, AGE, PRIORS, WHITE, FELON, LCHY, JUNKY, PROPTY, MALE

We first present the experimental results based on the MS criterion in Table 5.3. For ease of comparison, we show the average of the accuracies for each class when each classifier attains the point of the maximum sum. BMPML achieves an average accuracy of 0.6391 and BMPMG achieves an average accuracy of 0.6490, while the highest average accuracy among the other classifiers is 0.6272, given by NB. Therefore, on this dataset, BMPML and BMPMG outperform the other methods in terms of the MS criterion.
Table 5.3. Performance on a recidivism prediction task based on the MS criterion
Method      True negative rate   True positive rate   (True positive rate + true negative rate)/2
NB          0.6177               0.6377               0.6272
k-NN(9)     0.6255               0.5464               0.5860
k-NN(11)    0.6238               0.5542               0.5890
k-NN(13)    0.5569               0.6201               0.5885
C4.5        0.7405               0.4900               0.6153
BMPML       0.7037               0.5745               0.6391
BMPMG       0.7203               0.5778               0.6490
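The last column of Table 5.3 is the mean of the two per-class rates. A minimal sketch of how such a figure could be obtained from confusion counts is given below; the function names and the count arguments are assumptions for illustration, and no raw counts from the experiments are implied.

```python
def per_class_rates(tp, fn, tn, fp):
    """Return (true negative rate, true positive rate) from confusion counts.

    tp, fn: positive examples classified correctly / incorrectly.
    tn, fp: negative examples classified correctly / incorrectly.
    """
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return true_negative_rate, true_positive_rate


def ms_average(true_negative_rate, true_positive_rate):
    """Half of the sum of the two per-class accuracies (the MS criterion
    reported as an average)."""
    return (true_negative_rate + true_positive_rate) / 2.0
```

Applied to the BMPML row, `ms_average(0.7037, 0.5745)` reproduces the tabulated average 0.6391 up to floating-point rounding.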
Fig. 5.3. ROC curves for the recidivism dataset. Subfigure (a) shows the full range of the ROC curve, while (b) shows a critical portion of the ROC curve, which is of more interest in real applications. Both figures demonstrate the superiority of the BMPM model, since the curves of BMPML and BMPMG cover those of other models in most parts and thus have a larger area

Let us next present the experimental results based on the ROC analysis. By setting the thresholds or costs by trial for NB, k-NN, and C4.5, ROC curves are generated whose points are as evenly distributed along their length as possible. As discussed in [26], although this generation method may increase the running time for some methods, e.g. k-NN, it works well for C4.5 and NB and is sufficient to evaluate the performance of imbalanced learning. For the BMPM model, since the lower bound $\beta_0$ serves as the accuracy indicator, we simply vary it from 0 to 1 to generate the corresponding ROC curve. The ROC curves are shown in Fig. 5.3(a). As seen in this figure, the performances of BMPML and BMPMG are once again superior to those of the other methods, since their ROC curves cover those of the other models in most parts. To quantitatively demonstrate the difference, in Table 5.4 we also show the areas beneath the ROC curves, approximated using the trapezoid rule. BMPML and BMPMG show a consistent superiority to NB, which is the best of the other three methods.
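The trapezoid-rule approximation of the area beneath a ROC curve can be sketched as follows. The function name and the input format, a list of (false positive rate, true positive rate) operating points, are assumptions for this example; the points should include the end points (0, 0) and (1, 1) if the full-range area is wanted.

```python
def roc_area(points):
    """Approximate the area beneath a ROC curve with the trapezoid rule.

    points: iterable of (false_positive_rate, true_positive_rate) pairs,
    in any order; they are sorted by false positive rate before integration.
    """
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        # Area of the trapezoid between two neighboring operating points.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A classifier that guesses at random traces the diagonal from (0, 0) to (1, 1), for which the routine returns 0.5; curves that bow towards the upper left corner give larger areas.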
In addition, in real applications not all portions of the ROC curve are of great interest [27]. Usually, those with a small false positive rate and a high true positive rate are of more interest and importance [36]. We thus show in particular the portion of the ROC curve where the false positive rate FP ∈ [0, 0.5] and the true positive rate TP ∈ [0.5, 1]. As shown in Fig. 5.3(b), in this range, the superiority of BMPML and BMPMG is more obvious than in the whole ROC curve analysis. This again demonstrates our model's advantages over other methods.

Table 5.4. Areas beneath the ROC curves for the recidivism dataset
Method      Area
NB          0.6646
k-NN(11)    0.6155
k-NN(13)    0.6189
k-NN(17)    0.6148
C4.5        0.6383
BMPML       0.6842
BMPMG       0.6798
5.4.2.3 Evaluations on the Rooftop Dataset
The rooftop dataset consists of 17,829 overhead images of Fort Hood, Texas, a military base, collected as part of the RADIUS project [7]. Depending on whether they are buildings (with a detected rooftop) or not, 781 images in this dataset are labeled as positive examples while 17,048 images are labeled as negative examples. This is clearly a severely skewed dataset. According to [7, 26], these images were taken from two different viewpoints, i.e. a nadir aspect and an oblique aspect, and covered three different areas. Following [21, 26], we represent each of these images by nine continuous attributes extracted by various image analysis techniques. The detailed information about this dataset is summarized in Tables 5.5 and 5.6.
Table 5.5. Description of images in the rooftop dataset
Image size     Aspect     #Positive   #Negative
2055 × 375     Nadir      71          2645
1803 × 429     Oblique    74          3349
670 × 645      Nadir      197         982
704 × 568      Oblique    238         1955
1322 × 642     Nadir      87          3722
1534 × 705     Oblique    114         4395
We randomly split the rooftop data into a training set with 60% of the data and a test set with the remaining 40%. We then construct classifiers from the imbalanced training set and perform evaluations on the test set. We repeat this procedure ten times and use the average of the results as the performance metric. In this setup, we compare our BMPM with three other approaches, i.e. NB, C4.5 and k-NN. As in the case of the recidivism dataset, NB, C4.5 and k-NN are modified to handle imbalanced data. The width parameter is again chosen by cross validation. Moreover, we again run k-NN with k = 1, 3, 5, ..., 21 and present the best three for brevity.
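The repeated random-split protocol described above can be sketched as follows. The function name, the `evaluate` callback (standing in for "train on these indices, score on those"), and the use of the sample standard deviation as the reported spread are assumptions for illustration; the text does not state how the spread of the ten runs is computed.

```python
import random
import statistics


def repeated_holdout(n_samples, evaluate, train_fraction=0.6, repeats=10, seed=0):
    """Repeat a random train/test split and summarize the scores.

    evaluate(train_indices, test_indices) is a hypothetical callback that
    returns one score per split. Returns (mean, sample standard deviation).
    """
    rng = random.Random(seed)
    cut = int(round(train_fraction * n_samples))
    scores = []
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        scores.append(evaluate(indices[:cut], indices[cut:]))
    return statistics.mean(scores), statistics.stdev(scores)
```

Fixing `seed` makes the ten partitions reproducible, so that every method being compared can be scored on the same splits.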
The results are summarized in Table 5.7 based on the MS criterion, and in Fig. 5.4 and Table 5.8 based on the ROC analysis. As is clearly observed, for both criteria, the BMPM method demonstrates its superiority to the other methods, since it has higher sums of the accuracies and larger areas under the ROC curves. As for the recidivism dataset, we also plot the more critical portion of the ROC curve in Fig. 5.4(b), where the predominance of BMPML and BMPMG is even more obvious. To evaluate the performance more reliably, we perform a significance test based on both LabMRMC [5, 24] and a t-test. The analysis shows that the accuracies of BMPML and BMPMG are significantly different from those of the other methods at the 0.05 significance level, both in terms of the MS criterion and the ROC curve criterion.

Table 5.7. Performance on the rooftop dataset based on the MS criterion
Method      True negative rate   True positive rate   (True positive rate + true negative rate)/2
BMPML       0.8015 ± 0.0058      0.8231 ± 0.0063      0.8123 ± 0.0060
BMPMG       0.7997 ± 0.0087      0.8405 ± 0.0100      0.8201 ± 0.0091
k-NN(7)     0.7510 ± 0.0055      0.8069 ± 0.0062      0.7789 ± 0.0052
k-NN(13)    0.7409 ± 0.0051      0.8140 ± 0.0083      0.7774 ± 0.0061
k-NN(15)    0.7433 ± 0.0067      0.8211 ± 0.0072      0.7822 ± 0.0072
NB          0.7969 ± 0.0043      0.8177 ± 0.0080      0.8073 ± 0.0066
C4.5        0.8176 ± 0.0040      0.7942 ± 0.0063      0.8059 ± 0.0051
Fig. 5.4. ROC curves for the rooftop dataset. We ran each method by randomly partitioning the dataset into a training set (60%) and a test set (40%). The evaluations were iterated 10 times. We then averaged the true positive rate and false positive rate to generate the ROC curves. Subfigure (a) shows the full range of the ROC curve, while (b) shows a critical portion of the ROC curve, which is of more interest in real applications. Both figures demonstrate the superiority of the BMPML and BMPMG models to other models, since the curves of BMPML and BMPMG cover those of other models in most parts and thus have a larger area

Table 5.8. Areas beneath the ROC curves for the rooftop dataset
Method      Area
BMPML       0.8791 ± 0.0061
BMPMG       0.8819 ± 0.0087
k-NN(9)     0.8601 ± 0.0091
k-NN(11)    0.8569 ± 0.0058
k-NN(15)    0.8582 ± 0.0063
NB          0.8678 ± 0.0060
C4.5        0.8744 ± 0.0062
Method      Specificity        Sensitivity        (Specificity+Sensitivity)/2
BMPML       0.9684 ± 0.0029    0.9872 ± 0.0015    0.9778 ± 0.0021
BMPMG       0.9612 ± 0.0018    0.9915 ± 0.0011    0.9764 ± 0.0016
k-NN(11)    0.9900 ± 0.0047    0.9620 ± 0.0034    0.9760 ± 0.0029
k-NN(17)    0.9862 ± 0.0081    0.9664 ± 0.0058    0.9762 ± 0.0050
k-NN(7)     0.9721 ± 0.0071    0.9752 ± 0.0049    0.9737 ± 0.0058
NB          0.9366 ± 0.0059    0.9719 ± 0.0049    0.9543 ± 0.0051
C4.5        0.9378 ± 0.0074    0.9582 ± 0.0067    0.9480 ± 0.0072

Method      Specificity        Sensitivity        (Specificity+Sensitivity)/2
BMPML       0.8549 ± 0.0042    0.8158 ± 0.0013    0.8354 ± 0.0035
BMPMG       0.8403 ± 0.0053    0.8572 ± 0.0017    0.8488 ± 0.0026
k-NN(17)    0.7654 ± 0.0029    0.8837 ± 0.0018    0.8246 ± 0.0027
k-NN(7)     0.7754 ± 0.0038    0.8844 ± 0.0042    0.8299 ± 0.0037
k-NN(15)    0.7512 ± 0.0028    0.8653 ± 0.0037    0.8082 ± 0.0036
NB          0.7862 ± 0.0052    0.8024 ± 0.0031    0.7943 ± 0.0040
C4.5        0.8831 ± 0.0022    0.7065 ± 0.0018    0.7948 ± 0.0021
the Gaussian kernel, whereas the k-NN with k = 11 forms a curve with a
smaller area equal to 0.9908, the best result of the k-NN, NB and C4.5. For
the Heart disease dataset, the BMPM shows a curve with an area of 0.8814
in the linear setting and a curve with an area of 0.8932 in the Gaussian kernel
setting. These two areas are both greater than those of the other methods,
i.e. the k-NN classier, NB and C4.5. In summary, the evaluations based on
the area of the ROC curve quantitatively demonstrate the superiority of our
BMPM model for both datasets.
In addition, as illustrated in Fig. 5.5(b) and Fig. 5.6(b), we show the
critical portion of Fig. 5.5(a) and Fig. 5.6(a) respectively when the false
positive rate is in the range of 0.0 to 0.5 and the true positive rate is in
the range of 0.5 to 1.0. In this critical region, most parts of the ROC curves
of BMPM cover the corresponding curves of other models in both datasets,
which again demonstrates the superiority of the BMPM model.
Heart
0.9953 0.0018
0.8814 0.0056
0.8932 0.0043
BMPML
0.8689 0.0050
k-NN(7)
NB
C4.5
0.9762 0.0120
0.8301 0.0038
Fig. 5.5. ROC curves for the breast-cancer dataset. The ROC
curves of BMPML and BMPMG dominate those of other models
and BMPMG yields the largest area under the ROC curve
Fig. 5.6. ROC curves for the heart disease dataset. The ROC
curves of BMPML and BMPMG dominate those of other models
and BMPMG yields the largest area under the ROC curve
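$$
\begin{aligned}
\max_{\alpha,\,\beta,\,b,\,w \neq 0} \quad & (\alpha K_x + \beta K_y), \\
\text{s.t.} \quad & \inf_{x \sim (\bar{x},\, \Sigma_x)} \Pr\{w^T x \ge b\} \ge \alpha, \\
& \inf_{y \sim (\bar{y},\, \Sigma_y)} \Pr\{w^T y \le b\} \ge \beta.
\end{aligned}
$$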
5.6 Summary
In this chapter, we have applied a novel model named the Biased Minimax Probability Machine to the task of learning from imbalanced datasets. Given reliable estimates of the mean and covariance of the data, this model constructs the classification boundary by directly controlling the lower bound of the real accuracy and thus provides a systematic and rigorous treatment of skewed data. We have evaluated the BMPM model on two real world imbalanced datasets and two disease datasets in terms of two criteria. Under both criteria, its performance is shown to be the best when compared with other competitive methods such as the Naive Bayesian classifier, the k-Nearest Neighbor method, and the decision tree classifier C4.5.
References
1. Aha D, Kibler D, Albert M (1991) Instance-based learning algorithms. Machine Learning 6:37–66
2. Bradley A (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30(7):1145–1159
3. Cardie C, Howe N (1997) Improving minority class prediction using case-specific feature weights. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML-1997). San Francisco, CA: Morgan Kaufmann 57–65
4. Chawla N, Bowyer K, Hall L, Kegelmeyer W (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357
5. Dorfman K, Berbaum D, Metz C (1992) Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Investigative Radiology 27:723–731
6. Dori D, Liu W (1999) Sparse pixel vectorization: An algorithm and its performance evaluation. IEEE Trans. Pattern Analysis and Machine Intelligence 21:202–215
7. Firschein O, Strat T (1996) RADIUS: Image understanding for imagery intelligence. San Francisco, CA: Morgan Kaufmann
8. Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Machine Learning 29:131–161
9. Grzymala-Busse JW, Goodwin LK, Zhang X (2003) Increasing sensitivity of preterm birth by changing rule strengths. Pattern Recognition Letters 24:903–910
10. Huang K, King I, Lyu MR (2003) Discriminative training of Bayesian Chow-Liu tree multinet classifiers. In Proceedings of International Joint Conference on Neural Network (IJCNN-2003), Oregon, Portland, U.S.A. 1:484–488
11. Huang K, King I, Lyu MR (2003) Finite mixture model of bound semi-naive Bayesian network classifier. In Proceedings of the International Conference on Artificial Neural Networks (ICANN-2003), Lecture Notes in Artificial Intelligence, Long paper. Heidelberg: Springer-Verlag 2714:115–122
12. Jaakkola TS, Haussler D (1998) Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems (NIPS)
13. Kohavi R (1995) A study of cross validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-1995). San Francisco, CA: Morgan Kaufmann 338–345
14. Kubat M, Holte R, Matwin S (1998) Machine learning for the detection of oil spills in satellite radar images. Machine Learning 30(2-3):195–215
15. Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML-1997). San Francisco, CA: Morgan Kaufmann 179–186
16. Lanckriet GRG, Ghaoui LE, Bhattacharyya C, Jordan MI (2001) Minimax probability machine. In Advances in Neural Information Processing Systems (NIPS)
17. Lanckriet GRG, Ghaoui LE, Bhattacharyya C, Jordan MI (2002) A robust minimax approach to classification. Journal of Machine Learning Research 3:555–582
18. Langley P, Iba W, Thompson K (1992) An analysis of Bayesian classifiers. In Proceedings of National Conference on Artificial Intelligence 223–228
19. Lerner B, Lawrence ND (2001) A comparison of state-of-the-art classification techniques with application to cytogenetics. Neural Computing and Applications 10(1):39–47
20. Lewis D, Catlett J (1994) Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the Eleventh International Conference on Machine Learning (ICML-1994). San Francisco, CA: Morgan Kaufmann 148–156
21. Lin C, Nevatia R (1998) Building detection and description from a single intensity image. Computer Vision and Image Understanding 72:101–121
22. Ling C, Li C (1998) Data mining for direct marketing: problems and solutions. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-1998). Menlo Park, CA: AAAI Press 73–79
23. Liu W, Dori D (1997) A protocol for performance evaluation of line detection algorithms. Machine Vision and Application 9:240–250
24. Maloof MA (2002) On machine learning, ROC analysis, statistical tests of significance. In Proceedings of the Sixteenth International Conference on Pattern Recognition. Los Alamitos, CA: IEEE Press 204–207
25. Maloof MA (2003) Learning when data sets are imbalanced and when costs are unequal and unknown. In Proceedings of International Conference on Machine Learning (ICML-2003)
26. Maloof MA, Langley P, Binford TO, Nevatia R, Sage S (2003) Improved rooftop detection in aerial images with machine learning. Machine Learning 53:157–191
27. McClish D (1989) Analyzing a portion of the ROC curve. Medical Decision Making 9(3):190–195
28. Nugroho AS, Kuroyanagi S, Iwata A (2002) A solution for imbalanced training sets problem by combnet and its application on fog forecasting. IEICE TRANS. INF. & SYST, E85-D(7)
29. Provost F (2000) Learning from imbalanced data sets. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI 2000)
30. Provost F, Fawcett T (1997) Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press 43–48
31. Quinlan JR (1993) C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers
32. Schmidt P, Witte A (1988) Predicting Recidivism Using Survival Models. New York, NY: Springer-Verlag
33. Swets J (1988) Measuring the accuracy of diagnostic systems. Science 240:1285–1293
34. Swets J, Pickett R (1982) Evaluation of Diagnostic Systems: Methods from
Signal Detection Theory. New York, NY: Springer-Verlag
35. Vapnik VN (1999) The Nature of Statistical Learning Theory. New York, NY:
Springer-Verlag, 2nd edition
36. Woods K, Kegelmeyer Jr WP, Bowyer K (1997) Combination of multiple classiers using local accuracy estimates. IEEE Tansactions on Pattern Analysis
and Machine Intelligence 19(4):405410
6
Extension II: A Regression Model from M$^4$
In this chapter, we present a novel regression model that is directly motivated by the Maxi-Min Margin Machine (M$^4$) model described in Chapter 4.
Regression is one of the problems in supervised learning. The objective is to learn a model from a given dataset, $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, and then, based on the learned model, to make accurate predictions of $y$ for future values of $x$. Support Vector Regression (SVR), a successful method for this problem, has good generalization ability [20, 17, 8, 6]. The standard SVR adopts the $\ell_2$-norm to control the functional complexity and chooses an $\varepsilon$-insensitive loss function with a fixed tube (margin) to measure the empirical risk. By introducing the $\ell_2$-norm, the optimization problem in SVR can be transformed into a quadratic programming problem. On the other hand, the $\varepsilon$-tube can tolerate noise in the data, and fixing the tube has the advantage of simplicity. These settings are global in nature and are effective in common applications, but they lack the flexibility to capture the local trend in some applications. For example, in stock markets, the data are highly volatile and the associated variance of noise varies over time. In such cases, a fixed tube cannot capture the local trend of the data and cannot tolerate the noise adaptively.
One typical illustration can be seen in Fig. 6.1. In this figure, the noise in the data grows as the $x$ value becomes larger, and SVR cannot handle this flexibly. As shown in Fig. 6.1(a), with a fixed $\varepsilon$-margin (set to 0.04), SVR considers the data globally and equally: the derived approximating function deviates from the actual data trend. On the other hand, as illustrated in Fig. 6.1(b), if we take the local volatility of the data into account by adaptively and automatically setting a small margin in low-volatility regions and a larger margin in high-volatility regions, the resulting approximating function (the solid line in Fig. 6.1(b)) is more suitable and reasonable.
To address these problems, we propose the Local Support Vector Regression (LSVR) model. We will show that, by considering the local data trend, our model provides a systematic and automatic scheme to adapt the margin locally and flexibly. Moreover, we will demonstrate that the LSVR model contains special cases with a physical meaning very similar to that of the standard SVR. Another critical feature of our model is that the associated optimization of LSVR can be cast as a Second Order Cone Programming (SOCP) problem, which can be solved efficiently in polynomial time [11]. The margin setting in the LSVR model is different from that in our previous work [21]. Concretely, the tube here is adapted directly based on the functional complexity and the local trend of the data. This provides a more systematic and more rigorous way to moderate the margin automatically. The model can be seen as an extension of M$^4$ to regression. In M$^4$, the main purpose is to build a classification boundary between different classes, while in LSVR the goal is to model a function approximating the data. Therefore, M$^4$ considers different data trends for different classes, while LSVR focuses on employing different data trends in different data regions, which is more valuable within the framework of regression tasks.
Fig. 6.1. Illustration of the $\varepsilon$-insensitive loss function with fixed and non-fixed margins in the feature space. In (b), a non-fixed margin setting is more reasonable. It can moderate the effect of the noise by enlarging (shrinking) the margin width in the local area with large (small) variance of noise
The rest of this chapter is organized as follows. The linear LSVR model and its theoretical background are presented in Section 6.1. In Section 6.2, we demonstrate how the standard SVR can be considered a special case of our proposed model. In Section 6.3, we show the link between the LSVR model and the general large margin classifier M$^4$. The kernelized LSVR is derived by utilizing Mercer's kernel in Section 6.5. Section 6.6 provides an additional interpretation of how the complexity of the LSVR model is controlled. Section 6.7 presents experiments on both synthetic and real data. The chapter is concluded in Section 6.8.
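$$f(x) = w^T x + b, \qquad w, x \in \mathbb{R}^d,\ b \in \mathbb{R}. \tag{6.1}$$
$$\min_{w,\,b,\,\xi_i,\,\xi_i^*}\ \frac{1}{N}\sum_{i=1}^{N}\sqrt{w^T \Sigma_i w} + C\sum_{i=1}^{N}(\xi_i + \xi_i^*), \tag{6.2}$$
$$\begin{aligned}
\text{s.t.}\quad & y_i - (w^T x_i + b) \le \varepsilon\sqrt{w^T \Sigma_i w} + \xi_i, \\
& (w^T x_i + b) - y_i \le \varepsilon\sqrt{w^T \Sigma_i w} + \xi_i^*, \\
& \xi_i \ge 0,\ \xi_i^* \ge 0,\quad i = 1, \ldots, N,
\end{aligned} \tag{6.3}$$
where $\xi_i$ and $\xi_i^*$ are the corresponding up-side and down-side errors at the $i$-th point, respectively, $\varepsilon$ is a positive constant, and $\Sigma_i$ is the covariance matrix formed by the $i$-th data point and those data points close to it.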
$$\frac{1}{2k+1}\sum_{j=-k}^{k}(y_{i+j} - \bar{y}_i)^2 = \frac{1}{2k+1}\sum_{j=-k}^{k}\left[w^T(x_{i+j} - \bar{x}_i)\right]^2 = w^T \Sigma_i w,$$
where $2k$ is the number of data points closest to the $i$-th data point, and $\bar{x}_i$ and $\bar{y}_i$ denote the means over this local window. Therefore, $\Delta_i = \sqrt{w^T \Sigma_i w}$ actually captures the volatility in the local region around the $i$-th data point. In addition, $\Delta_i$ can also measure the local functional complexity around the $i$-th data point, since it reflects the smoothness of the corresponding local region. This is addressed in detail in Section 6.6.
By using the first meaning of $\Delta_i = \sqrt{w^T \Sigma_i w}$ (representing the local volatility), LSVR can systematically and automatically vary the tube. If the $i$-th data point lies in an area with a larger variance of noise, it contributes a larger $\varepsilon\sqrt{w^T \Sigma_i w}$, i.e. a larger local margin, which reduces the impact of the noise around that point. Conversely, if the $i$-th data point lies in a region with a smaller variance of noise, the local margin (tube), $\varepsilon\sqrt{w^T \Sigma_i w}$, is smaller, so the corresponding point contributes more to the fitting process. In comparison, the standard SVR adopts a fixed margin, which treats each point equally and therefore lacks the ability to tolerate changes in the noise level.
By exploiting the second property of $\Delta_i = \sqrt{w^T \Sigma_i w}$, namely that it measures the local functional complexity, LSVR controls the overall smoothness of the approximating function by minimizing the average of $\Delta_i$, as seen in Eq.(6.2). Intuitively, the margin around each point should be neither too large nor too small: if the margin is too large, the local data trend may not be captured because the data are over-tolerated; if the margin is too small, the local data trend may be over-emphasized, resulting in a highly zig-zag approximating curve. Therefore, by adding the regularization term, a trade-off can be achieved by adapting the parameter $C$.
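As a rough illustration of how the local margin reacts to volatility, the following sketch computes $\sqrt{w^T \Sigma_i w}$ for a window of $2k+1$ points. It assumes that the "closest" points are the $k$ neighbours on each side in index order and that the covariance is normalized by $2k+1$; the function name `local_margin` and the sample data are hypothetical and not part of the original model description.

```python
import math

def local_margin(X, w, i, k):
    """Return sqrt(w^T Sigma_i w) for the window x_{i-k}, ..., x_{i+k}.

    Assumption: neighbours are taken in index order and the local
    covariance is divided by the window size 2k + 1.
    """
    if i - k < 0 or i + k >= len(X):
        raise ValueError("window must lie inside the data")
    window = X[i - k:i + k + 1]
    n = len(window)
    mean = [sum(col) / n for col in zip(*window)]
    # w^T Sigma_i w equals the variance of the projections w^T x in the window.
    proj = [sum(wj * (xj - mj) for wj, xj, mj in zip(w, x, mean)) for x in window]
    return math.sqrt(sum(p * p for p in proj) / n)

# Hypothetical 1-D data: the second half is more spread out than the first.
X = [[0.0], [0.1], [0.2], [1.0], [3.0], [6.0]]
w = [2.0]
calm = local_margin(X, w, 1, 1)      # window 0.0, 0.1, 0.2
volatile = local_margin(X, w, 4, 1)  # window 1.0, 3.0, 6.0
```

In this toy setting the margin around the volatile window is larger than around the calm one, which is the behaviour the text attributes to LSVR.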
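$$\begin{aligned}
\min_{w,\,b,\,\xi_i,\,\xi_i^*}\quad & \sqrt{w^T \Sigma w} + C\sum_{i=1}^{N}(\xi_i + \xi_i^*), \\
\text{s.t.}\quad & y_i - (w^T x_i + b) \le \varepsilon\sqrt{w^T \Sigma w} + \xi_i, \\
& (w^T x_i + b) - y_i \le \varepsilon\sqrt{w^T \Sigma w} + \xi_i^*, \\
& \xi_i \ge 0,\ \xi_i^* \ge 0,\quad i = 1, \ldots, N.
\end{aligned} \tag{6.4}$$
Further, if $\Sigma = I$, we obtain:
$$\min_{w,\,b,\,\xi_i,\,\xi_i^*}\quad \|w\| + C\sum_{i=1}^{N}(\xi_i + \xi_i^*) \tag{6.5}$$
$$\begin{aligned}
\text{s.t.}\quad & y_i - (w^T x_i + b) \le \varepsilon\|w\| + \xi_i, \\
& (w^T x_i + b) - y_i \le \varepsilon\|w\| + \xi_i^*, \\
& \xi_i \ge 0,\ \xi_i^* \ge 0,\quad i = 1, \ldots, N.
\end{aligned} \tag{6.6}$$
The above optimization problem is very similar to the $\ell_1$-norm SVR, except that it has a margin related to the complexity term. In the following, we will prove that the above optimization is actually equivalent to the $\ell_1$-norm SVR in a meaningful sense.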
Lemma 6.1. The LSVR model with setting $\Sigma_i = I$ is equivalent to the $\ell_1$-norm SVR in the sense that: (1) Assuming a unique $\varepsilon_1$ exists for making the $\ell_1$-norm SVR optimal (i.e. setting $\varepsilon$ to $\varepsilon_1$ will make the objective function minimal), if for $\varepsilon_1$ the $\ell_1$-norm SVR achieves a solution $\{w_1^*, b_1^*\} = \mathrm{SVR}(\varepsilon_1)$, then the LSVR can produce the same solution by setting the parameter $\varepsilon = \varepsilon_1 / \|w_1^*\|$,
w1
w1
(6.9)
Since $\varepsilon_1$ is the unique $\varepsilon$ making the objective of SVR minimal, Eq.(6.9) implies that $w_2^* = w_1^*$.
In addition, if in LSVR we use the term $w^T \Sigma_i w$ instead of its square root as the structural risk or complexity risk, a similar proof shows that the $\ell_2$-norm SVR is equivalent to the special case of LSVR with $\Sigma_i = I$. In summary, the LSVR model actually contains the standard SVR model as a special case.
$$\max_{\rho,\ w \ne 0,\ b}\ \rho \tag{6.10}$$
$$\text{s.t.}\quad \frac{w^T x_i + b}{\sqrt{w^T \Sigma_x w}} \ge \rho, \quad i = 1, 2, \ldots, N_x, \tag{6.11}$$
$$\frac{-(w^T y_j + b)}{\sqrt{w^T \Sigma_y w}} \ge \rho, \quad j = 1, 2, \ldots, N_y, \tag{6.12}$$
where $\Sigma_x$ and $\Sigma_y$ refer to the covariance matrices of the $x$ and the $y$ data, respectively.
Within the framework of classification, M$^4$ considers different data trends for different classes. Analogously, in the LSVR model we allow different data trends for different regions, which is more suitable for the regression purpose.
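$$\min_{w,\,b,\,t_i,\,\xi_i,\,\xi_i^*}\quad \frac{1}{N}\sum_{i=1}^{N} t_i + C\sum_{i=1}^{N}(\xi_i + \xi_i^*), \tag{6.13}$$
$$\begin{aligned}
\text{s.t.}\quad & y_i - (w^T x_i + b) \le \varepsilon\sqrt{w^T \Sigma_i w} + \xi_i, \\
& (w^T x_i + b) - y_i \le \varepsilon\sqrt{w^T \Sigma_i w} + \xi_i^*,
\end{aligned} \tag{6.14}$$
$$\begin{aligned}
& \sqrt{w^T \Sigma_i w} \le t_i, \\
& t_i \ge 0,\ \xi_i \ge 0,\ \xi_i^* \ge 0,\quad i = 1, \ldots, N.
\end{aligned} \tag{6.15}$$
$$\begin{aligned}
\min_{w,\,b,\,t_i,\,\xi_i,\,\xi_i^*}\quad & \frac{1}{N}\sum_{i=1}^{N} t_i + C\sum_{i=1}^{N}(\xi_i + \xi_i^*), \\
\text{s.t.}\quad & y_i - (w^T x_i + b) \le \varepsilon t_i + \xi_i, \\
& (w^T x_i + b) - y_i \le \varepsilon t_i + \xi_i^*, \\
& \sqrt{w^T \Sigma_i w} \le t_i, \\
& t_i \ge 0,\ \xi_i \ge 0,\ \xi_i^* \ge 0,\quad i = 1, \ldots, N.
\end{aligned}$$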
6.5 Kernelization
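In this section, we extend the above linear regression model to a non-linear one by using Mercer's kernel. Suppose the training data are mapped into a kernel space or feature space by the mapping function $\varphi: \mathbb{R}^d \to \mathbb{R}^f$. Then, the optimization problem in the feature space becomes:
$$\min_{w,\,b,\,t_i,\,\xi_i,\,\xi_i^*}\quad \frac{1}{N}\sum_{i=1}^{N} t_i + C\sum_{i=1}^{N}(\xi_i + \xi_i^*), \tag{6.16}$$
$$\begin{aligned}
\text{s.t.}\quad & y_i - (w^T \varphi(x_i) + b) \le \varepsilon t_i + \xi_i, \\
& (w^T \varphi(x_i) + b) - y_i \le \varepsilon t_i + \xi_i^*, \\
& \sqrt{w^T \Sigma_i^{\varphi} w} \le t_i, \\
& t_i \ge 0,\ \xi_i \ge 0,\ \xi_i^* \ge 0,\quad i = 1, \ldots, N.
\end{aligned}$$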
In order to utilize Mercer's kernel, we first present the following theorem.
Theorem 6.2. If the corresponding local covariance $\Sigma_i^{\varphi}$ can be estimated from the mapped training data, i.e. $\bar{\varphi}_i$ and $\Sigma_i^{\varphi}$ can be written as
$$\bar{\varphi}_i = \frac{1}{2k+1}\sum_{j=-k}^{k}\varphi(x_{i+j}), \tag{6.17}$$
$$\Sigma_i^{\varphi} = \frac{1}{2k+1}\sum_{j=-k}^{k}\left(\varphi(x_{i+j}) - \bar{\varphi}_i\right)\left(\varphi(x_{i+j}) - \bar{\varphi}_i\right)^T, \tag{6.18}$$
where we only consider the $2k$ data points closest to the $i$-th data point, then the optimal $w$ lies in the span of the mapped training data.
Proof. Suppose $w = w_p + w_o$, where $w_p$ is the projection of $w$ onto the span of the mapped training data and $w_o$ is the component orthogonal to the span. Since $w_o^T \varphi(x_i) = 0$, $i = 1, \ldots, N$, it follows that
$$w^T \varphi(x_i) = w_p^T \varphi(x_i), \qquad w^T \Sigma_i^{\varphi} w = w_p^T \Sigma_i^{\varphi} w_p.$$
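$$w^T \varphi(x_i) = \sum_{j=1}^{N}\alpha_j K(x_i, x_j) = \alpha^T K_i,$$
$$w^T \Sigma_i^{\varphi} w = \alpha^T L_i^T L_i \alpha,$$
where
$$L_i = \frac{1}{\sqrt{2k+1}}\left(K_{[i-k:i+k,\,N]} - 1_{2k+1}\, l_i^T\right), \qquad K_{[i-k:i+k,\,N]} = \begin{pmatrix} K_{i-k,1} & \cdots & K_{i-k,N} \\ \vdots & \ddots & \vdots \\ K_{i+k,1} & \cdots & K_{i+k,N} \end{pmatrix},$$
$(l_i^T)_t = \frac{1}{2k+1}\sum_{j=-k}^{k} K(x_{i+j}, x_t)$, and $1_{2k+1}$ is a column vector of ones of dimension $2k + 1$.
Consequently, the corresponding optimization problem in Eq.(6.16) becomes:
$$\begin{aligned}
\min_{\alpha,\,b,\,t_i,\,\xi_i,\,\xi_i^*}\quad & \frac{1}{N}\sum_{i=1}^{N} t_i + C\sum_{i=1}^{N}(\xi_i + \xi_i^*), \\
\text{s.t.}\quad & y_i - (\alpha^T K_i + b) \le \varepsilon t_i + \xi_i, \\
& (\alpha^T K_i + b) - y_i \le \varepsilon t_i + \xi_i^*, \\
& \sqrt{\alpha^T L_i^T L_i \alpha} \le t_i, \\
& t_i \ge 0,\ \xi_i \ge 0,\ \xi_i^* \ge 0,\quad i = 1, \ldots, N.
\end{aligned}$$
Hence, we only need a kernel function in the optimization, without knowing a specific mapping function, and the problem can be solved easily by SOCP methods.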
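The identity behind the kernelized cone constraint can be checked numerically. The sketch below compares $w^T \Sigma_i w$ with $\alpha^T L_i^T L_i \alpha$ under two assumptions that are ours rather than the text's: neighbours are taken in index order, and a linear kernel is used so that $w = \sum_t \alpha_t x_t$ is available explicitly. All function names and the sample data are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal_quadratic(X, w, i, k):
    """w^T Sigma_i w computed directly from the window x_{i-k}, ..., x_{i+k}."""
    window = X[i - k:i + k + 1]
    n = len(window)
    mean = [sum(col) / n for col in zip(*window)]
    return sum(dot(w, [a - m for a, m in zip(x, mean)]) ** 2 for x in window) / n

def kernel_quadratic(X, alpha, i, k, kernel):
    """alpha^T L_i^T L_i alpha computed from kernel evaluations only."""
    n = 2 * k + 1
    N = len(X)
    # Rows i-k, ..., i+k of the kernel matrix.
    rows = [[kernel(X[s], X[t]) for t in range(N)] for s in range(i - k, i + k + 1)]
    # l_i: column-wise mean of those rows.
    l = [sum(row[t] for row in rows) / n for t in range(N)]
    # L_i alpha, with L_i = (rows - 1 l_i^T) / sqrt(2k + 1).
    L_alpha = [dot([a - b for a, b in zip(row, l)], alpha) / math.sqrt(n) for row in rows]
    return dot(L_alpha, L_alpha)

# Hypothetical data and expansion coefficients.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0], [4.0, 3.0]]
alpha = [0.5, -1.0, 0.25, 2.0, -0.5]
w = [sum(a * x[j] for a, x in zip(alpha, X)) for j in range(2)]

primal = primal_quadratic(X, w, 2, 1)
kernelized = kernel_quadratic(X, alpha, 2, 1, dot)
```

With a linear kernel the two quantities agree up to rounding, which is what allows the constraint to be written in terms of $\alpha$ alone.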
$$\text{where}\quad \delta_i = \begin{cases} 1, & \text{if } x_i \text{ appears}; \\ 0, & \text{otherwise}. \end{cases} \tag{6.20}$$
It is well known that the $\ell_0$-norm of a vector counts the number of elements different from zero. The complexity term can also be described as:
[f ] = wp0 .
(6.21)
(6.22)
one looks back on the LSVR model, minimizing $\frac{1}{N}\sum_{i=1}^{N}\sqrt{w^T \Sigma_i w}$ presents
sparse as possible.$^1$ Another advantage of using $\frac{1}{N}\sum_{i=1}^{N}\sqrt{w^T \Sigma_i w}$ is that
6.7 Experiments
In this section, we report experiments on both synthetic sinc datasets and real-world datasets. The SOCP problem associated with our LSVR model is solved by a general-purpose software package, SeDuMi [18, 19]. The SVR algorithm is performed by LIBSVM [1].
6.7.1 Evaluations on Synthetic Sinc Data
Fifty examples $(x_i, y_i)$ are generated from a sinc function [16], where $x_i$ are drawn uniformly from $[-3, 3]$, and $y_i = \sin(\pi x_i)/(\pi x_i) + \tau_i$, with $\tau_i$ drawn from a Gaussian with zero mean and variance $\sigma^2$. Two cases are evaluated. One is with $\sigma = 0$. In the other case, the standard deviation of the data increases linearly from 0.5 at $x = -3$ to 1.5 at $x = 3$, so the variance of noise clearly differs between regions. We use the default parameters $C = 100$ and the RBF kernel $K(u, v) = \exp(-\|u - v\|^2)$.
Table 6.1 reports the average results over 100 random trials with different $\varepsilon$ values. Fig. 6.2 illustrates the difference between the LSVR model and the SVR algorithm when $\varepsilon = 0.2$. For Case I, $\sigma = 0.0$, the LSVR model can adjust the tube automatically to fit the data with a smaller Mean Square Error (MSE), which can be seen in Fig. 6.2(c). However, with a fixed tube, the SVR algorithm lacks this flexibility (see Fig. 6.2(a)). As a consequence, the MSE of SVR increases as $\varepsilon$ increases. As reported in Table 6.1, when $\varepsilon \ge 0.8$, there are no support vectors in SVR and the MSE is the largest. In Case II, the LSVR model has smaller MSEs and smaller STDs for all values of $\varepsilon$. Fig. 6.2(d) also shows that the approximating function obtained by LSVR is smoother than that obtained by SVR.
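The second synthetic case can be reproduced approximately with a few lines of code. The sketch below is only an illustration: it assumes the normalized sinc $\sin(\pi x)/(\pi x)$, a noise standard deviation interpolated linearly between 0.5 at $x = -3$ and 1.5 at $x = 3$, and a fixed random seed; the function names are hypothetical.

```python
import math
import random

def sinc(x):
    """Normalized sinc, sin(pi x) / (pi x), with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)

def noise_std(x, left=0.5, right=1.5):
    """Standard deviation growing linearly from `left` at x = -3 to `right` at x = 3."""
    return left + (right - left) * (x + 3.0) / 6.0

def make_sinc_data(n=50, seed=0):
    """Draw n points uniformly from [-3, 3] and add heteroscedastic Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-3.0, 3.0)
        y = sinc(x) + rng.gauss(0.0, noise_std(x))
        data.append((x, y))
    return data

samples = make_sinc_data()
```

Setting both `left` and `right` to zero in `noise_std` gives the noise-free first case.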
$^1$ $\frac{1}{N}\sum_{i=1}^{N}\sqrt{w^T \Sigma_i w}$ would be smaller.
Table 6.1. Experimental results (MSE±STD) of the LSVR model and the SVR algorithm on the sinc data with different $\varepsilon$ values
Case I: $\sigma = 0.0$
LSVR
0.0
SVR
SVR
0 0.1825±0.1011 0.3101±0.1165
0.2 0.0004
0.4 0.0016
0.6 0.0044
0.8 0.0082
1.0 0.0125
2.0 0.0452
DJIA
Train
Test
NASDAQ
Train
Test
S&P500
Train
Test
Mean
0.0000 0.3858
S.D.
1.0000
0.9957
1.0000
1.1312
1.0000
Skew
0.0678
0.1684
0.0928
Kurt
2.5437
2.7706
2.6600
1.8631
2.5308
1.1298
2.4124
Following the procedure in [15], we convert the daily closing prices ($d_t$) of these indices to continuously compounded returns ($r_t = \log(d_{t+1}/d_t)$) and set the ratio of the number of training returns to the number of test returns to 5 : 1. We normalize these return series by $R_t = (r_t - \mathrm{Mean}(r_t))/\mathrm{SD}(r_t)$, where the means and standard deviations are computed for each individual index over the training period.
We compare the performance of the LSVR model against the SVR. The predicted system is modelled as $\hat{R}_t = f(x_t)$, where $x_t$ takes the previous four days' normalized returns as indicators, i.e. $x_t = (R_{t-4}, R_{t-3}, R_{t-2}, R_{t-1})$. This simple setting follows the suggestion in [15] that a suitable number of sequential values is four. We then apply the modelled function $f$ to test the performance by one-step-ahead prediction. The trade-off parameter $C$ and the parameter $\beta$ of the RBF kernel ($K(u, v) = \exp(-\beta\|u - v\|^2)$), $(C, \beta)$, are obtained by five-fold cross-validation with SVR on the following grid: $[2^{-5}, 2^{-4}, \ldots, 2^{10}] \times [2^{-5}, 2^{-4}, \ldots, 2^{10}]$.
We obtain the corresponding parameters as $(2^4, 2^3)$ for DJIA, $(2^3, 2^1)$ for NASDAQ, and $(2^0, 2^2)$ for S&P500.
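The preprocessing described above (log returns, normalization with training-period statistics, and four lagged inputs) can be sketched as follows. This is a minimal illustration under stated assumptions: the sample standard deviation (divisor $m - 1$) is used because the text does not say which estimator was applied, and all function names are hypothetical.

```python
import math

def log_returns(prices):
    """Continuously compounded returns r_t = log(d_{t+1} / d_t)."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

def fit_normalizer(train_returns):
    """Mean and standard deviation estimated on the training period only."""
    m = len(train_returns)
    mean = sum(train_returns) / m
    # Assumption: sample standard deviation (divisor m - 1).
    sd = math.sqrt(sum((r - mean) ** 2 for r in train_returns) / (m - 1))
    return mean, sd

def normalize(returns, mean, sd):
    """R_t = (r_t - mean) / sd, using the training-period statistics."""
    return [(r - mean) / sd for r in returns]

def lagged_samples(R, lags=4):
    """Pairs (x_t, R_t) with x_t = (R_{t-lags}, ..., R_{t-1})."""
    return [(tuple(R[t - lags:t]), R[t]) for t in range(lags, len(R))]
```

A typical use would split the return series 5 : 1, call `fit_normalizer` on the first part, apply `normalize` to both parts, and build the regression samples with `lagged_samples`.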
As suggested in [15], there is a relationship among the values of five sequential days. We therefore select $k = 2$, i.e. five days' values, to model the local volatility. Since there are no support vectors in SVR when $\varepsilon \ge 2.0$, we set the $\varepsilon$ values to 0.0, 0.2, ..., 1.0, and 2.0. The corresponding results are reported in Table 6.3. As observed, the LSVR model demonstrates a consistent superiority over the SVR algorithm, even though the paired parameters $(C, \beta)$ are not tuned for our LSVR model. Furthermore, a paired t-test [13] performed on the best results of both models in Table 6.3 shows that the LSVR model outperforms SVR at the $\alpha = 10\%$ significance level for a one-tailed test.
Table 6.3. Experimental results of the LSVR model and the SVR algorithm on the financial data with different $\varepsilon$ values
DJIA
NASDAQ
LSVR SVR
LSVR SVR
S&P500
LSVR
SVR
6.8 Summary
In this chapter, we propose a Local Support Vector Regression model. Different from the standard Support Vector Regression model, our model offers a systematic and automatic scheme to adapt the margin locally and flexibly, so it can tolerate noise adaptively. We demonstrate that the model not only captures the local information of the data in approximating functions, but also includes models similar to the standard SVR as special cases. The experiments conducted on sinc datasets and on three stock market indices show that our model outperforms the standard SVR.
One direction for future work is to investigate efficient methods that directly solve the original optimization of LSVR instead of a relaxed form. In addition, quantitative theoretical and empirical comparisons between the true solution and the approximate relaxed solution are also valuable topics for future research.
References
1. Chang CC, Lin CJ (2001) LIBSVM: A Library for Support Vector Machines
2. Chen S (1995) Basis Pursuit. PhD thesis, Department of Statistics, Stanford University
7
Extension III: Variational Margin Settings
within Local Data in Support Vector
Regression
Rreg (f ) =
(7.1)
where $\langle\cdot,\cdot\rangle$ denotes the inner product. The Euclidean norm $\langle w, w\rangle$ measures the flatness of the function $f$. Minimizing $\langle w, w\rangle$ makes the regression function as flat as possible [16].
The function $f$ is then defined as
$$f(x, w, b) = \langle w, \varphi(x)\rangle + b, \tag{7.2}$$
where $\varphi(x): x \to \Omega$ maps $x \in X(\mathbb{R}^d)$ into a high (possibly infinite) dimensional space $\Omega$, and $b \in \mathbb{R}$.
There are several loss functions that could be used to measure the regression error, e.g. the squared loss function, Huber's loss function, the $\varepsilon$-insensitive loss function, etc. In SVR, the $\varepsilon$-insensitive loss function is used to measure the loss [19] (illustrated in Fig. 7.1):
$$l_\varepsilon(y, f(x)) = \begin{cases} 0, & \text{if } |y - f(x)| < \varepsilon; \\ |y - f(x)| - \varepsilon, & \text{otherwise}. \end{cases} \tag{7.3}$$
The advantage of this loss function is that it implicitly affects the search for the regression function.
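For concreteness, the $\varepsilon$-insensitive loss can be written as a small function. The name `eps_insensitive_loss` and the sample values are hypothetical; the sketch only restates the definition that residuals inside the tube cost nothing and residuals outside it are charged linearly.

```python
def eps_insensitive_loss(y, f_x, eps):
    """Return 0 inside the tube |y - f(x)| < eps, and |y - f(x)| - eps outside it."""
    residual = abs(y - f_x)
    return 0.0 if residual < eps else residual - eps

# Hypothetical values: a small error is ignored, a large one is charged linearly.
inside = eps_insensitive_loss(1.0, 1.03, 0.04)
outside = eps_insensitive_loss(1.0, 1.5, 0.25)
```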
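$$\min_{w,\,b,\,\xi^{(*)}}\quad \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^{N}(\xi_i + \xi_i^*), \tag{7.4}$$
subject to
$$\begin{aligned}
& y_i - (\langle w, \varphi(x_i)\rangle + b) \le \varepsilon + \xi_i, \\
& (\langle w, \varphi(x_i)\rangle + b) - y_i \le \varepsilon + \xi_i^*, \\
& \xi_i^{(*)} \ge 0.
\end{aligned} \tag{7.5}$$
$$Q(\alpha^{(*)}) = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\langle\varphi(x_i), \varphi(x_j)\rangle + \sum_{i=1}^{N}(\varepsilon - y_i)\alpha_i + \sum_{i=1}^{N}(\varepsilon + y_i)\alpha_i^*, \tag{7.6}$$
subject to
$$\sum_{i=1}^{N}(\alpha_i - \alpha_i^*) = 0, \qquad \alpha^{(*)} \in [0, C]. \tag{7.7}$$
$$f(x) = \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)\langle\varphi(x_i), \varphi(x)\rangle + b,$$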
where $\alpha$, $\alpha^*$ are the Lagrange multipliers used to pull and push $f$ towards the observation $y$. Those sample points $(x_i, y_i)$ with nonzero $\alpha_i$ or $\alpha_i^*$ are called support vectors.
By using the kernel trick, one can define the kernel function as the inner product of the mapping function, i.e. $K(x, z) = \langle\varphi(x), \varphi(z)\rangle$. Therefore, one only needs to specify a kernel function without considering the mapping function or the feature space explicitly. The kernel function should satisfy Mercer's Theorem [6, 14].
Four kernel functions are commonly used:
Linear function: $K(x_k, x_l) = \langle x_k, x_l\rangle$;
Polynomial function with parameter $d$: $K(x_k, x_l) = (\langle x_k, x_l\rangle + 1)^d$;
Radial Basis Function (RBF) with parameter $\beta$:
$$K(x_k, x_l) = \exp(-\beta\|x_k - x_l\|^2), \tag{7.8}$$
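The linear, polynomial and RBF kernels listed above are easy to state in code. The sketch below is illustrative only; the function names and the parameter name `beta` for the RBF width are assumptions, since the text does not fix an implementation.

```python
import math

def linear_kernel(x, z):
    """Inner product <x, z>."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, d):
    """(<x, z> + 1)^d for a polynomial degree d."""
    return (linear_kernel(x, z) + 1.0) ** d

def rbf_kernel(x, z, beta):
    """exp(-beta * ||x - z||^2); `beta` is our name for the width parameter."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-beta * squared_distance)
```

Any of these functions can be passed wherever the dual problem needs $K(x_i, x_j)$, so the feature map never has to be formed explicitly.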
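7.3 General $\varepsilon$-insensitive Loss Function
First, we note that the margin in the $\varepsilon$-insensitive loss function has two characteristics: it is fixed and symmetrical. Based on these two characteristics, we have proposed a general $\varepsilon$-insensitive loss function and classified the margin into four cases in [22]: Fixed and Symmetrical Margin (FASM), Fixed and Asymmetrical Margin (FAAM), Non-fixed and Symmetrical Margin (NASM), and Non-fixed and Asymmetrical Margin (NAAM).

             Symmetrical   Asymmetrical
Fixed        FASM          FAAM
Non-fixed    NASM          NAAM

$$l'_\varepsilon(f(x_i) - y_i) = \begin{cases} 0, & \text{if } -d(x_i) < y_i - f(x_i) < u(x_i); \\ y_i - f(x_i) - u(x_i), & \text{if } y_i - f(x_i) \ge u(x_i); \\ f(x_i) - y_i - d(x_i), & \text{if } f(x_i) - y_i \ge d(x_i), \end{cases} \tag{7.9}$$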
where $d(x_i), u(x_i) \ge 0$ are two functions determining the down margin and up margin at point $x_i$, respectively. When $d(x)$ and $u(x)$ are both constant functions and $d(x) = u(x)$, Eq.(7.9) amounts to the $\varepsilon$-insensitive loss function in Eq.(7.3), and we label this case FASM (Fixed and Symmetrical Margin). When $d(x)$ and $u(x)$ are both constant functions but $d(x) \ne u(x)$, the case is labeled FAAM (Fixed and Asymmetrical Margin). In the case of NASM (Non-fixed and Symmetrical Margin), $d(x) = u(x)$ but both vary with the data. The last case has a non-fixed and asymmetrical margin (NAAM), where $d(x)$ and $u(x)$ vary with the data and $d(x) \ne u(x)$.
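The four margin cases differ only in how the up margin $u(x)$ and down margin $d(x)$ are chosen, so a single loss function covers all of them. The sketch below assumes the piecewise form implied by the margin constraints (no cost between $-d(x)$ and $u(x)$, linear cost outside); the function name and the sample numbers are hypothetical.

```python
def general_loss(y, f_x, up, down):
    """Loss with an up margin `up` = u(x) and a down margin `down` = d(x).

    Assumed form: zero when -down < y - f(x) < up, linear outside that band.
    """
    residual = y - f_x
    if residual >= up:        # actual value above the upper margin
        return residual - up
    if -residual >= down:     # actual value below the lower margin
        return -residual - down
    return 0.0

# FASM: equal constant margins reproduce the epsilon-insensitive loss.
fasm = general_loss(5.0, 3.0, 1.0, 1.0)
# FAAM: constant but unequal margins tolerate errors on one side more.
faam = general_loss(3.0, 5.0, 1.0, 4.0)
```

For NASM and NAAM the same function is called with margins that change from point to point.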
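In the same way, we use the standard method to find the solution of Eq.(7.1) with the cost function of Eq.(7.9), as in [19], and obtain:
$$\min_{w,\,b,\,\xi^{(*)}}\quad \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^{N}(\xi_i + \xi_i^*), \tag{7.10}$$
subject to
$$\begin{aligned}
& y_i - \langle w, \varphi(x_i)\rangle - b \le u(x_i) + \xi_i, \\
& \langle w, \varphi(x_i)\rangle + b - y_i \le d(x_i) + \xi_i^*, \\
& \xi^{(*)} \ge 0.
\end{aligned}$$
Using the standard primal-dual method as above, we also obtain a QP problem as follows:
$$\min_{\alpha^{(*)}}\quad \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\langle\varphi(x_i), \varphi(x_j)\rangle + \sum_{i=1}^{N}(u(x_i) - y_i)\alpha_i + \sum_{i=1}^{N}(d(x_i) + y_i)\alpha_i^*, \tag{7.11}$$
subject to
$$\sum_{i=1}^{N}(\alpha_i - \alpha_i^*) = 0, \qquad \alpha_i, \alpha_i^* \in [0, C].$$
$$f(x) = \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)\langle\varphi(x_i), \varphi(x)\rangle + b, \tag{7.12}$$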
When no i
i = 1, . . . , N,
i = 1, . . . , N,
(7.13)
$\Delta(x) < 0$, the up margin is smaller than the down margin and we can over-predict the stock price. A simple illustration is shown in Fig. 7.3. Based on these observations, in our prediction we assume a risk-averse, or downside-risk-averse, attitude. When the stock price shows an uptrend, we know that it will not rise forever, so we tend to under-predict the stock prices in this case. On the contrary, when the stock price goes down, we tend to over-predict it. We add this information to the margin setting by controlling the momentum term.
Fig. 7.3. Margin settings: dashed lines are the bounds of margins; dashed-dotted lines are actual data series; solid-bold lines are the new objective function, $f^{new}$, by new margin settings. The upper shadow area is the case where the new objective function under-predicts the actual function; the lower shadow parts are the case of over-prediction
Actually, there are many ways to calculate the momentum. For example, the simplest way is to set it as a constant. In this chapter, we concentrate on the Exponential Moving Average (EMA). The reason for using the EMA is that it is time-varying and can reflect the uptrend and downtrend of the financial data. A minor deficiency is that it lags behind the data. An $n$-day EMA sequence begins from the first day, i.e. $EMA_1 = y_1$, and the following values are calculated by:
$$EMA_i = EMA_{i-1} \times (1 - r) + y_i \times r,$$
where $r = 2/(1 + n)$, and $y_i$ is the information about day $i$, e.g. the closing price on day $i$, the volume on day $i$, etc. Here, the current day's momentum is set as the difference between the current day's EMA and the EMA $k$ days earlier, i.e.
$$\Delta(x_i) = EMA_i - EMA_{i-k}.$$
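The EMA recursion and the momentum derived from it can be sketched in a few lines. The function names are hypothetical, and the series is assumed to be a plain list of daily values with the first value used as the starting EMA, as stated in the text.

```python
def ema(values, n):
    """n-day exponential moving average with EMA_1 = y_1 and r = 2 / (1 + n)."""
    r = 2.0 / (1.0 + n)
    out = [values[0]]
    for y in values[1:]:
        out.append(out[-1] * (1.0 - r) + y * r)
    return out

def momentum(values, n, k):
    """Difference between the current EMA and the EMA k days earlier.

    The result starts at day k, because earlier days have no EMA k days back.
    """
    e = ema(values, n)
    return [e[i] - e[i - k] for i in range(k, len(e))]

# Hypothetical closing prices in an uptrend: the momentum is positive.
prices = [1.0, 3.0, 5.0]
trend = momentum(prices, 3, 1)
```

A positive momentum signals an uptrend and a negative one a downtrend, which is the information the margin setting uses.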
7.4.2 GARCH
In the above methods, the datasets used in the experiments are share prices [22, 23]. We use the standard deviation of the input $x_t$, which can reflect the volatility of the financial time series over time, to determine the width of the margin at time $t$ in our prediction. In practice, the Generalized AutoRegressive Conditionally Heteroscedastic (GARCH) model [3] is a more commonly used model for the volatility of financial time series.
The standard GARCH($p$, $q$) model with Gaussian shocks takes the following form:
$$y_t = c_0 + x_t^T b + \varepsilon_t, \qquad \varepsilon_t \mid \psi_{t-1} = N(0, \sigma_t^2),$$
where
$$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{p}\beta_i\,\sigma_{t-i}^2 + \sum_{j=1}^{q}\alpha_j\,\varepsilon_{t-j}^2.$$
The GARCH toolbox is applied to the return series. We therefore use the continuously compounded return as the data series and use the $\sigma_t$ calculated by GARCH(1,1) as the width of the margin at time $t$.
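Once the GARCH(1,1) coefficients have been estimated, the conditional variance follows from a one-line recursion. The sketch below is a minimal illustration, not the toolbox routine: the argument names `const`, `garch` and `arch` are ours, and the starting variance must be supplied by the caller (the unconditional variance is one common choice, offered here as an assumption).

```python
import math

def garch11_variance(residuals, const, garch, arch, initial_variance):
    """Conditional variances with
    sigma2[t] = const + garch * sigma2[t-1] + arch * residuals[t-1] ** 2.
    """
    sigma2 = [initial_variance]
    for t in range(1, len(residuals)):
        sigma2.append(const + garch * sigma2[t - 1] + arch * residuals[t - 1] ** 2)
    return sigma2

def unconditional_variance(const, garch, arch):
    """Long-run variance const / (1 - garch - arch), valid when garch + arch < 1."""
    return const / (1.0 - garch - arch)

def margin_widths(sigma2):
    """Standard deviations sigma_t, usable as margin widths at each time step."""
    return [math.sqrt(v) for v in sigma2]
```

A large shock at time $t-1$ raises the variance at time $t$, so the margin widens right after volatile days and shrinks again in calm periods.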
7.5 Experiments
In this section, we perform experiments using the momentum and GARCH models to set the margins. Before describing the experiments, we first define the accuracy and risk measurements.
7.5.1 Accuracy Metrics and Risk Measurement
In order to measure the prediction performance of our model, we define the Mean Absolute Error (MAE).
Let $a_t$ and $p_t$ be the actual and predicted values at day $t$, respectively, and let $m$ be the number of testing data.
Definition 7.1. Mean Absolute Error (MAE) measures the discrepancy between the actual and predicted values; the smaller the value of MAE, the closer the predicted values are to the actual values. MAE is calculated by:
$$\mathrm{MAE} = \frac{1}{m}\sum_{t=1}^{m}|a_t - p_t|. \tag{7.14}$$
We also consider the risk of using this model in prediction. Risk is a term frequently encountered in the strategic management and financial literature. However, risk has a variety of different meanings, and the meaning used in a particular project is rarely clarified [2]. In the financial literature, Markowitz first formulated portfolio selection as a mathematical model [8]. In his model, the return of a portfolio is measured by the expected value of the random portfolio return, and the associated risk is quantified by the variance of the portfolio return. However, the use of variance to measure risk makes no distinction between gains and losses. Markowitz
also proposed to use semi-variance to measure the risk of loss, that is, the sum of the squares of the negative deviations from the mean divided by the total number of observations:
$$\frac{1}{m}\sum_{t=1}^{m}\left[\min(r_t - \mu, 0)\right]^2.$$
$$\text{downside risk} = \frac{1}{m}\sum_{t=1}^{m}\left[\min(r_t - \tau, 0)\right]^k, \tag{7.15}$$
where $k$ is any power that one chooses (when $k = 1$, the absolute value of the term in the brackets should be taken) and $\tau$ is a chosen benchmark (not necessarily the mean).
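Based on Eq.(7.15), we choose $k = 1$ and define the following risk measurements.
Definition 7.2. Upside Mean Absolute Error (UMAE) measures upside risk; the smaller the value of UMAE, the smaller the upside risk. UMAE is defined as:
$$\mathrm{UMAE} = \frac{1}{m}\sum_{\substack{t=1 \\ a_t \ge p_t}}^{m}(a_t - p_t). \tag{7.16}$$
Analogously, the Downside Mean Absolute Error (DMAE) is defined as:
$$\mathrm{DMAE} = \frac{1}{m}\sum_{\substack{t=1 \\ a_t < p_t}}^{m}(p_t - a_t). \tag{7.17}$$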
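The three measures can be computed together, since the upside and downside parts split the absolute error between days on which the model under-predicts and days on which it over-predicts. The function name and the sample numbers below are hypothetical.

```python
def error_metrics(actual, predicted):
    """Return (MAE, UMAE, DMAE) for paired actual and predicted values.

    UMAE sums a_t - p_t over days with a_t >= p_t, DMAE sums p_t - a_t over
    days with a_t < p_t; both are divided by the total number of days m.
    """
    m = len(actual)
    umae = sum(a - p for a, p in zip(actual, predicted) if a >= p) / m
    dmae = sum(p - a for a, p in zip(actual, predicted) if a < p) / m
    return umae + dmae, umae, dmae

# Hypothetical test period of four days.
mae, umae, dmae = error_metrics([10.0, 12.0, 9.0, 11.0], [9.0, 13.0, 9.0, 14.0])
```

By construction MAE equals UMAE plus DMAE, which gives a quick consistency check on reported results.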
7.5.2 Momentum
We compare the modified SVR algorithm, which adapts the margins using momentum, with the AutoRegression (AR) model and the Radial Basis Function (RBF) network. The results for the three algorithms are presented one by one below.
HSI
02/01/1998 04/07/2000
16000
227
DJIA
02/01/1998 29/06/2000
8000
222
the smallest in all cases of NAAM for dataset HSI. For dataset DJIA, when the length equals 30, the MAE and the DMAE are also the smallest in all cases of NAAM.
Table 7.3. Effect of the length of EMA on HSI and DJIA with parameters (lag k, momentum coefficient) = (1, 1)
Type   n     HSI                          DJIA
             MAE      UMAE     DMAE       MAE     UMAE    DMAE
NASM         216.78   104.58   112.20     85.33   40.29   45.04
NAAM   10    222.43   115.64   106.79     85.68   43.13   42.55
NAAM   30    218.18   114.04   104.14     84.12   41.82   42.30
NAAM   50    217.93   113.38   104.55     84.57   42.12   42.45
NAAM   100   216.50   113.04   103.46     84.80   42.41   42.39
In the following, we use the best length of EMA from the above experiments for the corresponding datasets, i.e. $n = 100$ for dataset HSI and $n = 30$ for dataset DJIA.
(b) When testing the effect of the lag $k$, we set the momentum coefficient to 1 and set $k$ to 1, 2, 4, 8, respectively, for both datasets. The results are listed in Table 7.4. They show that the MAE increases as the lag of the EMA increases, so the results with a lag of 1 are superior to the other cases.
Table 7.4. Effect of the distance of EMA on HSI and DJIA
k    HSI (n = 100)                DJIA (n = 30)
     MAE      UMAE     DMAE       MAE      UMAE    DMAE
1    216.50   113.04   103.46     84.12    41.82   42.30
2    219.02   125.30   93.72      85.42    43.91   41.51
4    228.25   149.36   78.88      90.99    49.16   41.83
8    260.73   200.74   59.99      103.77   58.03   45.74
(c) Here, we set $k = 1$ and the momentum coefficient to 1, 1/2, 1/4, 1/8, respectively, for both datasets to see the effect of the coefficient. From Table 7.5, we see that the DMAE increases gradually as the coefficient decreases and that the MAE is smaller than the value in the NASM case. The MAE for dataset HSI (columns 2-4 of Table 7.5) fluctuates, while the MAE for dataset DJIA (columns 5-7 of Table 7.5) increases gradually as the coefficient decreases.
Table 7.5. Effect of the coefficient of momentum on HSI and DJIA
Coefficient   HSI with (n, k) = (100, 1)   DJIA with (n, k) = (30, 1)
              MAE      UMAE     DMAE       MAE     UMAE    DMAE
1             216.50   113.04   103.46     84.12   41.82   42.30
1/2           216.55   108.97   107.58     84.88   41.32   43.56
1/4           216.19   106.36   109.83     85.02   41.14   43.88
1/8           216.41   105.32   111.08     85.22   40.86   44.36
We also plot the daily closing prices of HSI with the 100-day EMA and the prices of DJIA with the 30-day EMA in Fig. 7.4 and Fig. 7.5, respectively. Table 7.6 lists the Average Standard Deviation (ASD) of the input $x$ of the training datasets HSI and DJIA, together with the Average of Absolute Momentums (AAM) of the input $x$ for the best EMA length of each training dataset. We can observe that the ASD of HSI is higher than that of DJIA and that the ratio of AAM to ASD is smaller for HSI than for DJIA.
Table 7.6. ASD and AAM
Dataset   ASD      n     AAM     Ratio
HSI       182.28   100   20.80   0.114
DJIA      79.95    30    15.64   0.196
We now summarize the above experiments. First, the experimental results show the effects of $n$, $k$, and the momentum coefficient. Following these results, a suitable setting for both $k$ and the momentum coefficient is 1, which can be applied when a new dataset arrives. The only parameter that remains to be determined is the length of the EMA, $n$, which may be chosen with reference to the ASD of the training dataset: when the ASD is larger, we may use a longer EMA; when the ASD is smaller, we may use a shorter EMA.
Fixed Cases: After considering the non-fixed margin cases, we also test the predictive results of fixed margins. Actually, for dataset HSI, we let
$u(x)$ ranges from 0 to 90, where each increment is again one-tenth of the range, i.e. 9. The results are listed in columns 6-10 of Table 7.7. We can see that for both datasets, as the up margin increases, the DMAE tends to decrease.
Table 7.7. Results of FASM and FAAM for HSI and DJIA
HSI [u(x)+d(x)]                              DJIA [u(x)+d(x)]
u(x)   d(x)   MAE      UMAE     DMAE         u(x)   d(x)   MAE     UMAE    DMAE
0      200    236.04   62.24    173.80       0      90     91.63   20.45   71.18
20     180    230.85   69.65    161.20       9      81     89.14   23.70   65.44
40     160    226.29   77.37    148.92       18     72     87.35   27.31   60.04
60     140    222.24   85.34    136.90       27     63     86.09   31.18   54.91
80     120    219.35   93.90    125.45       36     54     85.30   35.28   50.02
100    100                      114.69       45     45     85.45   39.86   45.59
120    80     217.35   112.90   104.45       54     36     86.33   44.80   41.53
140    60     217.88   123.16   94.72        63     27     87.40   49.83   37.57
160    40     219.49   133.97   85.52        72     18     88.64   54.95   33.69
180    20     221.66   145.05   76.61        81     9      90.80   60.53   30.27
200    0      224.83   156.64   68.19        90     0      93.75   66.51   27.24
Comparing the results in Table 7.3 with those in Table 7.7 (the experimental results are plotted in Fig. 7.6(b) and Fig. 7.7(b), respectively), we can see that NASM and NAAM are both superior to FASM and FAAM on both datasets.
In the following, we apply other models, namely AR models and the RBF network, to the above two datasets. The best results of all the models are illustrated in Fig. 7.6(a) for HSI and Fig. 7.7(a) for DJIA.
7.5.2.2 AR Models
For AR models, we use an AR model of order 4 to predict the prices of HSI and DJIA, so that we can compare the AR model with NASM and NAAM in SVR with the same order. The results are listed in Table 7.8. From these results, we can see that NASM and NAAM are superior to the AR model with the same order.
Dataset   MAE      UMAE     DMAE
HSI       217.75   105.96   111.79
DJIA      88.74    46.36    42.38
Hidden No.   DJIA
             88.31   44.60   43.71
             98.44   48.46   49.98
             90.53   46.22   44.31
             87.23   44.09   43.14
7.5.3 GARCH
In this experiment, the experimental data are three years of daily closing indices (2000–2002) from stock markets in different countries:
Nikkei225: Nikkei225 Stock Average from Japan; the daily closing prices are plotted in Fig. 7.11(a);
DJIA00-02: Dow Jones Industrial Average (DJIA) from the USA; the daily closing prices are plotted in Fig. 7.13(a);
FTSE100: FTSE100 index from the UK; the daily closing prices are plotted in Fig. 7.15(a).
In the data processing step, the daily closing prices of these indices are converted to continuously compounded returns, and the ratio of the number of training data to the number of testing data is set to 5:1. The corresponding training and testing periods are listed in Table 7.10.
Table 7.10. GARCH experimental data description

Indices      Training period   Testing period
Nikkei225
DJIA00-02
FTSE100
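The data processing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the split assumes the first five-sixths of the return series (in time order) is used for training.

```python
import math


def log_returns(prices):
    """Continuously compounded returns r_t = ln(P_t / P_{t-1})."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


def split_five_to_one(values):
    """Split a time-ordered series into training and testing parts at 5:1."""
    cut = len(values) * 5 // 6
    return values[:cut], values[cut:]
```

Keeping the split chronological avoids training on observations that come after the test period.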
7.5.3.1 GARCH(1, 1)
We use the Matlab toolbox to estimate the GARCH model. Before running the SVR algorithm, we run the GARCH(1,1) model to determine the width of the margin in SVR. For Nikkei225, the parameter estimates and their standard errors are given in Table 7.11, i.e. the best fit for Nikkei225 by GARCH(1,1) is:
$$y_t = 0.49468 + \varepsilon_t,$$
$$\sigma_t^2 = 0.00073917 + 0.8682\,\sigma_{t-1}^2 + 0.077218\,\varepsilon_{t-1}^2.$$
Table 7.11

Parameter   Value        Standard error   T statistic
c0          0.49468      0.0045008        109.9083
            0.00073917   0.00034866       2.1200
GARCH(1)    0.8682       0.048144         18.0334
ARCH(1)     0.077218     0.027279         2.8306
Fig. 7.8. GARCH(1,1) of Nikkei225. The color-coded bar at the right of (a) indicates the height of the log-likelihood surface of the GARCH(1,1) plane
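To make the fitted model concrete, the following sketch runs the GARCH(1,1) variance recursion with the Nikkei225 estimates reported above. It is an illustration only: the function name is hypothetical, the three return values are made-up inputs, and starting the recursion from the unconditional variance is an assumption rather than something stated in the text.

```python
def garch11_variances(returns, mean, omega, beta, alpha, initial_variance=None):
    """Conditional variances for t = 1 .. len(returns) + 1.

    Implements sigma_t^2 = omega + beta * sigma_{t-1}^2 + alpha * eps_{t-1}^2
    with residuals eps_t = y_t - mean.
    """
    if initial_variance is None:
        # Assumption: start from the unconditional variance.
        initial_variance = omega / (1.0 - alpha - beta)
    variances = [initial_variance]
    for y in returns:
        residual = y - mean
        variances.append(omega + beta * variances[-1] + alpha * residual ** 2)
    return variances


# Nikkei225 estimates from the fitted model above; the returns are illustrative.
nikkei = dict(mean=0.49468, omega=0.00073917, beta=0.8682, alpha=0.077218)
sigma2 = garch11_variances([0.50, 0.47, 0.52], **nikkei)
```

The last entry of `sigma2` is the one-step-ahead conditional variance; its square root is the volatility estimate from which a margin width can be derived.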
Parameter   Value        Standard error   T statistic
c0          0.60363      0.0041185        146.5631
            0.00056832   0.00023491       2.4193
GARCH(1)    0.85971      0.031773         27.0580
ARCH(1)     0.092295     0.020352         4.5350
Fig. 7.9. GARCH(1,1) of FTSE100. The color-coded bar at the right of (a) indicates the height of the log-likelihood surface of the GARCH(1,1) plane
$$y_t = 0.50444 + \varepsilon_t,$$
$$\sigma_t^2 = 0.0011599 + 0.82253\,\sigma_{t-1}^2 + 0.12693\,\varepsilon_{t-1}^2.$$
Parameter   Value        Standard error   T statistic
c0          0.50444      0.0053313        94.6180
            0.0011599    0.00049206       2.3573
GARCH(1)    0.82253      0.04906          16.7658
ARCH(1)     0.12693      0.034698         3.6582
Fig. 7.10. GARCH(1,1) of DJIA00-02. The color-coded bar at the right of (a)
indicates the height of the log-likelihood surface of the GARCH(1,1) plane
we train on the normalized training data once and then obtain the normalized predicted return value $pn_i = f(x_i)$, where $x_i = (t_{i-4}, t_{i-3}, t_{i-2}, t_{i-1})$. Finally, we unnormalize $pn_i$, convert the result to a price, and obtain the corresponding predicted price $p_i$.
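The surrounding pipeline (normalize, build order-4 inputs, unnormalize, convert back to price) might look like the sketch below. Several pieces are assumptions, since the text does not spell them out: z-score normalization is only one plausible scheme, all function names are hypothetical, and the price conversion simply inverts a continuously compounded return.

```python
import math


def fit_scaler(train_values):
    """Mean and standard deviation of the training returns (z-score assumption)."""
    n = len(train_values)
    mean = sum(train_values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in train_values) / n)
    return mean, std


def normalize(values, mean, std):
    return [(v - mean) / std for v in values]


def unnormalize(values, mean, std):
    return [v * std + mean for v in values]


def lag_windows(values, order=4):
    """Pair each value with the `order` values that precede it."""
    inputs = [tuple(values[i - order:i]) for i in range(order, len(values))]
    targets = list(values[order:])
    return inputs, targets


def returns_to_prices(previous_prices, predicted_returns):
    """Invert a continuously compounded return: p_i = P_{i-1} * exp(r_i)."""
    return [p * math.exp(r) for p, r in zip(previous_prices, predicted_returns)]
```

The scaler is fitted on the training returns only, so the same mean and standard deviation are reused when unnormalizing predictions on the test period.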
Before running the SVR algorithm, we have to choose two parameters: C, the cost of error, and the parameter of the kernel function. The parameters we choose are the same for the different indices. They are listed in Table 7.14.
Table 7.14. Parameters in GARCH experiments for NASM

Indices
Nikkei225    24
DJIA         24
FTSE100      24
Table 7.15 (Nikkei225)

        u(x)   d(x)   MAE      UMAE    DMAE
NASM                  124.37   55.97    68.40
FAAM    0      0.10   141.60   30.70   110.90
        0.02   0.08   131.25   39.02    92.23
        0.04   0.06   125.63   49.66    75.97
        0.06   0.04   123.11   61.81    61.30
        0.08   0.02   124.00   75.63    48.37
        0.10   0      129.19   91.56    37.63

Table 7.16 (DJIA00-02)

        u(x)   d(x)   MAE      UMAE    DMAE
NASM                  129.56   62.74   66.83
FAAM    0      0.10   139.82   41.56   98.26
        0.02   0.08   134.33   49.16   85.17
        0.04   0.06   130.49   57.56   72.93
        0.06   0.04   128.51   66.87   61.64
        0.08   0.02   129.65   77.72   51.94
        0.10   0      133.76   90.02   43.74

Table 7.17 (FTSE100)

        u(x)   d(x)   MAE     UMAE    DMAE
NASM                  69.61   33.42   36.19
FAAM    0      0.10   73.46   25.93   47.53
        0.02   0.08   71.98   28.52   43.46
        0.04   0.06   70.83   31.27   39.56
        0.06   0.04   70.10   34.22   35.88
        0.08   0.02   69.86   37.42   32.45
        0.10   0      70.26   40.92   29.34
Table 7.18

Order   Nikkei225                 DJIA00-02                 FTSE100
        MAE      UMAE    DMAE     MAE      UMAE    DMAE     MAE     UMAE    DMAE
        125.31   53.40   71.91    128.58   61.67   66.91    71.44   33.9    37.53
        125.68   53.31   72.36    130.00   62.08   67.92    71.40   33.46   37.94
        125.67   53.37   72.30    130.56   62.50   68.06    70.41   32.76   37.65
        125.22   52.91   72.31    131.20   62.93   68.27    69.96   32.76   37.20
        125.32   53.08   72.24    131.27   62.90   68.38    70.12   32.89   37.23
        125.40   52.72   72.68    131.32   62.89   68.43    69.99   32.78   37.21
Table 7.15 and (columns 2–4 of) Table 7.18. The predictive error and risks of DJIA00-02 are shown in Fig. 7.13(b), where the corresponding bar values are from Table 7.16 and (columns 5–7 of) Table 7.18. The predictive error and risks of FTSE100 are shown in Fig. 7.15(b), where the corresponding bar values are from Table 7.17 and (columns 8–10 of) Table 7.18.
Fig. 7.12. Experimental results graphs using GARCH method for Nikkei225
Fig. 7.14. Experimental results graphs using GARCH method for DJIA00-02
Fig. 7.16. Experimental results graphs using GARCH method for FTSE100
7.6 Discussions
Having described the experiments and their results, we see that NASM is generally superior to FASM and FAAM. One reason is that NASM captures stock market information and incorporates it into the setting of the margin, which provides helpful information for the prediction. Another reason is that with NASM the margin width is determined by a meaningful value that changes with the stock market. This method is therefore more flexible than the fixed margin cases, and it partially avoids the risk of poor predictive results that arises when the margin values in the fixed margin cases are determined by random selection.
Furthermore, we know that NAAM may be better than NASM. For
example, by adding a momentum, we may not only improve the accuracy
of prediction, but also reduce the predictive downside risk.
Another observation, from Figs. 7.6(a) and 7.7(a), is that with carefully selected parameters the SVR algorithm has predictive performance similar to that of the other models. For a novice, however, the SVR libraries are easy to run. Since every local optimum is the global optimum, the user is guaranteed to find an optimal solution easily and stably. This advantage is very useful for a novice learning a new model or library, and it strengthens the confidence to learn new things, compared with learning other non-linear models, e.g. RBF networks.
In general, our methods can be considered as model selection that determines the margin parameter. We do not consider the setting of the other parameters, such as C and the kernel parameter; we simply use cross-validation to find suitable values for them. However, this procedure is time-consuming. We may add some market information to set these parameters, e.g. [4]. In addition, the margin width set by the GARCH model is too wide; we may need to add more useful terms to shrink it. This can be one direction of our future work. A valuable lesson is that the normalization procedure helps in selecting suitable parameters easily and stably.
Finally, we turn to a key weakness of our model: the predictive model does not lead directly to profit making in real life, and we do not provide confidence measures for these predictive models. However, we may find some useful information by using our model to predict stock market prices; the predictive results may provide some helpful suggestions.
References
1. Gustavo M, de Athayde (2001) Building a Mean-downside Risk Portfolio Frontier. In: Sortino FA, Satchell SE, editors, Managing Downside Risk in Financial Markets: Theory, Practice and Implementation. Oxford, Boston: Butterworth-Heinemann 194–211
2. Baird IS, Howard T (1990) What Is Risk Anyway? Using and Measuring Risk in Strategic Management. In Bettis Richard A and Thomas Howard, editors, Risk, Strategy and Management. Greenwich, Conn: JAI Press 21–51
3. Bollerslev T (1986) Generalized Autoregressive Conditional Heteroskedasticity. Journal of Econometrics 31:307–327
4. Cao LJ, Chua KS, Guan LK (2003) c-Ascending Support Vector Machines for Financial Time Series Forecasting. In International Conference on Computational Intelligence for Financial Engineering (CIFEr2003) 329–335
5. Chang CC, Lin CJ (2001) LIBSVM: A Library for Support Vector Machines
6. Cristianini N, Shawe-Taylor J (2000) An Introduction to Support Vector Machines (and Other Kernel-based Learning Methods). Cambridge, U.K.; New York: Cambridge University Press
7. Hastie T, Rosset S, Tibshirani R, Zhu J (2004) The entire regularization path for the support vector machine. Journal of Machine Learning Research 5:1391–1415
8. Markowitz H (1952) Portfolio Selection. Journal of Finance 7:77–91
9. Mukherjee S, Osuna E, Girosi F (1997) Nonlinear Prediction of Chaotic Time Series Using Support Vector Machines. In Principe J, Giles L, Morgan N, Wilson E, editors, IEEE Workshop on Neural Networks for Signal Processing VII. IEEE Press 511–519
10. Müller KR, Smola A, Rätsch G, Schölkopf B, Kohlmorgen J, Vapnik V (1997) Predicting Time Series with Support Vector Machines. In Gerstner W, Germond A, Hasler M, and Nicoud JD, editors, ICANN. New York, NY: Springer 999–1004
11. Nabney IT (2002) Netlab: Algorithms for Pattern Recognition. New York, NY: Springer
12. Schölkopf B, Chen PH, Lin CJ (2003) A Tutorial on ν-Support Vector Machines. Technical Report, National Taiwan University
13. Schölkopf B, Bartlett P, Smola A, Williamson R (1998) Support Vector Regression with Automatic Accuracy Control. In Niklasson L, Boden M, and Ziemke T, editors, Proceedings of ICANN98 Perspectives in Neural Computing. Berlin 111–116
14. Schölkopf B, Bartlett P, Smola A, Williamson R (1999) Shrinking the Tube: A New Support Vector Regression Algorithm. In Kearns MS, Solla SA, Cohn DA, editors, Advances in Neural Information Processing Systems. Cambridge, MA: The MIT Press 11:330–336
15. Schölkopf B, Smola AJ, Williamson R, Bartlett P (1998) New Support Vector Algorithms. Technical Report NC2-TR-1998-031, GMD and Australian National University
16. Smola A, Schölkopf B (1998) A tutorial on support vector regression. Technical Report NC2-TR-1998-030, NeuroCOLT2
17. Smola AJ, Murata N, Schölkopf B, Müller KR (1998) Asymptotically Optimal Choice of ε-Loss for Support Vector Machines. In Proc. of Seventeenth Intl. Conf. on Artificial Neural Networks
18. Trafalis TB, Ince H (2000) Support Vector Machine for Regression and Applications to Financial Forecasting. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN2000). IEEE 6:348–353
19. Vapnik VN (1999) The Nature of Statistical Learning Theory. New York, NY: Springer, 2nd edition
20. Vapnik VN, Golowich S, Smola AJ (1997) Support Vector Method for Function Approximation, Regression Estimation and Signal Processing. In Mozer M, Jordan M, Petsche T, editors, Advances in Neural Information Processing Systems. Cambridge, MA: The MIT Press 9:281–287
21. Wang G, Yeung DY, Lochovsky FH (2006) Two-dimensional solution path for support vector regression. In The 23rd International Conference on Machine Learning. Pittsburgh, PA: 993–1000
22. Yang H, Chan L, King I (2002) Support Vector Machine Regression for Volatile Stock Market Prediction. In Yin Hujun, Allinson Nigel, Freeman Richard, Keane John, and Hubbard Simon, editors, Intelligent Data Engineering and Automated Learning: IDEAL 2002. New York, NY: Springer 2412 of LNCS: 391–396
23. Yang H, King I, Chan L (2002) Non-fixed and Asymmetrical Margin Approach to Stock Market Prediction Using Support Vector Regression. In International Conference on Neural Information Processing: ICONIP 2002, 1968
8
Conclusion and Future Work
In this chapter, we provide a summary of this book. We review the whole journey of the book, which starts from two schools of learning thought in the machine learning literature and then motivates the resulting combined learning thought, including the Maxi-Min Margin Machine, the Minimum Error Minimax Probability Machine, and their extensions. Following that, we present future perspectives both within the proposed models and beyond the developed approaches.
the projection from the original space to the feature space. This can also be considered as the task of choosing a suitable kernel, which currently attracts much interest in the machine learning community [4, 15].
Another important future direction for the proposed classification models, i.e. the Minimum Error Minimax Probability Machine and the Maxi-Min Margin Machine, is how to extend the current binary classification into multi-way classification. Although the one vs. all and one vs. one [1, 16] approaches are the main tools for this extension, a more systematic and more rigorous approach would be preferable.
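For readers unfamiliar with the one vs. all reduction mentioned above, the sketch below shows its mechanics. The binary learner here is a deliberately simple nearest-centroid scorer chosen only to keep the example self-contained; it is a hypothetical stand-in, not one of the models proposed in this book, and all function names are illustrative.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def train_binary(points, is_positive):
    """Stand-in binary learner: higher score means closer to the positive class."""
    pos = centroid([p for p, y in zip(points, is_positive) if y])
    neg = centroid([p for p, y in zip(points, is_positive) if not y])

    def score(x):
        return math.dist(x, neg) - math.dist(x, pos)

    return score


def train_one_vs_all(points, labels):
    """Train one binary scorer per class, treating all other classes as negative."""
    return {c: train_binary(points, [y == c for y in labels])
            for c in sorted(set(labels))}


def predict(scorers, x):
    """Assign the class whose binary scorer is most confident."""
    return max(scorers, key=lambda c: scorers[c](x))
```

Any binary classifier that returns a real-valued score could replace `train_binary` without changing the reduction itself.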
8.2.2 Beyond the Proposed Models
Although several important models have been motivated and developed from the viewpoint of learning from data both globally and locally, beyond these models there is plenty of work deserving future investigation.
One natural question is whether other well-known local or global models can be extended by adopting the viewpoint of learning from data globally and locally. For example, Neural Networks, a large family of popular learning models, might also be considered as modelling data in a local fashion. It is therefore very interesting to investigate whether global information can also be incorporated into these kinds of learning processes.
Note that the learning discussed in this book is restricted to the framework of either classification or regression tasks. Both tasks belong to so-called supervised learning [5, 9, 18]. However, the other, largely different learning paradigm, unsupervised learning [10, 13, 17], and the recently emerging semi-supervised learning [2, 3, 8, 7] are not considered. Therefore, exploring possible applications of hybrid learning in these fields presents a straightforward and immediate topic for ongoing work.
References
1. Allwein EL, Schapire RE, Singer Y (2000) Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research 1:113–141
2. Altun Y, McAllester D, Belkin M (2005) Maximum margin semi-supervised learning for structured variables. In Advances in Neural Information Processing Systems (NIPS 18)
3. Ando R, Zhang T (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6:1817–1853
4. Bach FR, Lanckriet GRG, Jordan MI (2004) Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of International Conference on Machine Learning (ICML-2004)
5. Bartlett PL (1998) Learning theory and generalization for neural networks and other supervised learning techniques. In Neural Information Processing Systems Tutorial
Index
A
AutoRegression (AR) 143, 147
B
Bayes optimal Hyperplane 33, 38
Bayesian Average Learning 19
Bayesian Optimal Decision 2
Bayes Point Machine 19
Bayesian Networks 1
Biased Classification 33
Biased Minimax Probability Machine (BMPM) 33, 97
C
C4.5 105
Central Limit Theorem 40
Conic Programming 70
Concave-convex FP 36
Conjugate Gradient method 36
Cross validations 91
D
Data Orientation 76
Data Scattering Magnitude 76
Deterministic Annealing 161
Dictionary 127
Distribution-free 32
Divide and Conquer 73
Down-sampling 98
Down Side Mean Absolute Error (DMAE) 142
E
Expectation Maximization (EM) 19
F
Financial time series 129
Fisher Discriminant Analysis (FDA) 77
Fixed and Asymmetrical Margin (FAAM) 137
Fixed and Symmetrical Margin (FASM) 136
Fractional Programming (FP) 36
G
Gabriel Graph 4
Game Theory 32
Gaussian Mixture Models 1
Generalized AutoRegressive Conditionally Heteroscedastic (GARCH) 141
Generative Learning 16
Global Learning 16
Global Modeling 1
H
Hidden Markov Models 1
Hybrid Learning 5, 24
I
Imbalanced Learning 97
Independent, Identically Distributional (i.i.d.) 18
K
k-Nearest-Neighbor 19, 20, 105
Kernelization 45, 84, 125
L
Lagrangian Multiplier 34
Large margin classifiers 22, 69
Line Search 38
Locally and Globally 69
Local Modeling 3
Local Learning 22
Local Support Vector Regression (LSVR) 119, 121
lpp-SVM 72
Lyapunov Condition 40
M
Mahalanobis Distance 72
Markov Chain Monte Carlo 19
Marshall and Olkin Theory 30
Maxi-Min Margin Machine (M4) 6, 25, 69
Maximum A Posterior (MAP) 17
Maximum Conditional Learning 18
Maximum Entropy Estimation 19
N
Naive Bayesian (NB) 16, 102
Non-fixed and Symmetrical Margin (NASM) 137
Non-fixed and Asymmetrical Margin (NAAM) 137
Non-parametric Learning 19
Nonseparable Case 79
O
Over-fitting 23
P
Parametric Method 41
Parzen Window 19, 20
Pseudo-concave Problem 36
Q
R
Reduction 83
Robust Version 43
Rooftop 107
Rosen gradient projection 36
T
Tikhonov's Variation Method 80
U
Unbiased classification 33
Unsupervised Learning 162
Up-sampling 98
Up Side Mean Absolute Error (UMAE) 142
V
Variational Margin Setting 134
VC dimension 24
Vector Recovery Index 65, 100
v-SVR 136
W
Weighted Support Vector Machine 34
Worst-case 32, 38
(n; k; )-bound problem 57