Abstract
This paper presents a novel cluster oriented ensemble classifier. The proposed ensemble
classifier is based on original concepts such as learning of cluster boundaries by the base
classifiers and mapping of cluster confidences to class decisions using a fusion classifier. The
classified data set is partitioned into multiple clusters and fed to a number of distinctive
base classifiers. The base classifiers learn the cluster boundaries and produce cluster confidence
vectors. A second level fusion classifier combines the cluster confidences and maps them to class
decisions. The proposed ensemble classifier thus modifies the learning domain for the base
classifiers. Experiments have been conducted on benchmark data sets from the UCI machine learning
repository to identify the impact of multi-cluster boundaries on classifier learning and
classification accuracy. The experimental results and two–tailed sign test demonstrate the
superiority of the proposed cluster oriented ensemble classifier.
1. Introduction
An ensemble classifier consists of a number of base classifiers that separately learn the class boundaries over the patterns in a training set. The decision of an
ensemble classifier on a test pattern is produced by fusing the individual decisions of the base
classifiers. Ensemble classifiers are also known as multiple classifier systems, committee of
classifiers and mixture of experts [1]. An ensemble classifier produces more accurate
classification than its individual counterparts provided the base classifier errors are
uncorrelated [3].
Contemporary ensemble generation techniques train the base classifiers on different subsets
of the training data in order to make their errors uncorrelated. The different algorithms
including bagging [4] and boosting [7] vary in terms of generating the training subsets for
base classifier training. The decisions of the base classifiers are fused into a single decision
by using either majority voting on discrete decisions [1] or algebraic combiners [15] on
continuous confidence values. Although these ensemble generation techniques
(detailed in Section 2) are capable of making the base classifier errors uncorrelated, they fail
to establish any mechanism to improve the learning domain of the individual base classifiers.
To clarify this concern let us consider a real world data set with overlapping patterns from
different classes. The learning of class boundaries between overlapping class patterns in such
cases is a difficult problem. Excessive training of the base classifiers will lead to accurate
learning of the decision boundary but will result in overfitting, and thus misclassification of
test instances. On the other hand, learning generalized boundaries will avoid overfitting but at the
cost of always misclassifying some overlapping patterns. This problem of learning the class
boundaries of overlapping patterns remains inherent in all the base classifiers and is
propagated to the decision fusion stage as well even though the base classifier errors are
uncorrelated.
We opt to bring in clustering at this point. Clustering is the process of partitioning a data set
into multiple groups where each group contains data points that are very close in Euclidean
space. The clusters have well defined and easy to learn boundaries. Let us assume that the
patterns are labelled with their cluster number. Now if the base classifiers are trained on the
modified data set they will learn the cluster boundaries. As the clusters have well defined,
easy to learn boundaries, the base classifiers can learn them with high accuracy. Clusters can
contain overlapping patterns from multiple classes. A fusion classifier can be trained to
predict the class of a pattern from the predicted cluster. The proposed cluster oriented
ensemble classifier builds on this idea.
With the aim to achieve better learning and improved accuracy of the ensemble classifier, in
this paper we propose an ensemble classifier approach that partitions classified data into
multiple clusters, learns the decision boundaries between the clusters using a set of base
classifiers, and combines the cluster decisions produced by the base classifiers into class
decisions using a fusion classifier trained on the cluster confidences produced by
the base classifiers. The fusion classifier maps the clustering pattern produced by the base
classifiers into a class decision. Altogether the ensemble of base and fusion classifiers aims at
better learning, leading to higher classification accuracy as evidenced by the experimental
results.
While achieving the above mentioned aim, the research presented in this paper would like to
find out the answers to four major research questions. The first research question is to
investigate the difference between heterogeneous clustering (i.e. clustering all the patterns from different classes) and homogeneous clustering
(i.e. clustering patterns within a class). The second research question is to investigate whether
the ensemble classifier outperforms the base classifiers significantly. The third research
question is to find out the impact of the fusion classifier. The final research question is to find the
standing of the proposed ensemble classifier with respect to other ensemble classifiers on
benchmark data sets.
This paper is organized as follows. Section 2 presents the literature review. The proposed
approach is described in Sections 3 and 4. Section 5 describes the experimental setup used for evaluating the proposed approach.
Section 6 presents the results and comparative analysis. Finally, Section 7 concludes the
paper.
2. Literature Review
The major concentration of ensemble classifier research [1]–[2] is on (i) generation of base
classifiers for achieving diversity among them, and (ii) methods for fusing the decision of the
base classifiers. Two classifiers are diverse if they make different errors on different instances.
The ultimate objective of diversity is to make the base classifiers as unique as possible with respect to the errors they make.
Bagging [4][6] is a sampling based ensemble classifier generation approach that was
introduced by Breiman. Bagging generates the multiple base classifiers by training them on
data subsets randomly drawn (with replacement) from the entire training set. The decisions of
the base classifiers are combined into the final decision by majority voting. The sampling
procedure of bagging creates the various training subsets by bootstrap sampling which results
in the diversity among the base classifiers. Bagging is suitable for small data sets. For large
data sets however the sampling scheme based on the bootstrap with replicates of the training
set is infeasible. Moreover, the randomness introduced by the sampling process in bagging
cannot guarantee the performance of the overall ensemble classifier. A number of variations
to bagging are observed in the literature to improve its performance; the list includes
random forests [5], ordered aggregation [11], and an adaptive generation and aggregation approach [14].
Schapire proposed a method called boosting [7][8] that creates data subsets for base classifier
training by re-sampling the training data, however, by providing the most informative
training data for each consecutive classifier. In boosting each of the training instances is
assigned a weight that determines how well the instance was classified in the previous
iteration. The subsets of the training data that are badly classified (i.e. instances with higher
weights) are included in the training set for the next iteration. This way boosting pays more
attention to instances that are hard to classify. Although boosting identifies difficult to
classify instances, it does not provide any mechanism to improve the learning of the base
classifiers on these instances. The problem of base classifier learning that is raised by
overlapping patterns still remains (as mentioned in the previous section), and leads to poor
ensemble performance. A number of variations of boosting are reported in the
literature, including boosting recombined weak classifiers [12] and weighted instance selection [10].
Random subspace [9] is an ensemble creation method that uses feature sub sets to create the
different data subsets to train the base classifiers. Maclin and Shavlik proposed a neural
ensemble [22] where a number of new approaches are presented to initialise the network
weights in order to achieve diversity and generalization. Pujol and Masip presented a binary
discriminative learning technique [23] based on the approximation of the non-linear decision
boundary by a piece-wise linear smooth additive model. Chaudhuri et. al. presented a hybrid
ensemble model [24] that combines the strengths of parametric and non–parametric
classifiers. In recent times there have been some works on cluster ensembles that aim to
obtain improved clustering of a data set by combining multiple partitionings of the data set
[25]. Note that the focus of an ensemble classifier is to obtain improved classification accuracy,
which is significantly different from cluster ensembles that aim to achieve improved clustering
accuracy.
The other key aspect of an ensemble classifier is the fusion of base classifier outputs into class
decisions. The mapping can be done on discrete class decisions or continuous class
confidence values produced by the base classifiers. The commonly used fusion methods [1]
for combining class labels are majority voting, weighted majority voting, behaviour
knowledge space, and Borda count. The commonly used fusion methods for combining
continuous outputs are algebraic combiners [15] including mean rule, weighted average,
trimmed mean, min/max/median rule, product rule, and generalized mean. A number of other
fusion rules include decision template [16], pair-wise fusion matrix [17], adaptive fusion
method [18], and non–Bayesian probabilistic fusion [19]. Note that all these approaches are
designed to fuse the class decisions from the base classifiers into a single class decision.
Summarizing, the contemporary ensemble classifier generation methods are able to produce
diversity among the base classifiers by making their errors uncorrelated. They however do not
provide any mechanism to improve the learning process of the individual base classifiers on
difficult to classify overlapping patterns. The proposed ensemble classifier aims to address
this issue by creating multiple boundaries through data clustering, training the base classifiers
on easy to learn cluster boundaries and handling the cluster to class mapping process by a
fusion classifier. The overall philosophy of the proposed approach is presented in the
following section.
3.1 Motivation
The decision boundaries in real world data sets are not simple. This is primarily because of
overlapping patterns from different classes in the data set. As a result the learning of decision
boundaries in such data sets leads to either overfitting or poor generalization. In both cases it
causes classification errors. The situation is explained in Figure 1. The data set in Figure 1(a)
contains overlapping patterns from two classes. Accurate learning from the training data by a
generic classifier will result in the class boundaries in Figure 1(b), leading to overfitting and thus
reduced penalties for misclassification during training. Alternatively, the classifier can learn a
generalized boundary. In this case the generic classifier will learn simple decision boundaries
(Figure 1(c)) but will cause misclassification of training as well as test patterns.
This is the point where we would like to introduce multiple decision boundaries for each
class through clustering. Clustering is the process of grouping similar patterns. Clustering the
data set in Figure 1(a) with overlapping patterns will result in smaller groups of patterns as in
Figure 1(d). Note that the cluster boundaries (Figure 1(e)) are simple and easy to learn. A
generic classifier, if trained on the clustered data, now learns simple cluster boundaries that cause
neither overfitting nor extreme generalization. Cluster to class mapping can be done by a fusion
classifier. The underlying theoretical model and the methodologies of the proposed ensemble
classifier are based on the above observation and are presented below.
Figure 1: Impact of clustering on an example data set consisting of two classes. (a) The
original data set with overlapping patterns, (b) Overfitting caused by accurate learning of the
decision boundaries, (c) Generalized decision boundary with overlapping patterns of class
two considered as part of class one, (d) Clustered data set, and (e) Decision boundaries
learned on clustered data set.
Let the ensemble classifier be composed of a set of $N_b$ base classifiers $\psi_1, \psi_2, \ldots, \psi_{N_b}$ and a
fusion classifier $\varphi$. Given a pattern $\mathbf{x}$, the ensemble classifier $f$ can be defined to achieve the
following mapping:

$$ f: \mathbf{x} \rightarrow [t_1, \ldots, t_{N_c}], \qquad (1) $$

where $t_1, \ldots, t_{N_c}$ are class confidence values for the $N_c$ classes. The base and fusion classifiers
are defined as follows. Assuming that the data set is partitioned into $K$ clusters, each pattern belongs to a cluster. The
base classifier $\psi$ is set to map the input pattern $\mathbf{x}$ to a set of cluster confidence measures
$w_1, \ldots, w_K$ as

$$ \psi: \mathbf{x} \rightarrow [w_1, \ldots, w_K]. \qquad (2) $$
The training set $\Gamma_\psi$ of a base classifier $\psi$ is made of pairs $(\mathbf{x}, [w_1, \ldots, w_K])$ where $\mathbf{x}$ represents
the input and $[w_1, \ldots, w_K]$ represents the target. Given that $\mathbf{x}$ belongs to cluster $k$, the target is set as

$$ w_j = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{otherwise} \end{cases} \qquad (3) $$

The parameters $\theta_\psi$ of the base classifier are optimized such that the error function over the training set is minimized,

$$ \theta_\psi = \arg\min_{\theta} \varepsilon_\psi(\theta), \qquad (4) $$

where $\varepsilon_\psi$ is the error function. Let $\psi_{\theta_\psi}(\mathbf{x}) = [\gamma_1, \ldots, \gamma_K]$. The error function $\varepsilon_\psi$ for the base
classifier is defined as

$$ \varepsilon_\psi = \sum_{j=1}^{K} |w_j - \gamma_j|. \qquad (5) $$
Given the cluster confidence vectors produced by the base classifiers, the fusion classifier $\varphi$ achieves the mapping

$$ \varphi: [w_1^1, \ldots, w_K^1, \ldots, w_1^{N_b}, \ldots, w_K^{N_b}] \rightarrow [t_1, \ldots, t_{N_c}], \qquad (6) $$

where $w_1^b, \ldots, w_K^b$ are the cluster confidence measures produced by base classifier $\psi_b$ and
$t_1, \ldots, t_{N_c}$ are the class confidence values. The training set $\Gamma_\varphi$ for the fusion classifier is composed
of pairs $(\mathbf{w}, [t_1, \ldots, t_{N_c}])$ where $\mathbf{w}$ is the combined cluster confidence vector and $[t_1, \ldots, t_{N_c}]$ is the target
class confidence vector. A cluster can contain patterns from multiple classes and in that case
a unique mapping is not possible by the fusion classifier. Depending on the number of classes
$N_c$, each class deserves a share of the cluster. There are thus a total of $N_c$ outputs/targets of
the fusion classifier, each representing a class, and each target receives a weight during
training according to the proportion of its patterns in the cluster. Let the cluster confidence
vectors produced by the base classifiers in (6) correspond to cluster $k$ that contains $n_j$ patterns
of class $j$, where $1 \le j \le N_c$. The target class confidence for the $j$-th class is set as

$$ t_j = \frac{n_j}{\sum_{j=1}^{N_c} n_j}. \qquad (7) $$
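As a small worked illustration of (7) (the counts are hypothetical, not taken from the experiments): a non–atomic cluster in a two–class problem containing six patterns of class 1 and four patterns of class 2 yields

```latex
% hypothetical counts: n_1 = 6, n_2 = 4
t_1 = \frac{n_1}{n_1 + n_2} = \frac{6}{10} = 0.6, \qquad
t_2 = \frac{n_2}{n_1 + n_2} = \frac{4}{10} = 0.4, \qquad
[t_1,\ t_2] = [0.6,\ 0.4].
```

so a pattern of that cluster trains the fusion classifier towards both classes in proportion to their presence.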
The parameters $\theta_\varphi$ for the fusion classifier $\varphi$ are optimized such that the error function is minimized,

$$ \theta_\varphi = \arg\min_{\theta} \varepsilon_\varphi(\theta), \qquad (8) $$

where $\varepsilon_\varphi$ is the error function. Assuming $\varphi_{\theta_\varphi}(\mathbf{w}) = [\eta_1, \ldots, \eta_{N_c}]$, the error function $\varepsilon_\varphi$ is
defined as

$$ \varepsilon_\varphi = \sum_{j=1}^{N_c} |t_j - \eta_j|. \qquad (9) $$
Using (2) and (6), the ensemble classifier mapping in (1) can be enumerated as

$$ f(\mathbf{x}) = \varphi\big(\psi_1(\mathbf{x}), \ldots, \psi_{N_b}(\mathbf{x})\big) = [t_1, \ldots, t_{N_c}]. \qquad (10) $$
The proposed ensemble classifier is based on the above model and the corresponding architecture
is presented in Figure 2.
[Figure 2: Architecture of the proposed ensemble classifier — the input pattern x is fed to the base classifiers ψ_1, …, ψ_{N_b}, each producing a cluster confidence vector; the fusion classifier φ maps the combined vector to the class confidence vector t_1, …, t_{N_c}.]

The objective of the proposed Cluster Oriented Ensemble Classifier (COEC) is to improve
the learning process as well as the overall prediction accuracy by partitioning the data set,
learning cluster boundaries by the base classifiers and mapping base classifiers’ output to
class confidence vector using a fusion classifier. The novelty of the proposed method lies in:
(i) Partitioning classified data into multiple clusters for achieving better separation.
(ii) Learning of the cluster boundaries by the base classifiers.
(iii) Fusion of the cluster confidence values produced by the base classifiers into class decisions using a fusion classifier.
The learning of the base and fusion classifiers in COEC depends on the multiple class boundaries
produced by clustering. The outcome of the clustering algorithm depends on the similarity
measure $\Delta$ between the patterns, and we have used the Euclidean distance that computes the
geometric distance between two patterns $\mathbf{x}_i = \langle x_{i1}, x_{i2}, \ldots, x_{in} \rangle$ and
$\mathbf{x}_j = \langle x_{j1}, x_{j2}, \ldots, x_{jn} \rangle$ in $n$–dimensional hyperspace. We performed two types of clustering
in COEC:
(i) Heterogeneous clustering to partition all the patterns in the training set independent of any class information.
(ii) Homogeneous clustering for partitioning the patterns belonging to a single class only.
The characteristics and outcomes of the two types of clustering are significantly different and are analysed in Section 6.
The clustering algorithm partitions the patterns into $K$ clusters $\Omega_1, \ldots, \Omega_K$ with cluster centres $\boldsymbol{\kappa}_1, \ldots, \boldsymbol{\kappa}_K$ by minimizing

$$ J = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in \Omega_k} \Delta(\mathbf{x}_i, \boldsymbol{\kappa}_k), \qquad (11) $$

for the patterns in the corresponding training set. Considering an augmented training set $\dot{\Gamma}$
in which each pattern is labelled with its cluster, a base classifier $\psi$ learns the decision boundaries between the clusters and produces the cluster
confidence vector $w_1, \ldots, w_K$. The fusion classifier maps the cluster confidence vector to the class confidence vector.
The performance of the fusion classifier depends on the content of the cluster. If all the
patterns in the cluster belong to the same class the mapping is unique. We refer to these
clusters as atomic clusters. Non–atomic clusters are composed of patterns from different
classes. The target vector of the fusion classifier for these clusters is set according to the proportion of class patterns within the cluster, as defined in (7).
The overall learning and prediction methodology of COEC is presented in Figure 3 and
Figure 4. The learning process is depicted in Figure 3 where the training data is first clustered
and the base classifiers then learn the mapping from patterns to clusters. The cluster
confidence values produced by the different base classifiers are then merged to form the
inputs for the fusion classifier and the targets are set to the original class values for learning
the cluster to class map. During prediction (Figure 4), the base classifiers produce cluster
confidence vectors for a test pattern. These vectors are merged to form the input for the fusion classifier, which produces the final class decision.
The different steps of learning and prediction of the ensemble of classifiers are detailed in the
following sections.
The learning process starts by partitioning the training data into multiple clusters. Given the
training data set $[x_{ij}]$, formed by augmenting the feature matrix $[d_{ij}]$ with the class labels $[class_i]$,
where $1 \le i \le N_{examples}$ and $1 \le j \le N_{features}$, the purpose of the clustering algorithm is to partition the training data set into
$N_{clusters}$ clusters. The output of the clustering algorithm is the modified data set $[y_{ij}]$ in which
$[d_{ij}]$ is augmented with the cluster labels $[cluster_i]$. Given the training data set, the clustering algorithm is presented in Figure
5. At the completion of clustering, each row of $[d_{ij}]$ is augmented with a cluster id, producing $[y_{ij}]$.
The output of the clustering algorithm depends on the input argument type. We have used two
types: (i) homogeneous clustering, where clustering is performed separately on the patterns belonging to the same class, and (ii) heterogeneous clustering, where
clustering is performed on the entire data set. We have reported our findings on both of these clustering types in Section 6.
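A minimal sketch of the two clustering modes is given below. It assumes scikit-learn's KMeans as the concrete k–means implementation; the function names and the offsetting of homogeneous cluster ids are illustrative choices, not the paper's exact procedure (Figure 5).

```python
import numpy as np
from sklearn.cluster import KMeans

def heterogeneous_clustering(d, n_clusters, seed=0):
    """Cluster the whole training matrix d (N_examples x N_features),
    ignoring class labels, and return one cluster id per row."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(d)

def homogeneous_clustering(d, class_labels, clusters_per_class, seed=0):
    """Cluster the patterns of each class separately; cluster ids of
    different classes are offset so they never overlap."""
    cluster_ids = np.empty(len(d), dtype=int)
    offset = 0
    for c in np.unique(class_labels):
        idx = np.where(class_labels == c)[0]
        km = KMeans(n_clusters=clusters_per_class, n_init=10, random_state=seed)
        cluster_ids[idx] = km.fit_predict(d[idx]) + offset
        offset += clusters_per_class
    return cluster_ids
```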
A set of $N_b$ base classifiers is trained with $[y_{ij}]$ (the feature matrix $[d_{ij}]$ augmented with the cluster labels $[cluster_i]$) as produced by the
clustering algorithm. The input to each base classifier is set to $[d_{ij}]$. The target for each base classifier is set as

$$ t_{ik} = \begin{cases} 1 & \text{if } cluster_i = k \\ 0 & \text{otherwise} \end{cases} \qquad (12) $$

where $1 \le k \le N_{clusters}$. The aim of training the base classifiers with the target cluster
matrix is that during prediction the base classifiers produce cluster confidence values for a
pattern. The training parameters for each base classifier are optimized to fit the training data.
The training algorithm for a generic classifier is presented in Figure 6. At the completion of
training, a model $\theta_b$ is obtained for each base classifier $b$, where $1 \le b \le N_b$, and $[d_{ij}]$ is
presented to each of the base classifiers, producing a set of $N_b$ cluster confidence matrices
$\{[w_{ik}^b]\}$ for the training patterns, where $1 \le b \le N_b$ and $1 \le k \le N_{clusters}$.
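A hedged sketch of this base classifier training step, using scikit-learn's k–NN, multilayer perceptron and RBF–kernel SVM as stand-ins for the paper's base classifiers (the parameter values are illustrative, not the hand-tuned ones of Section 5):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def one_hot_cluster_targets(cluster_ids, n_clusters):
    """Target cluster matrix of (12): t[i, k] = 1 iff pattern i belongs to cluster k."""
    t = np.zeros((len(cluster_ids), n_clusters))
    t[np.arange(len(cluster_ids)), cluster_ids] = 1.0
    return t

def train_base_classifiers(d, cluster_ids):
    """Train distinct base classifiers on the cluster labels and return, for each,
    a cluster confidence matrix for the training patterns (rows sum to one;
    columns follow clf.classes_, i.e. cluster ids 0..K-1 when all are present)."""
    base = [KNeighborsClassifier(n_neighbors=5),
            MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
            SVC(kernel="rbf", probability=True, random_state=0)]
    confidences = []
    for clf in base:
        clf.fit(d, cluster_ids)                    # learn the cluster boundaries
        confidences.append(clf.predict_proba(d))   # cluster confidence vectors
    return base, confidences
```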
The confidence matrices produced by the base classifiers are combined to form the input to
the fusion classifier $\varphi$, where $1 \le i \le N_{examples}$ and $1 \le k \le N_{clusters}$. The target matrix for
$\varphi$ is composed of class confidence vectors that are set according to the proportion of class
instances within each cluster, as in (7). The parameters for the fusion classifier $\varphi$ are optimized to fit the
above input–output pattern produced by the training examples. At the completion of training a
model for the ensemble classifier $\theta_\varphi$ is obtained. The training algorithm for the fusion classifier follows the generic procedure of Figure 6.
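A corresponding sketch of the fusion classifier training, again assuming scikit-learn (an MLPRegressor standing in for the paper's neural-network fusion classifier) and the hypothetical helpers above; the targets follow the proportion rule of (7):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def proportion_targets(cluster_ids, class_labels, n_clusters, n_classes):
    """Per-cluster class proportions as in (7), expanded to one target row per pattern.
    Assumes every cluster contains at least one training pattern."""
    per_cluster = np.zeros((n_clusters, n_classes))
    for k, c in zip(cluster_ids, class_labels):
        per_cluster[k, c] += 1.0
    per_cluster /= per_cluster.sum(axis=1, keepdims=True)
    return per_cluster[cluster_ids]

def train_fusion_classifier(confidences, cluster_ids, class_labels, n_clusters, n_classes):
    """Map the concatenated cluster confidences of all base classifiers to class confidences."""
    fusion_input = np.hstack(confidences)      # one row per training pattern
    fusion_target = proportion_targets(cluster_ids, class_labels, n_clusters, n_classes)
    fusion = MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000, random_state=0)
    fusion.fit(fusion_input, fusion_target)
    return fusion
```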
4.4 Prediction
The test pattern $\mathbf{e} = \langle e_1, \ldots, e_{N_{features}} \rangle$ is presented to each of the base classifiers. Each
base classifier $b$ produces $N_{clusters}$ different confidence values $\langle w_1^b, \ldots, w_{N_{clusters}}^b \rangle$ that
indicate the possibility of the pattern belonging to the different clusters. The cluster
confidence vectors produced by the different base classifiers are combined to produce
$\langle w_1^1, \ldots, w_{N_{clusters}}^1, \ldots, w_1^{N_b}, \ldots, w_{N_{clusters}}^{N_b} \rangle$, which forms the input to the fusion classifier. At the
output the fusion classifier produces the class confidence values $\langle \eta_1, \ldots, \eta_{N_{classes}} \rangle$ that
indicate the possibility of the example belonging to the different classes. The ensemble classifier
assigns the test pattern to the class with the maximum confidence value.
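Prediction then reduces to a forward pass through both levels; a sketch reusing the hypothetical helpers above:

```python
import numpy as np

def coec_predict(base_classifiers, fusion, test_patterns):
    """Base classifiers emit cluster confidences, the fusion classifier maps the
    concatenated vector to class confidences, and the maximum gives the class."""
    fusion_input = np.hstack([clf.predict_proba(test_patterns)
                              for clf in base_classifiers])
    class_confidences = fusion.predict(fusion_input)   # one row per test pattern
    return np.argmax(class_confidences, axis=1)
```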
5. Experimental Setup
We have conducted a number of experiments on benchmark data sets from UCI machine
learning repository [27] to verify the strength of COEC and investigate the research questions
mentioned in Section 1. We have used the same data sets as used in recently published
research [10]–[12][17] so that the results can be easily compared. A summary of the data sets
is presented in Table 1. The Wine data set has well defined training and test sets, so as
directed by the description of the data set [27], we have used them as they are. We have used 10–fold
cross validation for reporting the classification results for all the other data sets.
We used the k–means clustering algorithm [26] for partitioning the data sets. Two types of
clustering were performed: (i) heterogeneous clustering: conventional clustering of the entire
data set into k clusters where a cluster can contain examples of more than one class. The
target for the fusion classifier is set as per the proportions of the class examples within each
cluster; (ii) homogeneous clustering: examples of a single class are partitioned into k clusters.
The target of the fusion classifier is set to the class for which the clustering is performed. We
have reported the impact of both types of clustering on ensemble classifier accuracy and learning in Section 6.
We have investigated the proposed ensemble classifier by incorporating three well known
and distinct classifiers, namely k Nearest Neighbour (k–NN), Neural Network (NN), and
Support Vector Machine (SVM), as the base classifiers. A Neural Network is used as the
fusion classifier. The neural networks for small data sets are trained using a single hidden
layer and tan sigmoid activation functions for the neurons. The Levenberg–Marquardt
backpropagation method is used for learning of the weights in these cases. Larger data sets
are however learned with log sigmoid activation function and gradient descent training
function. We have used the radial basis kernel for SVM and the libsvm library [28] in all the
experiments. The different parameters for the classifiers (e.g. k in k–NN classifier, sigma in
RBF kernel of SVM, and epochs, RMS error goal, learning rate in neural network) were
hand tuned for different data sets. The classification accuracies of bagging, boosting and
random subspace on the data sets in Table 1 are obtained from [17] and WEKA [31]. All the results reported in the following sections are obtained using these settings.
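The pieces above can be tied together in a 10–fold cross validation loop; this is a rough sketch under the same scikit-learn assumption, reusing the hypothetical helper functions from the earlier sketches (it is not the exact experimental protocol or tuning used in the paper):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def evaluate_coec(d, class_labels, n_clusters, n_classes, n_folds=10):
    """10-fold cross validation of COEC using the helper sketches above
    (heterogeneous_clustering, train_base_classifiers, train_fusion_classifier,
    coec_predict); parameter values are illustrative."""
    accuracies = []
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(d, class_labels):
        cluster_ids = heterogeneous_clustering(d[train_idx], n_clusters)
        base, confidences = train_base_classifiers(d[train_idx], cluster_ids)
        fusion = train_fusion_classifier(confidences, cluster_ids,
                                         class_labels[train_idx],
                                         n_clusters, n_classes)
        predictions = coec_predict(base, fusion, d[test_idx])
        accuracies.append(accuracy_score(class_labels[test_idx], predictions))
    return np.mean(accuracies)
```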
Given a set of training examples the heterogeneous clustering partitions the entire data set. In
a data set where examples of different classes are well separated in Euclidean space,
heterogeneous clustering will produce partitions each containing examples from one class
only. We use the term atomic cluster to refer to a partition containing examples from a single
class. Most of the real world data sets however contain overlapping examples from different
classes. It is thus likely to observe mostly non–atomic clusters (clusters containing examples
from multiple classes) when the data set is partitioned using heterogeneous clustering where
the number of clusters equals the number of classes. Figure 9 represents a set of co–
occurrence matrices that are obtained from different data sets by counting the number of
instances of each class belonging to a particular cluster when the data sets are partitioned into a number of clusters equal to the number of classes.
[Figure 9: Co–occurrence matrices (counts of class instances per cluster) for the Ionosphere, Sonar, Iris and Wine data sets when the number of clusters equals the number of classes; the numeric entries are not reproduced here.]
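The co-occurrence counts underlying Figures 9 and 10 are straightforward to compute; a small sketch (the function name is illustrative):

```python
import numpy as np

def cooccurrence_matrix(cluster_ids, class_labels, n_clusters, n_classes):
    """Count how many patterns of each class fall into each cluster;
    a row with a single non-zero entry corresponds to an atomic cluster."""
    counts = np.zeros((n_clusters, n_classes), dtype=int)
    for k, c in zip(cluster_ids, class_labels):
        counts[k, c] += 1
    return counts
```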
Note from Figure 9 that in Ionosphere and Sonar data sets each cluster contains examples
from multiple classes. This implies overlapping data points in these data sets. Nearly atomic
and atomic clusters are obtained for the Iris data set at the second and third clusters
respectively. The first cluster however contains overlapping examples from class 2 and class
3. Clustering these data sets into a higher number of partitions will lead to a higher number of
atomic or nearly atomic clusters, leading to better learning of the ensemble classifier. The
clusters produced for the Wine data set are however either atomic or nearly atomic. It is easier
to produce the cluster to class mapping for these clusters by the fusion classifier in COEC.
Clustering further is unlikely to provide any benefit for the ensemble classifier learning for
such data sets. Figure 10 represents the co–occurrence matrices when the Ionosphere, Sonar
and Iris data sets are partitioned into higher number of clusters.
It can be observed from Figure 10 that the higher number of clusters improves the learning
scenario for all the data sets. Six out of ten clusters in the Ionosphere data set are atomic and
two clusters are near atomic. Four clusters are atomic and three clusters are near atomic for
Sonar data set. All the clusters are either atomic or near atomic for the Iris data set. These results
imply that a higher number of clusters in heterogeneous clustering produces significant numbers
of atomic and near atomic clusters, and it becomes easier for the fusion classifier in COEC to learn the cluster to class mapping.
[Figure 10: Co–occurrence matrices for the Ionosphere, Sonar and Iris data sets when partitioned into a higher number of clusters; the numeric entries are not reproduced here.]
Figure 11 presents the classification accuracies of the data sets in Table 1 at different numbers
of clusters using heterogeneous clustering in COEC. The best classification accuracies for all the
data sets are obtained at a number of clusters greater than the number of classes. As the
clusters have well defined boundaries the base classifiers learn cluster boundaries easily.
Higher number of clusters produces mostly atomic and near–atomic clusters for data sets like
Iris, Ionosphere and Sonar (Figure 10). As a result the fusion classifiers learn the cluster to
class maps with high accuracy resulting in better classification performance of the COEC.
Data sets like Wine have class patterns that are already well separated (Figure 9) and further
clustering does not significantly improve the classification performance of the COEC.
Figure 11: Heterogeneous clustering in COEC at different numbers of clusters on the test cases
of the data sets in Table 1.
Homogeneous clustering partitions the examples belonging to a single class only and ignores
the instances of other classes. Consider the partitioning of the data sets in Figure 9 using
homogeneous clustering; the resulting co–occurrence matrices are presented in Figure 12, considering two clusters for each class. The total number of clusters equals the
number of classes times the number of clusters per class. Note that all the clusters are atomic
in nature.
[Figure 12: Co–occurrence matrices for the Ionosphere, Sonar, Iris and Wine data sets under homogeneous clustering with two clusters per class; every cluster contains examples of a single class. Numeric entries are not reproduced here.]
Figure 13 presents the classification accuracies on the data sets in Table 1 at different numbers of
clusters using homogeneous clustering. Here n clusters imply a total of
n×number_of_classes clusters in the data set. For example, the Vehicle data set has four
classes and the four clusters in Figure 13 mean 4×4=16 clusters in the data set. Too many
clusters in small data sets imply a small number of patterns in each cluster, which leads to poor
learning of the fusion classifier in COEC. This explains the fall in accuracy at higher numbers of clusters.
6.1.3 Comparison
Homogeneous clustering handles data sets with overlapping class patterns better than heterogeneous clustering. For clarification, consider an artificial data set in Figure 14. The data set contains
overlapping patterns from multiple classes. Heterogeneous clustering is likely to produce the
partitions presented in Figure 14(b) where a large cluster is non–atomic. Even with higher
number of partitions the situation is unlikely to change or the produced clusters will be
random with each being non–atomic. The partitions produced by homogeneous clustering
under identical situation are presented in Figure 14(c). Note that all the clusters are atomic in
nature. The groups within each cluster are well separated geometrically for the data set. As
data is clustered class wise, the cluster to class mapping becomes easier for the fusion classifier.
Figure 14: Clustering of an artificial data set with overlapping data points using
homogeneous and heterogeneous clustering.
To verify the above observation we have conducted a set of classification experiments on the
data sets in Table 1 using both homogeneous and heterogeneous clustering with COEC. The
10–fold cross validation results on the test sets are presented in Table 2. It can be observed
that homogeneous clustering performs better than heterogeneous clustering (by 14.38% on
average) with COEC. These real world data sets contain significantly overlapping patterns and
homogeneous clustering handles them better, as evidenced from Table 2. To validate this claim, we define the null and alternative hypothesis
as follows:

Null Hypothesis: Homogeneous clustering is equivalent to heterogeneous clustering when used with COEC.
Alternative Hypothesis: Homogeneous clustering is significantly better than heterogeneous clustering when used with COEC.

Note that the Null Hypothesis is rejected at the 0.05 significance level by the two–tailed sign test on the comparative classification performances in Table 2.
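The two-tailed sign test used here and in the following subsections can be reproduced by counting wins over the data sets and applying a two-sided binomial test; a sketch assuming SciPy's binomtest (ties are discarded, as is usual for the sign test):

```python
from scipy.stats import binomtest

def two_tailed_sign_test(acc_a, acc_b):
    """Two-tailed sign test on paired accuracies of two methods over several data sets.
    Returns the p-value; reject the null hypothesis of equivalence if p < 0.05."""
    wins_a = sum(a > b for a, b in zip(acc_a, acc_b))
    wins_b = sum(b > a for a, b in zip(acc_a, acc_b))
    n = wins_a + wins_b                      # ties contribute nothing
    return binomtest(wins_a, n, p=0.5, alternative="two-sided").pvalue
```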
Note that the performance of COEC with clustering depends on the number of clusters. The
number of clusters therefore needs to be selected to obtain the best classification
accuracy. We have adopted a step wise search method by changing the number of clusters
within a limited range and observing its influence on classification accuracy. The actual
number of clusters is a function of the number of patterns in the data set and it is thus
required that a wider range of numbers of clusters be considered for finding the optimal number of clusters.
In order to ascertain the impact of clusters on diversity we have computed the errors made by
the base classifiers as we change the number of clusters in COEC. Figure 15 represents the
errors made by the k–NN, NN and SVM base classifiers as the number of clusters changes. Note
that the base classifier errors at each cluster count are different for all the data sets. This is possible
only if the base classifiers make different errors on identical patterns. This implies that the
errors made by the base classifiers are not correlated, which in turn reflects the diversity among the base classifiers.
Figure 15: Change in errors made by the base classifiers as the number of clusters changes in COEC.
The errors are normalized within a range of zero to one.
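The per-classifier errors plotted in Figure 15 can be obtained from the cluster predictions on the training (or validation) patterns; a sketch, with one plausible min–max normalization to the [0, 1] range mentioned in the caption:

```python
import numpy as np

def base_classifier_errors(base_classifiers, d, cluster_ids):
    """Fraction of patterns each base classifier assigns to the wrong cluster."""
    errors = np.array([np.mean(clf.predict(d) != cluster_ids)
                       for clf in base_classifiers])
    # one plausible min-max normalization of the errors into [0, 1]
    span = errors.max() - errors.min()
    return (errors - errors.min()) / span if span > 0 else np.zeros_like(errors)
```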
Table 3 represents a comparative analysis of the classification performance of COEC and the
corresponding base classifiers. Note that different base classifiers achieve different accuracies
on the data sets. This indicates that the errors made by the base classifiers are
different and diversity among the base classifiers is achieved in COEC. On average COEC
performs 3.92% better than k–NN, 7.26% better than NN and 7.39% better than SVM as the
base classifiers. The fusion classifier combines the decisions from the base classifiers to find
the best possible verdict, and this can be attributed to the better performance of COEC. In
order to validate the claims we define the null and alternative hypotheses for each classifier
pair in Table 4. Note that the null hypothesis is rejected at the 0.05 significance level by the two–
tailed sign test for each classifier pair in Table 4, implying that COEC performs significantly better than each of its base classifiers.
COEC vs. SVM — Null Hypothesis: COEC is equivalent to the base SVM classifier. Alternative Hypothesis: COEC is significantly better than the base SVM classifier. Sign test: Null Hypothesis rejected at the 0.05 significance level from the comparative classification performances of COEC and SVM in Table 3.
We also conducted a classification experiment on the entire data sets with the base classifiers
alone, without any clustering. The classification results are presented in Table 5. COEC
performs 3.62% better than k–NN, 5.33% better than NN and 6.51% better than SVM
classifiers. This implies that clustering has significant impact on the learning of the ensemble
classifier (Section 6.1) leading to overall better performance. We justify this claim by
defining the null and alternative hypothesis in Table 6 for each pair of classifiers. Note that
the null hypothesis is rejected at 0.05 significance level using two–tailed sign test for each
classifier pair. This implies that clustering significantly impacts the learning in COEC and leads to better overall classification performance.
Conventional algebraic fusion methods fuse the class confidence values produced by the base
classifiers to produce the class confidence values of the ensemble classifier. In COEC the
base classifiers produce cluster confidence values. If conventional algebraic methods (e.g.
mean of confidence values) are used in COEC the cluster confidence values will be produced
for the ensemble classifier. The cluster–to–class mapping can then be obtained using majority
voting. The class having the maximum number of patterns in the cluster will win the vote. This
process is not suitable for strongly non–atomic clusters as it undermines the class patterns
significantly present in the cluster but not in the majority. This will thus impact the overall
classification accuracy. A fusion classifier will perform better under this circumstance. The
targets of the fusion classifier are set according to the proportions of class patterns and it is trained
accordingly. The fusion classifier thus gives importance to all the classes in a cluster. We have
compared the fusion classifier against algebraic fusion (mean confidence for cluster and majority voting for class) when used with COEC.
Overall the fusion classifier performs 1.08% better than algebraic fusion. This implies that
the use of fusion classifier significantly improves the performance of COEC compared to
algebraic fusion. To justify this claim we define the following null and alternative hypothesis:
Null Hypothesis: The fusion classifier approach is equivalent to the algebraic fusion approach when used with COEC.
Alternative Hypothesis: The fusion classifier approach is significantly better than the algebraic fusion approach when used with COEC.
Note that the null hypothesis is rejected at the 0.05 significance level by the two–tailed sign test from the comparative classification performances of the two fusion approaches.
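For reference, the algebraic fusion baseline described above (mean of the base classifiers' cluster confidences followed by majority voting of the classes within the winning cluster) can be sketched as follows; the cluster-to-class majority mapping and names are illustrative:

```python
import numpy as np

def algebraic_fusion_predict(base_classifiers, test_patterns,
                             cluster_ids, class_labels, n_clusters, n_classes):
    """Mean-rule fusion of cluster confidences, then majority voting:
    each test pattern gets the majority class of its most likely cluster."""
    # majority class of every training cluster
    counts = np.zeros((n_clusters, n_classes), dtype=int)
    for k, c in zip(cluster_ids, class_labels):
        counts[k, c] += 1
    cluster_to_class = counts.argmax(axis=1)

    # mean of the cluster confidence vectors over the base classifiers
    mean_conf = np.mean([clf.predict_proba(test_patterns)
                         for clf in base_classifiers], axis=0)
    winning_cluster = mean_conf.argmax(axis=1)
    return cluster_to_class[winning_cluster]
```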
In order to find the position of COEC we have classified the data sets using classical
ensemble classifiers namely bagging, boosting, and random subspace method. Figure 16
provides a summary of the classification accuracies obtained using COEC and other
ensemble classifiers. On average COEC performs 6.05% better than bagging, 8.20% better
than boosting and 9.08% better than the random subspace method. As mentioned in Section 2, the
classical methods aim to achieve diversity and do not provide any mechanism to improve the
learning performance of the base classifiers. In COEC this issue is handled by first allowing the
base classifiers to learn cluster boundaries. As clusters have well defined boundaries, they are easier
for the base classifiers to learn. The fusion classifier performs the cluster to class mapping and
as observed in the previous section it performs better than the conventional fusion methods.
This combination of cluster boundary learning and fusion classifier mapping leads to better
performance of COEC. We justify this claim by conducting a sign test as presented in Table 8.
Note that the null hypothesis is rejected in all cases either at 0.05 or 0.15 significance level
indicating the fact that COEC performs significantly better than the conventional ensemble
classifiers.
Figure 16: Classification performance comparison between COEC and classical ensemble
classifiers.
COEC vs. random subspace method — Null Hypothesis: COEC is equivalent to the random subspace method. Alternative Hypothesis: COEC is significantly better than the random subspace method. Sign test: Null Hypothesis rejected at the 0.15 significance level from the comparative classification performances of COEC and the random subspace method in Figure 16.
7. Conclusion
We have presented a novel cluster oriented ensemble classifier (COEC) which is based on
learning of cluster boundaries by the base classifiers, leading to better learning capability and higher classification accuracy.
The proposed COEC has been evaluated on benchmark data sets from UCI machine learning
repository. The detailed experimental results and their significance using two-tailed sign test
have been presented and analysed in Section 6. The evidence from the experimental results
and two–tailed sign test show that (i) homogeneous clustering performs significantly better
than heterogeneous clustering with COEC. As shown in Section 6.1, overall the
homogeneous clustering performs 14.38% better than heterogeneous clustering. (ii) the
proposed COEC performs significantly better than its base counterparts. As shown in Section
6.3, overall COEC performs 3.62% better than k–NN, 5.33% better than NN and 6.51% better
than SVM classifiers. (iii) fusion classifier performs significantly better than algebraic fusion
with COEC. As shown in Section 6.4, overall the fusion classifier performs 1.08% better than
algebraic fusion. (iv) COEC outperforms classical ensemble classifiers, namely bagging,
boosting and the random subspace method, significantly on benchmark data sets. As shown in
Section 6.5, overall COEC performs 6.05% better than bagging, 8.20% better than boosting and 9.08% better than the random subspace method.

In our future research, we would like to focus on finding the optimal number of clusters.
References
[1] R. Polikar, “Ensemble based systems in decision making,” IEEE Circuits and Systems
161–168, 2006.
[4] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[5] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, Oct. 2001.
[7] R. E. Schapire, “The strength of weak learnability,” Machine Learning, vol. 5, no. 2, pp.
197–227, 1990.
[8] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and
an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp.
119–139, 1997.
selection,” IEEE Transaction on Neural Networks, vol. 20, no. 2, pp. 258–277, 2009.
[13] L. Nanni and A. Lumini, “Fuzzy bagging: a novel ensemble of classifiers,” Pattern
[14] L. Chen and M. S. Kamel, “A generalized adaptive ensemble generation and aggregation
approach for multiple classifiers systems,” Pattern Recognition, vol. 42, pp. 629–644,
2009.
[15] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239,
1998.
[16] L. I. Kuncheva, J. C. Bezdek, and R. Duin, “Decision templates for multiple classifier
fusion: An experimental comparison,” Pattern Recognition, vol. 34, no. 2, pp. 299–314,
2001.
[17] A. H. R. Ko, R. Sabourin, A. de S. Britto, and L. Oliveira, “ Pairwise fusion matrix for
[18] N. M. Wanas, R. A. Dara, and M. S. Kamel, “Adaptive fusion and co-operative training
for classifier ensembles,” Pattern Recognition, vol. 39, pp. 1781–1794, 2006.
[20] D. Parikh and R. Polikar, “Ensemble based incremental learning approach to data fusion,”
IEEE Transactions on Systems, Man, and Cybernetics, vol. 37, no. 2, pp. 437–450, 2007.
of new classes,” IEEE Transaction on Neural Networks, vol. 20, no. 1, pp. 152–168,
2009.
[22] R. Maclin and J. W. Shavlik, “Combining the Predictions of Multiple Classifiers: Using
[25] A. Strehl and J. Ghosh, “Cluster ensembles – a knowledge reuse framework for
combining multiple partitions,” The Journal of Machine Learning Research, vol. 3, pp.
583–617, 2003.
February 2010.
[29] J. Demsar, “Statistical comparisons of classifiers over multiple data sets,” Journal of
Machine Learning Research, vol. 7, pp. 1–30, 2006.
[31] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The
WEKA Data Mining Software: An Update,” SIGKDD Explorations, vol. 11, no. 1, 2009.
Author Biography
pattern recognition and computational intelligence. He has published thirteen books, seven
book chapters and over a hundred papers in journals and conference proceedings. He has
received twelve competitive research grants and supervised thirty one research students in the
areas of pattern recognition and computational intelligence. He has served on the program
committees of over thirty international conferences and editorial boards of six international
journals. He is a Senior Member of IEEE and has served as a Chair of IEEE Computational
Ashfaqur Rahman received his Ph.D. degree in Information Technology from Monash
University, Australia in 2008. Currently, he is a Research Fellow at the Centre for Intelligent
and Networked Systems (CINS) at Central Queensland University (CQU), Australia. His
major research interests are in the fields of data mining and multimedia signal processing. He has published a number of
journal articles and conference papers. Dr. Rahman is the recipient of numerous academic
awards including CQU Seed Grant, the International Postgraduate Research Scholarship
(IPRS), Monash Graduate Scholarship (MGS) and FIT Dean Scholarship by Monash
University, Australia.