Abstract
This paper presents four schemes for soft fusion of the outputs of multiple classifiers. In the first three approaches, the weights assigned to the classifiers or groups of them are data dependent. The first approach involves the calculation of fuzzy integrals. The second scheme performs weighted averaging with data-dependent weights. The third approach performs linear combination of the outputs of classifiers via the BADD defuzzification strategy. In the last scheme, the outputs of multiple classifiers are combined using Zimmermann's compensatory operator. An empirical evaluation using widely accessible data sets substantiates the validity of the approaches with data-dependent weights, compared to various existing combination schemes of multiple classifiers. © 1999 Elsevier Science B.V. All rights reserved.
0167-8655/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S0167-8655(99)00012-4
A. Verikas et al. / Pattern Recognition Letters 20 (1999) 429–444
Tresp and Taniguchi, 1995) or can express the worth averaged over the entire data space (Sollich and Krogh, 1996; Krogh and Vedelsby, 1995). The use of data-dependent weights, when properly estimated, provides higher classification accuracy (Woods et al., 1997; Verikas et al., 1998).

In practice, outputs of multiple classifiers are usually highly correlated. Therefore, it is desirable to assign weights not only to individual classifiers, but to groups of them also. This would express interaction between different classifiers. Combination schemes based on fuzzy integrals possess this valuable property (Cho and Kim, 1995; Gader et al., 1996; Verikas et al., 1998).

In spite of the variety of combination schemes proposed, only a small amount of effort was made to compare the different approaches using widely available databases. The objective of this work is to present methods for improving the accuracy of multiple neural classifier systems as well as to compare the proposed approaches with various existing combination schemes of multiple classifiers on widely accessible databases. We propose four schemes for soft fusion of outputs of multiple classifiers. The first approach involves the calculation of fuzzy integrals. The second scheme performs weighted averaging with data-dependent weights. The third approach performs linear combination of the outputs of classifiers via the BADD defuzzification (Basic Defuzzification Distributions (Filev and Yager, 1991)) strategy. In the last scheme, the outputs of multiple classifiers are combined using Zimmermann's compensatory operator. We begin with a short description of the databases used. Next, we briefly describe the various combination schemes used for comparison as well as our proposed approaches. Finally, the experimental procedures and results for four different data sets are presented.

2. Data

The databases are available by anonymous ftp at dice.ucl.ac.be in the directory pub/neural/ELENA/databases. From the ELENA project we have chosen two data sets representing artificial data, Clouds and Concentric, and two sets representing real applications, Phoneme and Satimage.

The Clouds data are two-dimensional. There are two a priori equally probable classes in the data set. The graphical representation of the data set is given in Fig. 1. The Clouds data classes are relatively highly overlapped. The class boundaries are highly nonlinear. The class $\omega_0$ is the sum of three different Gaussian distributions

$p(\mathbf{x} \mid \omega_0) = \frac{1}{2}\left[\frac{p_1(\mathbf{x})}{2} + \frac{p_2(\mathbf{x})}{2} + p_3(\mathbf{x})\right]$,  (1)

where

$p_j(\mathbf{x}) = \frac{1}{2\pi \sigma_{jx} \sigma_{jy}} \exp\left(-\frac{(x - m_{jx})^2}{2\sigma_{jx}^2} - \frac{(y - m_{jy})^2}{2\sigma_{jy}^2}\right)$,  (2)

where $m_{jx}$ and $m_{jy}$ are the x and y means of the jth distribution and $\sigma_{jx}$, $\sigma_{jy}$ are the corresponding standard deviations. Table 1 shows the parameters of the distributions of class $\omega_0$.

The class $\omega_1$ is a single Gaussian distribution:

$p(\mathbf{x} \mid \omega_1) = \frac{1}{2\pi} \exp\left(-\frac{x^2 + y^2}{2}\right)$.  (3)

The theoretical error is 9.66%. The error obtained in the ELENA project when using a single
multilayer perceptron (MLP) is 12.3%. In all the experiments presented in the ELENA project, the MLP used contained two hidden layers of 20 and 10 units, respectively.

Table 1
Parameters of the distributions of class $\omega_0$

j    m_jx    m_jy    sigma_jx    sigma_jy
1    0.0     0.0     0.2         0.2
2    0.0     2.0     0.2         0.2
3    2.0     1.0     0.2         1.0

Table 2
Summary of the Satimage database

Class label    Description                     Number of instances    Percentage
1              Red soil                        1533                   23.82
2              Cotton crop                     703                    10.92
3              Grey soil                       1358                   21.10
4              Damp grey soil                  626                    9.73
5              Soil with vegetation stubble    707                    10.99
6              Very damp grey soil             1508                   23.43
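The densities in Eqs. (1)–(3) are easy to sample and to plug into the Bayes rule, which is how the 9.66% theoretical error quoted above can be checked. A minimal numpy sketch; the class-$\omega_0$ mixture weights used here (1/4, 1/4, 1/2) are our reading of Eq. (1) and should be treated as an assumption to verify against the ELENA documentation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Component parameters of class omega_0 (Table 1) and assumed mixture weights.
MEANS = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
STDS = np.array([[0.2, 0.2], [0.2, 0.2], [0.2, 1.0]])
WEIGHTS = np.array([0.25, 0.25, 0.5])  # assumption, read off Eq. (1)

def pdf0(x):
    """Class omega_0 density, Eqs. (1) and (2)."""
    comps = [np.exp(-0.5 * np.sum(((x - m) / s) ** 2, axis=1))
             / (2.0 * np.pi * s[0] * s[1]) for m, s in zip(MEANS, STDS)]
    return WEIGHTS @ np.asarray(comps)

def pdf1(x):
    """Class omega_1 density, Eq. (3): a standard bivariate normal."""
    return np.exp(-0.5 * np.sum(x ** 2, axis=1)) / (2.0 * np.pi)

def sample_clouds(n, rng):
    """Draw n labelled points; the two classes are a priori equally probable."""
    labels = rng.integers(0, 2, size=n)
    x = np.empty((n, 2))
    m0 = labels == 0
    comp = rng.choice(3, size=int(m0.sum()), p=WEIGHTS)
    x[m0] = rng.normal(MEANS[comp], STDS[comp])
    x[~m0] = rng.normal(0.0, 1.0, size=(int((~m0).sum()), 2))
    return x, labels

X, y = sample_clouds(100_000, rng)
bayes_pred = (pdf1(X) > pdf0(X)).astype(int)  # Bayes rule with equal priors
print("estimated Bayes error:", np.mean(bayes_pred != y))
```

If the mixture-weight assumption is right, the printed estimate should sit near the quoted 9.66%; a noticeably different value would point to different weights in the original Eq. (1).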
The Concentric data are two-dimensional with two classes and uniform concentric circular distributions. The points of the class $\omega_0$ are uniformly distributed in a circle of radius 0.3 centred on (0.5, 0.5). The points of the class $\omega_1$ are uniformly distributed in a ring centred on (0.5, 0.5) with internal and external radii equal to 0.3 and 0.5, respectively. There are 2500 instances, with 1579 in class $\omega_1$ and the remainder in class $\omega_0$. There is no overlap between the classes. The theoretical error is 0%. The mean error reported in the ELENA project is 2.8%. The graphical representation of the Concentric data set is given in Fig. 2.

The Satimage database was generated from Landsat Multi-Spectral Scanner image data. One frame of the Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. The spatial resolution of a pixel is about 80 m × 80 m. Each image contains 2340 × 3380 such pixels, with 0 corresponding to black and 255 to white. The Satimage database is a sub-area of a scene, consisting of 82 × 100 pixels. The database contains 6435 36-dimensional patterns (4 spectral bands × 9 pixels in a neighbourhood). There are six classes in the database. Table 2 gives a summary of the classes. The database also contains a five-dimensional description of the data. The five dimensions were obtained by using discriminant factorial analysis. The five-dimensional description of the Satimage data was used in this study.

The Phoneme database. The aim of this database is to distinguish between nasal and oral vowels. Thus, there are two different classes: the Nasals in class $\omega_0$ and the Orals in class $\omega_1$. This database contains vowels coming from 1809 isolated syllables. Five features were chosen to characterise each vowel. The features are the amplitudes of the five first harmonics normalised by the total energy integrated over all the frequencies. Each harmonic is signed either positive, when it corresponds to a local maximum of the spectrum, or negative otherwise (Jutten et al., 1995). There are 3818 patterns from class $\omega_0$ and 1586 patterns from class $\omega_1$.
The data sets used are summarised in Table 3. The ELENA errors presented in the table are taken from the ELENA project. The errors presented are the average errors obtained in the ELENA project while using an MLP of the above-mentioned structure.

Table 3
Summary of the data sets used

Data set      Number of classes    Number of features    Number of samples    ELENA error %    Bayes error %
Clouds        2                    2                     5000                 12.3 ± 1.1       9.66
Concentric    2                    2                     2500                 2.8 ± 0.7        0
Phoneme       2                    5                     5404                 16.4 ± 1.3       Not available
Satimage      6                    5                     6435                 11.9 ± 1.0       Not available
$\mathrm{Bel}(i) = \frac{\prod_{k=1}^{L} P(\mathbf{x} \in q_i \mid \lambda_k(\mathbf{x}) = j_k)}{\sum_{i=1}^{Q} \prod_{k=1}^{L} P(\mathbf{x} \in q_i \mid \lambda_k(\mathbf{x}) = j_k)}, \quad 1 \le i \le Q.$  (8)

The final decision is then made according to the following rule:

$\mathbf{x} \in q_i, \ \text{if } \mathrm{Bel}(i) \ge \mathrm{Bel}(j), \ \forall j \ne i.$  (9)

3.5. Weighted averaging

For notational convenience, we consider networks with a single output $y_i$, although the generalisation to several outputs is straightforward. We denote the true output of a network by $d(\mathbf{x})$. Thus, the actual output of each network can be written as the desired output plus an error:

$y_i(\mathbf{x}) = d(\mathbf{x}) + \epsilon_i(\mathbf{x}).$  (10)

The average squared error for the ith network can be written as

$e_i = E\left[\{y_i(\mathbf{x}) - d(\mathbf{x})\}^2\right] = E\left[\epsilon_i^2\right],$  (11)

where $E$ denotes the expectation.

A weighted combination of outputs of a set of $L$ networks (a committee output) can be written as

$\bar{y}(\mathbf{x}) = \sum_{i=1}^{L} w_i y_i(\mathbf{x}) = d(\mathbf{x}) + \sum_{i=1}^{L} w_i \epsilon_i(\mathbf{x}),$  (12)

where the weights $w_i$ need to be determined. If $C$ is the error correlation matrix given by

$c_{ij} = E\left[\epsilon_i(\mathbf{x})\,\epsilon_j(\mathbf{x})\right] = \frac{1}{N} \sum_{n=1}^{N} \left(y_i(\mathbf{x}_n) - d(\mathbf{x}_n)\right)\left(y_j(\mathbf{x}_n) - d(\mathbf{x}_n)\right),$  (13)

where $N$ is the number of samples, the error due to the committee can be written as

$e = \sum_{i=1}^{L} \sum_{j=1}^{L} w_i w_j c_{ij}.$  (14)

Optimal values for the weights $w_i$ can be determined by the minimisation of $e$. A non-trivial minimum can be found by requiring $\sum_{i=1}^{L} w_i = 1$. The solution for the $w_i$ is

$w_i = \frac{\sum_{j=1}^{L} (C^{-1})_{ij}}{\sum_{k=1}^{L} \sum_{j=1}^{L} (C^{-1})_{kj}}.$  (15)

One problem with the constraint imposed is that it does not prevent the weights from adopting large negative or positive values. The inverse of $C$ can be unstable. Redundancy in members of a committee leads to linear dependencies in the rows and columns of $C$. We might, therefore, seek to constrain the weights further by requiring that $w_i \ge 0, \ \forall i = 1, \ldots, L$.

The generalisation error of a committee can be decomposed into the sum of two terms (Krogh and Vedelsby, 1995):

$e = E\left[\{\bar{y}(\mathbf{x}) - d(\mathbf{x})\}^2\right] = \sum_i w_i E\left[\{y_i(\mathbf{x}) - d(\mathbf{x})\}^2\right] - \sum_i w_i E\left[\{y_i(\mathbf{x}) - \bar{y}(\mathbf{x})\}^2\right].$  (16)

The first term depends only on the errors of individual networks, while the second term depends on the spread of outputs of the committee members. We see that the committee error will decrease if we can increase the spread of outputs of the committee members without increasing the errors of the members themselves. Using the fact that the weights sum to one, Eq. (16) can be written as (Krogh and Vedelsby, 1995)

$e = \sum_k w_k e_k + \sum_{k,l} w_k w_l c_{kl} - \sum_k w_k c_{kk},$  (17)

where $e_k$ is the estimate of the error of the kth network. Using estimates of the errors of
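Eq. (15) amounts to normalised row sums of $C^{-1}$; a toy numeric sketch with simulated, deliberately correlated network errors (illustrative data, not the authors' experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated residual errors eps_i(x_n) of L = 3 networks on N samples;
# a shared noise term makes the members correlated, as in practice.
N, L = 1000, 3
common = rng.normal(0.0, 0.3, size=(N, 1))
errors = common + rng.normal(0.0, [0.1, 0.2, 0.3], size=(N, L))

C = errors.T @ errors / N            # error correlation matrix, Eq. (13)
Cinv = np.linalg.inv(C)
w = Cinv.sum(axis=1) / Cinv.sum()    # optimal weights, Eq. (15); sum to one

committee_err = errors @ w           # combined error term of Eq. (12)
print("weights:", w)
print("committee MSE:", (committee_err ** 2).mean())
print("member MSEs:  ", (errors ** 2).mean(axis=0))
```

Because Eq. (15) minimises $\mathbf{w}^\top C \mathbf{w}$ subject only to the weights summing to one, the committee mean-squared error is never above the best member's on the data used to estimate $C$; note also that nothing here keeps the weights non-negative, which is exactly the instability discussed after Eq. (15).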
3.6. Combination via fuzzy integral

3.6.1. Background on fuzzy integral and fuzzy measure

Definition 1. A set function $g: 2^Z \to [0, 1]$ is a fuzzy measure if
1. $g(\emptyset) = 0$, $g(Z) = 1$;
2. if $A, B \in 2^Z$ and $A \subset B$, then $g(A) \le g(B)$;
3. if $A_n \in 2^Z$ for $1 \le n < \infty$ and the sequence $\{A_n\}$ is monotone in the sense of inclusion, then $\lim_{n \to \infty} g(A_n) = g(\lim_{n \to \infty} A_n)$.

In general, the fuzzy measure of a union of two disjoint subsets cannot be directly computed from the fuzzy measures of the subsets. Sugeno (1977) introduced the decomposable, so-called $\lambda$-fuzzy measure satisfying the following additional property:

$g(A \cup B) = g(A) + g(B) + \lambda g(A) g(B)$  (19)

for all $A, B \subset Z$ with $A \cap B = \emptyset$, and for some $\lambda > -1$.

Let $Z = \{z_1, z_2, \ldots, z_L\}$ be a finite set (a set of committee members in our case) and let $g_i = g(\{z_i\})$. The values $g_i$ are called the densities of the measure. The value of $\lambda$ is found from the equation $g(Z) = 1$, which is equivalent to solving the following equation:

$\lambda + 1 = \prod_{i=1}^{L} (1 + \lambda g_i).$  (20)

When $g$ is the $\lambda$-fuzzy measure, the values of $g(A_i)$ can be computed recursively as follows:

$g(A_L) = g_L,$  (21)

$g(A_i) = g_i + g(A_{i+1}) + \lambda g_i g(A_{i+1}), \quad 1 \le i < L,$  (22)

since $A_i = \{z_i\} \cup A_{i+1}$ and the two sets are disjoint. The discrete Choquet integral of a function $h: Z \to [0, 1]$ with respect to $g$ is then

$C_g = \sum_{i=1}^{L} [h(z_i) - h(z_{i-1})] g(A_i),$  (23)

where indices $i$ have been permuted so that $0 \le h(z_1) \le \cdots \le h(z_L) \le 1$, $A_i = \{z_i, \ldots, z_L\}$ and $h(z_0) = 0$.

3.6.2. Combination by Choquet integral

We adopted Sugeno's $\lambda$-fuzzy measure and assigned the fuzzy densities $g_i$, the degrees of importance of each network, based on the performance of the networks on validation data. We computed the densities as follows:

$g_i = \frac{p_i}{\sum_{j=1}^{L} p_j}\, d_S,$  (24)

where $p_i$ is the performance of the ith network and $d_S$ is the desired sum of fuzzy densities. We assumed that committee members have $Q$ outputs representing $Q$ classes, and data point $\mathbf{x}$ needs to be assigned to one of the classes. The class label $c$ for the data point $\mathbf{x}$ is then determined as follows:

$c = \arg\max_{q=1,\ldots,Q} C_g(q),$  (25)

where $C_g(q)$ is the Choquet integral for the class $q$. The values of the function $h(z)$ that appear in the Choquet integral are given by the output values of the members of the committee (the evidence provided by the members).

3.7. Combination by fuzzy integral with data-dependent densities

We use Sugeno's $\lambda$-fuzzy measure and the Choquet integral. This approach assumes that the data space is partitioned into $K$ regions with
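Eqs. (20) and (23)–(25) translate directly into code. A sketch with illustrative performance values; the $g(A_i)$ recursion is the standard one implied by Eq. (19), since $A_i = \{z_i\} \cup A_{i+1}$:

```python
import numpy as np

def solve_lambda(g, n_iter=200):
    """Solve Eq. (20), prod(1 + lam*g_i) = 1 + lam, for the Sugeno lambda."""
    g = np.asarray(g, float)
    f = lambda lam: np.prod(1.0 + lam * g) - (1.0 + lam)
    if abs(g.sum() - 1.0) < 1e-12:
        return 0.0                          # additive case
    # The non-zero root is positive if sum(g) < 1, and in (-1, 0) otherwise.
    lo, hi = (1e-9, 1e6) if g.sum() < 1.0 else (-1.0 + 1e-6, -1e-9)
    for _ in range(n_iter):                 # plain bisection
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def choquet(h, g):
    """Discrete Choquet integral, Eq. (23), under the lambda-fuzzy measure."""
    h, g = np.asarray(h, float), np.asarray(g, float)
    lam = solve_lambda(g)
    order = np.argsort(h)                   # permute so h(z_1) <= ... <= h(z_L)
    h, g = h[order], g[order]
    gA = np.empty_like(g)                   # g(A_i) for A_i = {z_i, ..., z_L}
    gA[-1] = g[-1]
    for i in range(len(g) - 2, -1, -1):     # recursion implied by Eq. (19)
        gA[i] = g[i] + gA[i + 1] + lam * g[i] * gA[i + 1]
    h_prev = np.concatenate(([0.0], h[:-1]))
    return float(np.sum((h - h_prev) * gA))

# Densities as in Eq. (24): performances p_i scaled to a desired sum d_S.
p, d_S = np.array([0.88, 0.92, 0.85]), 1.2
g = p / p.sum() * d_S
print(choquet([0.2, 0.5, 0.9], g))          # fused support for one class
```

Per Eq. (25), this fused support would be computed once per class and the class with the largest value chosen.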
where $C_g^k(q)$ is the Choquet integral for class $q$ calculated in region $k$. The index of the region is determined in the following way:

$k = \arg\min_{i=1,\ldots,K} d(\mathbf{x}, \mathbf{v}_i),$  (28)

with $d(\mathbf{x}, \mathbf{v}_i)$ being the Euclidean distance between the data point $\mathbf{x}$ and the vector $\mathbf{v}_i$ representing the ith region.

3.7.1. Frequency-sensitive competitive learning

The learning algorithm is encapsulated in the following six steps.
1. Initialise the weight vectors $\mathbf{v}_i(0)$ of all the $K$ nodes with small random values.
2. For a given data point $\mathbf{x}$ find the weight vector $\mathbf{v}_j(t)$ yielding a minimum distance:

$d(\mathbf{x}, \mathbf{v}_j) = \min_{i} \|\mathbf{v}_i(t) - \mathbf{x}\|^2, \quad i = 1, 2, \ldots, K.$  (29)

3. Calculate

The data-dependent weights are determined in the same way as the data-dependent densities for the Choquet integral. The output of the combiner is given by

$q = \arg\max_{j=1,\ldots,Q} \sum_{i=1}^{L} w_{ik} y_{ij},$  (34)

where $y_{ij}$ is the jth output of the ith network and the weights $w_{ik}$ are calculated in the same way as $g_{ij}$ in Eq. (26). The region index $k$ is given by Eq. (28).

3.9. Combination by the BADD defuzzification strategy

This approach assumes that the data space is partitioned into $L$ overlapping regions and each network is associated with one of the regions. Decisions from the networks are now combined according to the following rule:
$\bar{y}(\mathbf{x}) = \frac{\sum_{i=1}^{L} \mu_i(\mathbf{x})^{\delta}\, y_i(\mathbf{x})}{\sum_{i=1}^{L} \mu_i(\mathbf{x})^{\delta}},$  (35)

where $y_i(\mathbf{x})$ is the output of the ith network, $\delta$ is a parameter, and $\mu_i(\mathbf{x})$ is the membership degree of a given $\mathbf{x}$ in the ith region of the data space. The membership degrees are calculated in the following way:

$\mu_i(\mathbf{x}) = \frac{1}{(1 + d(\mathbf{x}, \mathbf{w}_i))^p},$  (36)

where $\mathbf{w}_i$ is the weight vector representing the ith region of the data space, $p$ is a constant, and $d(\mathbf{x}, \mathbf{w}_i)$ is the Euclidean distance between the vectors $\mathbf{x}$ and $\mathbf{w}_i$.

This combination scheme actually is the BADD defuzzification strategy. The parameter $\delta$ determines the type of defuzzification (Yager and Filev, 1994). The arithmetic mean, the centre of area and the mean of maximum combination schemes are obtained by changing the parameter $\delta$. After some experiments, the parameter $\delta$ has been chosen to be $\delta = 3$.

3.10. Combination by Zimmermann's compensatory operator

Fuzzy integrals are weighted aggregation operators whose weights are defined not only on the different elements being aggregated, but also on all subsets of them. This thus allows the representation of redundancy and support among the elements. It is interesting, therefore, to compare aggregation by fuzzy integrals with aggregation by fuzzy operators of another type. We have chosen Zimmermann's compensatory operator for this purpose (Zimmermann and Zysno, 1980, 1984):

$\bar{y}(\mathbf{x}) = \left(\prod_{i=1}^{L} y_i(\mathbf{x})^{w_i}\right)^{1-\gamma} \left(1 - \prod_{i=1}^{L} (1 - y_i(\mathbf{x}))^{w_i}\right)^{\gamma},$  (37)

where $0 \le \gamma \le 1$ and $\sum_{i=1}^{L} w_i = L$; the parameters are estimated by fitting the combined output to the desired outputs. The constraints on $\gamma$ and $w_i$ can be eliminated by modifying the definition of the parameters as follows (Krishnapuram and Lee, 1992):

$\gamma = \frac{a^2}{a^2 + b^2},$  (38)

$w_i = \frac{L d_i^2}{\sum_{k=1}^{L} d_k^2}.$  (39)

Now $a$, $b$ and $d_i$ can be chosen without any constraints.

3.11. Optimising the fuzzy measure

The parameters of the $\lambda$-fuzzy measure we used in Sections 3.6 and 3.7 were subjectively estimated from the performance of the networks without further optimisation. Such an approach to the estimation of the parameters of the $\lambda$-fuzzy measure is often used in practice (Cho and Kim, 1995; Chi et al., 1996). As pointed out by one of the reviewers of the manuscript, the comparison between the weighted averaging and the Choquet integral would be more level if the Choquet integral were optimised in a fashion similar to that used for the weighted averaging. It is interesting to compare the weighted averaging and the fuzzy integral on an equal footing, since the weighted averaging is in fact the fuzzy integral with a special kind of fuzzy measure. The discrete Choquet integral can be written as

$C_g = \sum_{i=1}^{L} h(z_i)\left\{g(A_i) - g(A_{i+1})\right\},$  (40)

where $g(A_{L+1}) = 0$. The fuzzy measure can be chosen so that the measure of a set depends only on the size of the set (Chen et al., 1997):

$g(A_i) = 1 - \sum_{j=1}^{i-1} w_j.$  (41)
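With a size-dependent measure, the increments $g(A_i) - g(A_{i+1})$ collapse to single weights, so the Choquet integral of Eq. (40) becomes a weighted sum of the sorted inputs. A small sketch with illustrative weights:

```python
import numpy as np

def los(z, w):
    """Linear combination of order statistics: sort z ascending, weight by w."""
    return float(np.dot(np.sort(z), w))

def choquet_size_measure(h, w):
    """Choquet integral, Eq. (40), with a size-dependent measure
    g(A_i) = 1 - sum_{j<i} w_j; with weights summing to one,
    g(A_i) - g(A_{i+1}) reduces to w_i."""
    h = np.sort(h)                       # permute so h(z_1) <= ... <= h(z_L)
    gA = 1.0 - np.concatenate(([0.0], np.cumsum(w)[:-1]))
    gA_next = np.concatenate((gA[1:], [0.0]))   # g(A_{L+1}) = 0
    return float(np.sum(h * (gA - gA_next)))

w = np.array([0.1, 0.2, 0.3, 0.4])       # weights summing to one
h = np.array([0.9, 0.2, 0.6, 0.4])       # committee outputs for one class
print(los(h, w), choquet_size_measure(h, w))   # the two coincide
```

The coincidence of the two values is the point made in the text: a Choquet integral whose measure depends only on set size is exactly a weighted combination of order statistics.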
Let $\mathbf{z} = (z_1, z_2, \ldots, z_L)$ be a vector. The ith order statistic $z_{(i)}$ of $\mathbf{z}$ is the ith smallest element of $\mathbf{z}$, where $z_{(1)} \le z_{(2)} \le \cdots \le z_{(L)}$. Let $\mathbf{w} = (w_1, w_2, \ldots, w_L)$ be a weight vector constrained so that $\sum_{i=1}^{L} w_i = 1$ and $0 \le w_i \le 1, \ \forall i = 1, 2, \ldots, L$. The linear combination of order statistics of $\mathbf{z} = (z_1, \ldots, z_L)$ with the weight vector $\mathbf{w} = (w_1, \ldots, w_L)$ is defined as (Chen et al., 1997)

$\mathrm{LOS}(\mathbf{z}; \mathbf{w}) = \sum_{i=1}^{L} w_i z_{(i)}.$  (44)

Thus, the weighted averaging and the LOS operators are in fact Choquet integral operators. Therefore, we included the optimised version of the Choquet integral with the $\lambda$-fuzzy measure and the LOS operator in our comparisons. The optimisation of the Choquet integral has been done in the same manner as that of the weighted averaging approach. The optimal weights for the LOS operator are found by minimising the following quantity:

$\sum_{n=1}^{N} \sum_{j=1}^{Q} \left\{\sum_{i=1}^{L} w_i z_{j(i)}(\mathbf{x}_n) - d_j(\mathbf{x}_n)\right\}^2,$  (45)

where $N$ is the number of data samples, $Q$ is the number of classes, and $z$ and $d$ stand for the neural network output value and the target value, respectively. This is a quadratic objective function with linear constraints. The minimisation is done using quadratic programming (Grabisch and Nicolas, 1994).

4. Experimental testing

All comparisons between different classifiers in the ELENA project have been done using the Holdout (HO) method with equal training and testing parts of the data. To make the comparisons feasible, we have also used the same method to estimate the classification accuracy.

In all the tests presented here, we train five one-hidden-layer MLPs with 10 sigmoidal hidden units using the error-backpropagation algorithm. We run each experiment eight times, and the min, mean and max errors presented are calculated from these eight trials. We divide the learning set into five parts when training members of the committee. We use one data part, which is different for each member of the committee, for ``early stopping''. The other parts are used for training. Therefore, the members are trained on partially disjoint learning sets. We use the entire HO training part of the data to estimate the combination weights or fuzzy densities. Ten reference points are used to compute data-dependent densities and weights. We present the combination results for two types of training: with early stopping (ES), the middle three columns of the tables given below; and saturated learning (SL), the three columns on the right-hand side of the tables.

4.1. The Clouds data

For the Clouds data, training of each MLP was repeated 10 times with different initialisations. The best outcome of the 10 trials was used when testing the combination schemes. Therefore, the result presented in Tables 4 and 5 as ``The best single NN'' is actually the best result selected from the 50 trials (10 trials for each NN × 5). Due to such a careful selection, the members of the committee were very even in their performances. Table 4 presents performance evaluation results of various schemes of combination for the Clouds data, when 2000 and 500 data points were included in the learning and early stopping sets (if used), respectively.

As can be seen from Table 4, there is no improvement in the classification accuracy when combining regularised (by early stopping) networks. Only a combination by the BADD defuzzification strategy gives a little rise in classification accuracy. Encountering such behaviour of the combined network, one can expect the outputs of the separate networks to be highly correlated. The computed mean value of the correlation coefficients between the outputs of the different networks substantiated that this is indeed the case. The mean value of the correlation coefficient was found to be $\rho = 0.787$. When combining networks by the BADD defuzzification strategy, we train separate networks in the different regions of the input space. This leads to more diverse outputs of the networks. Therefore, a slight
Table 4
Performance of various schemes of combination for the Clouds data when training sets consist of 2000 instances (left three error columns: training with early stopping; right three: saturated learning)
Classification schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 10.7 11.0 11.2 11.0 11.2 11.3
Majority rule 11.0 11.1 11.3 10.9 11.0 11.3
Borda count 11.0 11.1 11.3 10.9 11.0 11.3
Averaging 11.0 11.1 11.2 10.9 11.0 11.5
Bayesian 10.6 11.0 11.3 10.8 10.9 11.1
Zimmermann's operator 10.5 11.1 11.2 10.9 11.1 11.2
Weighted averaging (WA) 10.6 10.9 11.1 10.8 11.0 11.2
WA with dd weights 10.6 11.0 11.2 10.7 11.2 11.4
Choquet integral (CI) 10.8 11.1 11.2 10.9 11.0 11.2
Optimised Choquet integral 10.7 11.0 11.1 10.8 11.0 11.2
LOS 10.7 11.0 11.1 10.8 11.0 11.4
CI with dd densities 10.5 11.0 11.2 10.9 11.1 11.4
BADD 10.7 10.8 10.9 10.8 11.2 11.5
The abbreviation ``dd'' used in the table stands for ``data dependent''.
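The ``Zimmermann's operator'' rows above aggregate the five network outputs per class according to Eqs. (37)–(39). A minimal sketch with illustrative values (not the experimental parameters):

```python
import numpy as np

def zimmermann(y, a, b, d):
    """Compensatory aggregation, Eq. (37); gamma and w_i are built from the
    unconstrained parameters a, b, d via Eqs. (38)-(39)."""
    y = np.clip(np.asarray(y, float), 1e-12, 1.0 - 1e-12)
    d = np.asarray(d, float)
    gamma = a**2 / (a**2 + b**2)                 # Eq. (38), always in [0, 1]
    w = len(d) * d**2 / np.sum(d**2)             # Eq. (39), sums to L
    conj = np.prod(y ** w)                       # conjunctive (and-like) part
    disj = 1.0 - np.prod((1.0 - y) ** w)         # disjunctive (or-like) part
    return conj ** (1.0 - gamma) * disj ** gamma

# Five network outputs for one class; a = b gives gamma = 0.5,
# i.e. equal compensation between the two parts.
print(zimmermann([0.6, 0.7, 0.8, 0.55, 0.9], a=1.0, b=1.0, d=np.ones(5)))
```

Setting $a = 0$ recovers the pure product (fully conjunctive) aggregation, and $b = 0$ the pure algebraic sum, which is how the operator interpolates between and-like and or-like behaviour.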
Table 5
Performance of various combination schemes for the Clouds data when training sets consist of 150 instances (left three error columns: training with early stopping; right three: saturated learning)
Classification schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 12.8 13.1 13.8 12.5 13.7 15.9
Majority rule 12.0 12.3 12.6 12.0 12.7 13.5
Borda count 12.0 12.3 12.6 12.0 12.7 13.5
Averaging 11.8 12.0 12.9 11.8 12.5 13.0
Bayesian 12.0 12.3 12.6 11.9 12.7 13.3
Zimmermann's operator 11.5 12.0 12.5 12.1 13.0 14.0
Weighted averaging (WA) 11.6 12.1 12.6 12.2 12.7 13.3
WA with dd weights 11.3 11.5 11.8 11.4 12.0 12.5
Choquet integral (CI) 11.4 11.8 12.4 11.9 12.5 13.4
Optimised Choquet integral 11.4 11.8 12.3 11.9 12.5 13.3
LOS 11.6 12.2 12.9 12.0 12.4 13.0
CI with dd densities 10.9 11.4 12.1 11.5 11.9 12.2
BADD 11.7 12.1 12.4 13.0 14.1 15.6
The abbreviation ``dd'' used in the table stands for ``data dependent''.
improvement in classification accuracy is obtained. On the other hand, the obtained classification error is quite close to the theoretical one. Therefore, any improvements are hardly achievable. The comparison of the middle and the right-hand parts of Table 4 shows that there is only a small difference between the results of the early-stopped and the saturated learning. This is not surprising, since the number of training samples exceeds the number of weights nearly 40 times. Amari et al. (1996) point out that, for large networks, if the number of training samples exceeds the number of weights more than 30 times, early stopping is not effective. The mean value of the correlation coefficient for the saturated learning was $\rho = 0.938$. Thus, in the SL case, the networks converge almost to the same solution.

In practice, learning sets are usually very limited. Therefore, in the next test, we reduced the learning sets to 150 samples. Table 5 summarises the results of the test. As can be seen from the table, all the combination schemes improve classification performance when early stopping is applied. In the saturated learning case, the usefulness of the BADD combination scheme can be a point of contention. The BADD training scheme divides the input space into several regions. It seems that for the small training sets, reduction of
the size of the training regions may deteriorate the ability of the networks to generalise.

The mean values of the correlation coefficient between the outputs of the different networks were $\rho = 0.839$ and $\rho = 0.674$ for the ES and SL cases, respectively. If the values of the correlation coefficients for the ES and SL cases are compared, it is possible to see that, for the large training set, SL increases the correlation between the networks. For the small training set, by contrast, SL decreases the correlation. Combining less correlated networks may increase the performance of the committee. For this data, however, this increase does not compensate for the performance decrease that occurs due to over-training.

Table 5 shows that the proposed approaches for combining multiple networks (the shadowed lines in the table) yielded the best performances. The performances achieved can be compared with the 12.3% error rate obtained in the ELENA project. Note that the network used in the ELENA project had 292 weights and was trained on 2500 data points. By contrast, each network in this study was trained on 150 data points and had 52 weights (260 weights in total for the five networks).

4.2. The Concentric data

For the Concentric data, by contrast, only one trial was performed when training each MLP. Therefore, members of the committee were much more diverse in their performances for the Concentric data than for the Clouds data. The best member of the committee performed much better than the others. Tables 6 and 7 summarise the results of the tests for the Concentric data when learning sets contained 1000 and 100 training samples, respectively. As can be seen from Tables 6 and 7, a committee of diverse members combined by the Majority rule or the Borda count performs much worse than the best member of the committee. For the large training sets, a very low classification error was obtained from the best single network. Therefore, we can expect that combination schemes that use optimisation techniques to estimate the combination weights will perform better than those without any optimisation. Table 6 shows that this is the case. The optimised Choquet integral, the LOS operator and the weighted averaging approach yielded the best performance. The BADD combination scheme, again, performs well for the regularised networks trained on the large learning sets. For the small training sets, the proposed combination schemes with data-dependent weights are, once again, among the best approaches for network fusion (the values in bold in the tables). The obtained performance of the committee can be compared with the 2.8 ± 0.7% error rate achieved in the ELENA project using an MLP of the above-mentioned structure. For the Concentric data, the saturated learning altered the correlation between the outputs of the different networks in the same way as for the Clouds data. The saturated learning increased the correlation coefficient for the large training sets and decreased the coefficient for the small training sets.

4.3. The Satimage data

There are six classes of data with significantly differing numbers of instances in the Satimage database. The members of the committee trained on the large data sets were also rather diverse in their performances. The accuracy of the leader was much higher than the accuracy of the other members of the committee. We included an equal number of instances from the different classes (25) when constructing the small learning sets. For the small training sets, the performances of the committee members were less diverse than for the large training sets.

The performance evaluation results for the Satimage data are given in Tables 8 and 9. As can be seen from Table 8, early stopping is again not effective. Over-training is hardly observed on the best single network. Moreover, except for the Majority rule, all the combination schemes perform better when combining the ``over-trained'' networks. Note that the saturated learning has raised the correlation between the different networks for the large training sets and has lowered the correlation for the small training sets. The results obtained once again point out that the Majority rule and the Borda count are
Table 6
Performance of various combination schemes for the Concentric data when training sets consist of 1000 samples (left three error columns: training with early stopping; right three: saturated learning)
Combination schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 0.3 0.5 0.7 0.3 0.6 0.8
Majority rule 0.3 1.9 8.0 0.3 2.2 9.3
Borda count 0.3 1.9 8.0 0.3 2.2 9.3
Averaging 0.2 0.4 0.7 0.2 0.5 0.8
Bayesian 0.3 1.7 7.0 0.2 1.8 6.9
Zimmermann's operator 0.3 0.4 0.5 0.3 0.4 0.8
Weighted averaging (WA) 0.3 0.4 0.5 0.3 0.4 0.5
WA with dd weights 0.3 0.6 0.9 0.3 0.5 0.9
Choquet integral (CI) 0.3 0.5 0.7 0.4 0.5 0.8
Optimised Choquet integral 0.2 0.4 0.5 0.3 0.4 0.5
LOS 0.2 0.4 0.5 0.2 0.4 0.6
CI with dd densities 0.3 0.4 0.6 0.3 0.5 0.7
BADD 0.2 0.4 0.6 0.3 0.6 0.9
The abbreviation ``dd'' used in the table stands for ``data dependent''.
Table 7
Performance of various combination schemes for the Concentric data when training sets consist of 100 samples (left three error columns: training with early stopping; right three: saturated learning)
Combination schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 2.2 3.1 3.3 3.0 4.1 5.2
Majority rule 2.0 9.7 36.3 2.0 7.0 23.7
Borda count 2.0 9.7 36.3 2.0 7.0 23.7
Averaging 2.1 2.7 3.3 2.1 2.4 3.2
Bayesian 2.0 3.7 5.8 2.2 8.0 23.4
Zimmermann's operator 1.3 2.3 3.0 1.9 2.4 2.8
Weighted averaging (WA) 1.3 2.1 2.8 1.8 2.3 2.6
WA with dd weights 1.0 1.5 2.0 1.6 2.0 2.2
Choquet integral (CI) 1.6 2.5 3.1 2.0 2.6 4.0
Optimised Choquet integral 1.4 2.1 2.9 1.8 2.3 2.6
LOS 1.9 2.6 3.2 2.2 2.5 3.1
CI with dd densities 1.2 2.1 2.5 1.5 2.0 2.4
BADD 3.2 3.5 4.4 3.7 4.5 5.4
The abbreviation ``dd'' used in the table stands for ``data dependent''.
Table 8
Performance for the Satimage data when training sets consist of 2500 samples (left three error columns: training with early stopping; right three: saturated learning)
Classification schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 12.2 14.1 20.0 12.5 14.0 20.3
Majority rule 13.3 15.3 18.6 13.7 18.4 30.1
Borda count 15.6 19.2 27.5 15.4 18.5 24.8
Averaging 12.0 13.1 15.1 11.7 12.5 13.9
Bayesian 12.3 14.2 19.0 12.5 12.7 12.9
Zimmermann's operator 11.9 13.0 14.2 11.6 11.8 12.1
Weighted averaging (WA) 12.2 12.7 13.8 11.6 12.1 12.7
WA with dd weights 11.8 12.1 12.4 11.6 11.8 12.0
Choquet integral (CI) 11.9 12.2 12.8 11.6 11.7 12.0
Optimised Choquet integral 11.9 12.2 12.8 11.4 11.7 12.0
LOS 11.9 12.5 14.0 11.5 11.9 12.6
CI with dd densities 11.9 12.1 12.8 11.3 11.7 12.1
BADD 12.1 12.5 13.2 11.3 11.6 12.0
The abbreviation ``dd'' used in the table stands for ``data dependent''.
Table 9
Performance of various combination schemes for the Satimage data when training sets consist of 150 samples (left three error columns: training with early stopping; right three: saturated learning)
Classification schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 14.1 14.4 15.0 16.0 16.5 17.0
Majority rule 12.9 13.9 15.3 13.7 14.1 14.4
Borda count 13.0 14.8 18.0 14.0 14.4 14.8
Averaging 12.4 13.4 14.8 13.1 13.5 13.7
Bayesian 13.5 14.8 15.7 14.5 15.4 15.8
Zimmermann's operator 12.3 13.1 14.6 13.0 13.3 13.9
Weighted averaging (WA) 12.6 13.3 14.8 12.8 13.4 13.8
WA with dd weights 12.2 12.8 13.8 12.5 13.0 13.5
Choquet integral (CI) 12.8 13.8 15.2 13.4 13.8 13.9
Optimised Choquet integral 12.6 13.3 14.8 12.8 13.3 13.8
LOS 12.6 13.4 14.5 12.9 13.4 13.7
CI with dd densities 12.3 13.3 14.1 13.4 13.6 13.7
BADD 13.5 14.0 15.0 13.4 14.1 14.8
The abbreviation ``dd'' used in the table stands for ``data dependent''.
bad choices for combining the diverse members of the committee.

Though there is no significant difference between the results obtained from the several combination schemes, the proposed approaches are among the best. It should be mentioned, however, that the BADD combination scheme is not useful if only a small number of training samples is available. The maximum 12.0% error rate obtained can be compared with the 12.9% error rate achieved in the ELENA project when using a single network of the above-mentioned architecture. The committee shows a quite impressive improvement in classification accuracy if compared with the best single network.

4.4. The Phoneme data

The performance evaluation results for the Phoneme data are given in Table 10. Though the different combination schemes show very similar performances on the Phoneme data, the lowest classification error was achieved by using the weighted averaging with data-dependent weights. We recall that the error rate achieved in the ELENA project was equal to 16.4 ± 1.3%. Again,
Table 10
Performance of various combination schemes for the Phoneme data when training sets consist of 2500 samples (left three error columns: training with early stopping; right three: saturated learning)
Classification schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 15.4 15.8 16.2 15.6 16.0 16.5
Majority rule 14.5 15.3 16.5 15.0 15.5 16.2
Borda count 14.5 15.3 16.5 15.0 15.5 16.2
Averaging 14.8 15.3 15.9 14.7 15.2 16.0
Bayesian 14.9 15.6 16.5 15.3 15.6 16.0
Zimmermann's operator 14.8 15.2 15.5 14.8 15.2 16.0
Weighted averaging (WA) 14.8 15.2 15.4 14.8 15.2 15.8
WA with dd weights 14.6 15.1 15.5 13.0 14.8 15.0
Choquet integral (CI) 14.6 15.3 15.6 14.6 15.3 16.2
Optimised Choquet integral 14.7 15.2 15.3 14.6 15.2 15.8
LOS 14.7 15.2 15.6 14.9 15.3 16.0
CI with dd densities 14.8 15.2 15.6 14.3 15.1 16.0
BADD 14.7 15.2 15.5 14.2 15.0 15.3
The abbreviation ``dd'' used in the table stands for ``data dependent''.
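The lowest error in Table 10 comes from weighted averaging with data-dependent weights. The paper's own weight-estimation procedure is not reproduced in this excerpt; the sketch below shows one plausible variant in which each classifier is weighted by its accuracy over the k validation samples nearest to the input, in the spirit of the local accuracy estimates of Woods et al. (1997). All names and the nearest-neighbour weighting are illustrative assumptions:

```python
import numpy as np

def combine_weighted(outputs, weights):
    """Soft fusion: weighted average of classifier outputs.
    outputs: (n_classifiers, n_classes) supports for one sample,
    weights: (n_classifiers,) nonnegative data-dependent weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                   # renormalise at each input
    return w @ np.asarray(outputs, dtype=float)

def local_accuracy_weights(x, classifiers, X_val, y_val, k=10):
    """Weight each classifier by its accuracy on the k validation
    samples nearest to x, so the weights depend on the input."""
    dist = np.linalg.norm(X_val - x, axis=1)
    nn = np.argsort(dist)[:k]
    weights = []
    for clf in classifiers:
        preds = clf(X_val[nn])        # clf maps samples -> class labels
        weights.append((preds == y_val[nn]).mean() + 1e-6)  # avoid all-zero
    return np.array(weights)
```

A classifier that is accurate near the current input thus dominates the average there, while a globally strong but locally weak classifier is down-weighted.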
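Zimmermann's compensatory operator, listed in Table 10 and used as the fourth fusion scheme, blends a conjunctive product term with a disjunctive algebraic-sum term. A minimal sketch of the gamma-operator of Zimmermann and Zysno (1980) follows; the parameter values and the optional per-input exponents delta are illustrative, not those estimated in the paper:

```python
def zimmermann_gamma(memberships, gamma=0.5, delta=None):
    """Compensatory gamma-operator of Zimmermann and Zysno:
    (product term)^(1 - gamma) * (algebraic-sum term)^gamma.
    gamma = 0 gives the pure product (intersection-like), gamma = 1
    the pure algebraic sum (union-like); intermediate gamma compensates."""
    if delta is None:
        delta = [1.0] * len(memberships)
    prod_and, prod_not = 1.0, 1.0
    for mu, d in zip(memberships, delta):
        prod_and *= mu ** d            # conjunctive part
        prod_not *= (1.0 - mu) ** d    # complement for the disjunctive part
    return (prod_and ** (1.0 - gamma)) * ((1.0 - prod_not) ** gamma)
```

For two memberships of 0.5 the operator returns 0.25 at gamma = 0, 0.75 at gamma = 1, and a compensated value between the two for intermediate gamma.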
Again, the BADD combination scheme performed well for the large training sets.

5. Discussion and conclusions

We have presented four schemes for soft fusion of the outputs of multiple neural classifiers. The weights assigned to the classifiers or groups of them are data dependent in the first three approaches. The first approach involves the calculation of fuzzy integrals. The second scheme performs weighted averaging with data-dependent weights. The third scheme performs linear combination of the outputs of classifiers via the BADD defuzzification strategy. In the last scheme, the outputs of multiple classifiers are combined using Zimmermann's compensatory operator. An empirical evaluation using widely accessible data sets substantiates the validity of the approaches with data-dependent weights compared to various existing combination schemes of multiple classifiers. The majority rule, the Borda count, combination by averaging, the weighted averaging, the Bayesian combination, the linear combination of order statistics and combination by fuzzy integral have been used for the comparison.

Combination by the weighted averaging with data-dependent weights has shown the best overall performance. Though the Choquet fuzzy integral with data-dependent densities was among the best approaches for network fusion, the overall classification accuracy achieved by using this approach was slightly lower than the accuracy obtained from the weighted averaging with data-dependent weights. For the large training sets, the BADD combination scheme was also among the best approaches. For the small training sets, however, this combination scheme was not a good choice.

Optimisation techniques have been used to estimate the combination weights for the following four approaches: Zimmermann's compensatory operator, the weighted averaging, the optimised Choquet integral and the LOS. The optimised Choquet integral has shown the best overall performance amongst these four approaches. The correct recognition rate obtained from the weighted averaging approach was very close to that obtained from the optimised Choquet integral. On average, the regularisation applied improved the performance of the committees.

Zimmermann's compensatory operator and the LOS operator have shown a slightly lower performance than the optimised Choquet integral and the weighted averaging approach. The optimised Choquet integral yielded a lower error rate than the one with the subjectively determined fuzzy densities. Therefore, we can expect to further improve the approaches with data-dependent weights and fuzzy densities by using optimisation techniques to estimate the combination weights.

Our conclusion about the usefulness of the majority rule, the Borda count, and the Bayesian approaches for fusing networks that are correlated and diverse in performance is negative, especially if there is a clear leader among the networks being combined.

A higher performance of the committee of networks could probably be achieved by using bootstrapping techniques for constructing the training sets. Bootstrapping was not used, however, since the goal was to compare various combining techniques, not to achieve a classification error as small as possible.

Acknowledgements

We gratefully acknowledge the support we have received from The Swedish National Board for Industrial and Technical Development. We also thank two anonymous reviewers for their valuable comments.

References

Amari, S., Murata, N., Muller, K.-R., Finke, M., Yang, H., 1996. Statistical theory of overtraining – Is cross-validation asymptotically effective? In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (Eds.), Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, pp. 176–182.
Battiti, R., Colla, M., 1994. Democracy in neural nets: Voting schemes for classification. Neural Networks 7 (4), 691–707.
Ceccarelli, M., Petrosino, A., 1997. Multi-feature adaptive classifiers for SAR image segmentation. Neurocomputing 14, 345–363.
Chen, W., Gader, P.D., Shi, H., 1997. Improved dynamic programming-based handwritten word recognition using optimal order statistics. In: Proc. of the Internat. Conf. Statistical and Stochastic Methods in Image Processing II. San Diego, pp. 246–256.
Chi, Z., Yan, H., Pham, T.D., 1996. Fuzzy Algorithms: With Applications to Image Processing and Pattern Recognition. World Scientific, Singapore.
Cho, S.B., Kim, J.H., 1995. Combining multiple neural networks by fuzzy integral for robust classification. IEEE Transactions on Systems, Man, and Cybernetics 25 (2), 380–384.
Denoeux, T., 1995. A k-nearest neighbor classification rule based on Dempster–Shafer theory. IEEE Transactions on Systems, Man and Cybernetics – Part B 25 (5), 804–813.
Efron, B., Tibshirani, R., 1993. An Introduction to the Bootstrap. Chapman and Hall, London.
Filev, D., Yager, R.R., 1991. A generalized defuzzification method under BADD distributions. International Journal of Intelligent Systems 6, 687–697.
Gader, P.D., Mohamed, M.A., Keller, J.M., 1996. Fusion of handwritten word classifiers. Pattern Recognition Letters 17, 577–584.
Grabisch, M., 1995. Fuzzy integral in multicriteria decision making. Fuzzy Sets and Systems 69, 279–298.
Grabisch, M., 1996. The representation of importance and interaction of features by fuzzy measures. Pattern Recognition Letters 17, 567–575.
Grabisch, M., Nicolas, J.-M., 1994. Classification by fuzzy integral: Performance and tests. Fuzzy Sets and Systems 65, 255–271.
Hashem, S., 1997. Optimal linear combinations of neural networks. Neural Networks 10 (4), 599–614.
Heskes, T., 1997. Balancing between bagging and bumping. In: Mozer, M.C., Jordan, M.I., Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9. MIT Press, Cambridge, pp. 466–472.
Ho, T.K., Hull, J.J., Srihari, S.N., 1994. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (1), 66–75.
Hocaoglu, A.K., Gader, P.D., 1998. Choquet integral representations of nonlinear filters with applications to LADAR image processing. In: Proceedings of the SPIE Conference Nonlinear Image Processing IX, San Jose, CA, pp. 66–72.
Jacobs, R.A., 1995. Methods for combining experts' probability assessments. Neural Computation 7 (5), 867–888.
Ji, C., Ma, S., 1997. Combined weak classifiers. In: Mozer, M.C., Jordan, M.I., Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9. MIT Press, Cambridge, pp. 494–500.
Jutten, C., Guerin-Dugue, A., Aviles-Cruz, C., Voz, J.L., Van Cappel, D., 1995. ESPRIT basic research project number 6891 ELENA, Enhanced learning for evolutive neural architecture.
Kang, H.-J., Kim, K., Kim, J.H., 1997. Optimal approximation of discrete probability distribution with kth-order dependency and its application to combining multiple classifiers. Pattern Recognition Letters 18, 515–523.
Kittler, J., Hojjatoleslami, A., Windeatt, T., 1997a. Strategies for combining classifiers employing shared and distinct pattern representations. Pattern Recognition Letters 18, 1373–1377.
Kittler, J., Matas, J., Jonsson, K., Ramos Sánchez, M.U., 1997b. Combining evidence in personal identity verification systems. Pattern Recognition Letters 18, 845–852.
Kittler, J., Hatef, M., Duin, R.P.W., Matas, J., 1998. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (3), 226–239.
Krishnapuram, R., Lee, J., 1992. Fuzzy-set-based hierarchical networks for information fusion in computer vision. Neural Networks 4, 335–350.
Krogh, A., Vedelsby, J., 1995. Neural network ensembles, cross validation, and active learning. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (Eds.), Advances in Neural Information Processing Systems 7. MIT Press, Cambridge, pp. 231–238.
Kuncheva, L., 1997. An application of OWA operators to the aggregation of multiple classification decisions. In: Yager, R., Kacprzyk, J. (Eds.), The Ordered Weighted Averaging Operators. Theory and Applications. Kluwer Academic Publishers, Dordrecht, pp. 330–343.
Kuncheva, L., Bezdek, J., Sutton, M., 1998. On combining multiple classifiers by fuzzy templates. In: Proceedings of the 1998 Annual Meeting of the North American Fuzzy Information Processing Society, NAFIPS'98, Pensacola FL, pp. 193–197.
Lam, L., Suen, C.Y., 1995. Optimal combination of pattern classifiers. Pattern Recognition Letters 16, 945–954.
Le Hegarat-Mascle, S., Bloch, I., Vidal-Madjar, D., 1998. Introduction of neighborhood information in evidence theory and application to data fusion of radar and optical images with partial cloud cover. Pattern Recognition 31 (11), 1811–1823.
Merz, C.J., Pazzani, M.J., 1997. Combining neural network regression estimates with regularized linear weights. In: Mozer, M.C., Jordan, M.I., Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9. MIT Press, Cambridge, pp. 564–570.
Mirhosseini, A.R., Yan, H., Lam, K.-M., Pham, T., 1998. Human face recognition: an evidence aggregation approach. Computer Vision and Image Understanding 71 (2), 213–230.
Munro, P.W., Parmanto, B., 1997. Competition among networks improves committee performance. In: Mozer, M.C., Jordan, M.I., Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9. MIT Press, Cambridge, pp. 592–598.
Perrone, M.P., Cooper, L.N., 1993. When networks disagree: Ensemble method for neural networks. In: Mammone, R.J. (Ed.), Neural Networks for Speech and Image Processing. Chapman and Hall, London.
Pham, T., Yan, H., 1997. Fusion of handwritten numeral classifiers based on fuzzy and genetic algorithms. In: Proceedings of the 1997 Annual Meeting of the North American Fuzzy Information Processing Society, NAFIPS'97, New York, pp. 257–262.
Rogova, G., 1994. Combining the results of several neural network classifiers. Neural Networks 7 (5), 777–781.
Shi, H., Gader, P.D., Chen, W., 1998. Fuzzy integral filters: Properties and parallel implementation. Real-Time Imaging 4 (4), 233–241.
Sollich, P., Krogh, A., 1996. Learning with ensembles: How over-fitting can be useful. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (Eds.), Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, pp. 190–197.
Sugeno, M., 1977. Fuzzy measures and fuzzy integrals: A survey. In: Automata and Decision Making. North Holland, Amsterdam, pp. 89–102.
Tahani, H., Keller, J.M., 1990. Information fusion in computer vision using the fuzzy integral. IEEE Transactions on Systems, Man and Cybernetics 20 (3), 733–741.
Taniguchi, M., Tresp, V., 1997. Averaging regularized estimators. Neural Computation 9, 1163–1178.
Tresp, V., Taniguchi, M., 1995. Combining estimators using non-constant weighting functions. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (Eds.), Advances in Neural Information Processing Systems 7. MIT Press, Cambridge.
Verikas, A., Signahl, M., Malmqvist, K., Bacauskiene, M., 1997. Fuzzy committee of experts for segmentation of colour images. In: Proceedings of the Fifth European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany, Vol. 3, pp. 1902–1906.
Verikas, A., Malmqvist, K., Bergman, L., Signahl, M., 1998. Colour classification by neural networks in graphic arts. Neural Computing and Applications 7, 52–64.
Wang, L., Chen, K., Chi, H., 1997. Methods of linear combination based on different features. In: Proceedings of the International Conference on Neural Information Processing, Dunedin, New Zealand, Vol. 2, pp. 1088–1091.
Waterhouse, S., Cook, G., 1997. Ensemble methods for phoneme classification. In: Mozer, M.C., Jordan, M.I., Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9. MIT Press, Cambridge, pp. 800–806.
Woods, K., Kegelmeyer, W.P., Bowyer, K., 1997. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (4), 405–410.
Xu, L., Krzyzak, A., Suen, C.Y., 1992. Methods for combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man, and Cybernetics 22 (3), 418–435.
Yager, R.R., Filev, D.P., 1994. On a flexible structure for fuzzy systems models. In: Yager, R.R., Zadeh, L.A. (Eds.), Fuzzy Sets, Neural Networks, and Soft Computing. Van Nostrand Reinhold, New York.
Zimmermann, H.J., Zysno, P., 1980. Latent connectives in human decision making. Fuzzy Sets and Systems 4 (1), 37–51.
Zimmermann, H.J., Zysno, P., 1984. Decisions and evaluations by hierarchical aggregation of information. Fuzzy Sets and Systems 10 (3), 243–260.