Abstract
This paper presents four schemes for soft fusion of the outputs of multiple classifiers. In the first three approaches, the weights assigned to the classifiers or groups of them are data dependent. The first approach involves the calculation of fuzzy integrals. The second scheme performs weighted averaging with data-dependent weights. The third approach performs linear combination of the outputs of classifiers via the BADD defuzzification strategy. In the last scheme, the outputs of multiple classifiers are combined using Zimmermann's compensatory operator. An empirical evaluation using widely accessible data sets substantiates the validity of the approaches with data-dependent weights, compared to various existing combination schemes of multiple classifiers. © 1999 Elsevier Science B.V. All rights reserved.
0167-8655/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S0167-8655(99)00012-4
A. Verikas et al. / Pattern Recognition Letters 20 (1999) 429–444
Tresp and Taniguchi, 1995) or can express the worth averaged over the entire data space (Sollich and Krogh, 1996; Krogh and Vedelsby, 1995). The use of data-dependent weights, when properly estimated, provides higher classification accuracy (Woods et al., 1997; Verikas et al., 1998).

In practice, outputs of multiple classifiers are usually highly correlated. Therefore, it is desirable to assign weights not only to individual classifiers, but to groups of them also. This would express interaction between different classifiers. Combination schemes based on fuzzy integrals possess this valuable property (Cho and Kim, 1995; Gader et al., 1996; Verikas et al., 1998).

In spite of the variety of combination schemes proposed, only a small amount of effort was made to compare the different approaches using widely available databases. The objective of this work is to present methods for improving the accuracy of multiple neural classifier systems as well as to compare the proposed approaches with various existing combination schemes of multiple classifiers on widely accessible databases. We propose four schemes for soft fusion of outputs of multiple classifiers. The first approach involves the calculation of fuzzy integrals. The second scheme performs weighted averaging with data-dependent weights. The third approach performs linear combination of the outputs of classifiers via the BADD defuzzification (Basic Defuzzification Distributions (Filev and Yager, 1991)) strategy. In the last scheme, the outputs of multiple classifiers are combined using Zimmermann's compensatory operator. We begin with a short description of the databases used. Next, we briefly describe the various combination schemes used for comparison as well as our proposed approaches. Finally, the experimental procedures and results for four different data sets are presented.

2. Data

The databases are available by anonymous ftp at dice.ucl.ac.be in the directory pub/neural/ELENA/databases. From the ELENA project we have chosen two data sets representing artificial data, Clouds and Concentric, and two sets representing real applications, Phoneme and Satimage.

The Clouds data are two-dimensional. There are two a priori equally probable classes in the data set. The graphical representation of the data set is given in Fig. 1. The Clouds data classes are relatively highly overlapped. The class boundaries are highly nonlinear. The class $\omega_0$ is the sum of three different Gaussian distributions

$p(\mathbf{x} \mid \omega_0) = \frac{1}{2}\left[\frac{p_1(\mathbf{x})}{2} + \frac{p_2(\mathbf{x})}{2} + p_3(\mathbf{x})\right]$,  (1)

where

$p_j(\mathbf{x}) = \frac{1}{2\pi \sigma_{jx} \sigma_{jy}} \exp\left(-\frac{(x - m_{jx})^2}{2\sigma_{jx}^2} - \frac{(y - m_{jy})^2}{2\sigma_{jy}^2}\right)$,  (2)

where $m_{jx}$ and $m_{jy}$ are the x and y means of the jth distribution and $\sigma_{jx}$, $\sigma_{jy}$ are the corresponding standard deviations. Table 1 shows the parameters of the distributions of class $\omega_0$.

The class $\omega_1$ is a single Gaussian distribution:

$p(\mathbf{x} \mid \omega_1) = \frac{1}{2\pi} \exp\left(-\frac{x^2 + y^2}{2}\right)$.  (3)

The theoretical error is 9.66%. The error obtained in the ELENA project when using a single
multilayer perceptron (MLP) is 12.3%. In all the experiments presented in the ELENA project, the MLP used contained two hidden layers of 20 and 10 units, respectively.

Table 1
Parameters of the distributions of class $\omega_0$

j    m_jx    m_jy    sigma_jx    sigma_jy
1    0.0     0.0     0.2         0.2
2    0.0     2.0     0.2         0.2
3    2.0     1.0     0.2         1.0

Table 2
Summary of the Satimage database

Class label    Description                     Number of instances    Percentage
1              Red soil                        1533                   23.82
2              Cotton crop                     703                    10.92
3              Grey soil                       1358                   21.10
4              Damp grey soil                  626                    9.73
5              Soil with vegetation stubble    707                    10.99
6              Very damp grey soil             1508                   23.43
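The densities in Eqs. (1)–(3) are easy to sample and to plug into the Bayes rule, which is how the 9.66% theoretical error quoted above can be checked. A minimal numpy sketch; the class-$\omega_0$ mixture weights used here (1/4, 1/4, 1/2) are our reading of Eq. (1) and should be treated as an assumption to verify against the ELENA documentation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Component parameters of class omega_0 (Table 1) and assumed mixture weights.
MEANS = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
STDS = np.array([[0.2, 0.2], [0.2, 0.2], [0.2, 1.0]])
WEIGHTS = np.array([0.25, 0.25, 0.5])  # assumption, read off Eq. (1)

def pdf0(x):
    """Class omega_0 density, Eqs. (1) and (2)."""
    comps = [np.exp(-0.5 * np.sum(((x - m) / s) ** 2, axis=1))
             / (2.0 * np.pi * s[0] * s[1]) for m, s in zip(MEANS, STDS)]
    return WEIGHTS @ np.asarray(comps)

def pdf1(x):
    """Class omega_1 density, Eq. (3): a standard bivariate normal."""
    return np.exp(-0.5 * np.sum(x ** 2, axis=1)) / (2.0 * np.pi)

def sample_clouds(n, rng):
    """Draw n labelled points; the two classes are a priori equally probable."""
    labels = rng.integers(0, 2, size=n)
    x = np.empty((n, 2))
    m0 = labels == 0
    comp = rng.choice(3, size=int(m0.sum()), p=WEIGHTS)
    x[m0] = rng.normal(MEANS[comp], STDS[comp])
    x[~m0] = rng.normal(0.0, 1.0, size=(int((~m0).sum()), 2))
    return x, labels

X, y = sample_clouds(100_000, rng)
bayes_pred = (pdf1(X) > pdf0(X)).astype(int)  # Bayes rule with equal priors
print("estimated Bayes error:", np.mean(bayes_pred != y))
```

If the mixture-weight assumption is right, the printed estimate should sit near the quoted 9.66%; a noticeably different value would point to different weights in the original Eq. (1).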
The Concentric data are two-dimensional with two classes and uniform concentric circular distributions. The points of the class $\omega_0$ are uniformly distributed in a circle of radius 0.3 centred on (0.5, 0.5). The points of the class $\omega_1$ are uniformly distributed in a ring centred on (0.5, 0.5) with internal and external radii equal to 0.3 and 0.5, respectively. There are 2500 instances, with 1579 in class $\omega_1$ and the remainder in class $\omega_0$. There is no overlap between the classes. The theoretical error is 0%. The mean error reported in the ELENA project is 2.8%. The graphical representation of the Concentric data set is given in Fig. 2.

The Satimage database was generated from Landsat Multi-Spectral Scanner image data. One frame of the Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. The spatial resolution of a pixel is about 80 m × 80 m. Each image contains 2340 × 3380 such pixels, with 0 corresponding to black and 255 to white. The Satimage database is a sub-area of a scene, consisting of 82 × 100 pixels. The database contains 6435 36-dimensional patterns (4 spectral bands × 9 pixels in a neighbourhood). There are six classes in the database. Table 2 gives a summary of the classes. The database also contains a five-dimensional description of the data. The five dimensions were obtained by using discriminant factorial analysis. The five-dimensional description of the Satimage data was used in this study.

The Phoneme database. The aim of this database is to distinguish between nasal and oral vowels. Thus, there are two different classes: the Nasals in class $\omega_0$ and the Orals in class $\omega_1$. This database contains vowels coming from 1809 isolated syllables. Five features were chosen to characterise each vowel. The features are the amplitudes of the five first harmonics normalised by the total energy integrated over all the frequencies. Each harmonic is signed either positive, when it corresponds to a local maximum of the spectrum, or negative otherwise (Jutten et al., 1995). There are 3818 patterns from class $\omega_0$ and 1586 patterns from class $\omega_1$.
The data sets used are summarised in Table 3. The ELENA errors presented in the table are taken from the ELENA project. The errors presented are the average errors obtained in the ELENA project while using an MLP of the above-mentioned structure.

Table 3
Summary of the data sets used

Data set      Number of classes    Number of features    Number of samples    ELENA error %    Bayes error %
Clouds        2                    2                     5000                 12.3 ± 1.1       9.66
Concentric    2                    2                     2500                 2.8 ± 0.7        0
Phoneme       2                    5                     5404                 16.4 ± 1.3       Not available
Satimage      6                    5                     6435                 11.9 ± 1.0       Not available
$\mathrm{Bel}(i) = \frac{\prod_{k=1}^{L} P(\mathbf{x} \in q_i \mid \lambda_k(\mathbf{x}) = j_k)}{\sum_{i=1}^{Q} \prod_{k=1}^{L} P(\mathbf{x} \in q_i \mid \lambda_k(\mathbf{x}) = j_k)}, \quad 1 \le i \le Q.$  (8)

The final decision is then made according to the following rule:

$\mathbf{x} \in q_i, \ \text{if } \mathrm{Bel}(i) \ge \mathrm{Bel}(j), \ \forall j \ne i.$  (9)

3.5. Weighted averaging

For notational convenience, we consider networks with a single output $y_i$, although the generalisation to several outputs is straightforward. We denote the true output of a network by $d(\mathbf{x})$. Thus, the actual output of each network can be written as the desired output plus an error:

$y_i(\mathbf{x}) = d(\mathbf{x}) + \epsilon_i(\mathbf{x}).$  (10)

The average squared error for the ith network can be written as

$e_i = E\left[\{y_i(\mathbf{x}) - d(\mathbf{x})\}^2\right] = E\left[\epsilon_i^2\right],$  (11)

where $E$ denotes the expectation.

A weighted combination of outputs of a set of $L$ networks (a committee output) can be written as

$\bar{y}(\mathbf{x}) = \sum_{i=1}^{L} w_i y_i(\mathbf{x}) = d(\mathbf{x}) + \sum_{i=1}^{L} w_i \epsilon_i(\mathbf{x}),$  (12)

where the weights $w_i$ need to be determined. If $C$ is the error correlation matrix given by

$c_{ij} = E\left[\epsilon_i(\mathbf{x})\,\epsilon_j(\mathbf{x})\right] = \frac{1}{N} \sum_{n=1}^{N} \left(y_i(\mathbf{x}_n) - d(\mathbf{x}_n)\right)\left(y_j(\mathbf{x}_n) - d(\mathbf{x}_n)\right),$  (13)

where $N$ is the number of samples, the error due to the committee can be written as

$e = \sum_{i=1}^{L} \sum_{j=1}^{L} w_i w_j c_{ij}.$  (14)

Optimal values for the weights $w_i$ can be determined by the minimisation of $e$. A non-trivial minimum can be found by requiring $\sum_{i=1}^{L} w_i = 1$. The solution for the $w_i$ is

$w_i = \frac{\sum_{j=1}^{L} (C^{-1})_{ij}}{\sum_{k=1}^{L} \sum_{j=1}^{L} (C^{-1})_{kj}}.$  (15)

One problem with the constraint imposed is that it does not prevent the weights from adopting large negative or positive values. The inverse of $C$ can be unstable. Redundancy in members of a committee leads to linear dependencies in the rows and columns of $C$. We might, therefore, seek to constrain the weights further by requiring that $w_i \ge 0, \ \forall i = 1, \ldots, L$.

The generalisation error of a committee can be decomposed into the sum of two terms (Krogh and Vedelsby, 1995):

$e = E\left[\{\bar{y}(\mathbf{x}) - d(\mathbf{x})\}^2\right] = \sum_i w_i E\left[\{y_i(\mathbf{x}) - d(\mathbf{x})\}^2\right] - \sum_i w_i E\left[\{y_i(\mathbf{x}) - \bar{y}(\mathbf{x})\}^2\right].$  (16)

The first term depends only on the errors of individual networks, while the second term depends on the spread of outputs of the committee members. We see that the committee error will decrease if we can increase the spread of outputs of the committee members without increasing the errors of the members themselves. Using the fact that the weights sum to one, Eq. (16) can be written as (Krogh and Vedelsby, 1995)

$e = \sum_k w_k e_k + \sum_{k,l} w_k w_l c_{kl} - \sum_k w_k c_{kk},$  (17)

where $e_k$ is the estimate of the error of the kth network. Using estimates of the errors of
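Eq. (15) amounts to normalised row sums of $C^{-1}$; a toy numeric sketch with simulated, deliberately correlated network errors (illustrative data, not the authors' experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated residual errors eps_i(x_n) of L = 3 networks on N samples;
# a shared noise term makes the members correlated, as in practice.
N, L = 1000, 3
common = rng.normal(0.0, 0.3, size=(N, 1))
errors = common + rng.normal(0.0, [0.1, 0.2, 0.3], size=(N, L))

C = errors.T @ errors / N            # error correlation matrix, Eq. (13)
Cinv = np.linalg.inv(C)
w = Cinv.sum(axis=1) / Cinv.sum()    # optimal weights, Eq. (15); sum to one

committee_err = errors @ w           # combined error term of Eq. (12)
print("weights:", w)
print("committee MSE:", (committee_err ** 2).mean())
print("member MSEs:  ", (errors ** 2).mean(axis=0))
```

Because Eq. (15) minimises $\mathbf{w}^\top C \mathbf{w}$ subject only to the weights summing to one, the committee mean-squared error is never above the best member's on the data used to estimate $C$; note also that nothing here keeps the weights non-negative, which is exactly the instability discussed after Eq. (15).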
3.6. Combination via fuzzy integral

3.6.1. Background on fuzzy integral and fuzzy measure

Definition 1. A set function $g: 2^Z \to [0, 1]$ is a fuzzy measure if
1. $g(\emptyset) = 0$, $g(Z) = 1$;
2. if $A, B \in 2^Z$ and $A \subset B$, then $g(A) \le g(B)$;
3. if $A_n \in 2^Z$ for $1 \le n < \infty$ and the sequence $\{A_n\}$ is monotone in the sense of inclusion, then $\lim_{n \to \infty} g(A_n) = g(\lim_{n \to \infty} A_n)$.

In general, the fuzzy measure of a union of two disjoint subsets cannot be directly computed from the fuzzy measures of the subsets. Sugeno (1977) introduced the decomposable, so-called $\lambda$-fuzzy measure satisfying the following additional property:

$g(A \cup B) = g(A) + g(B) + \lambda g(A) g(B)$  (19)

for all $A, B \subset Z$ with $A \cap B = \emptyset$, and for some $\lambda > -1$.

Let $Z = \{z_1, z_2, \ldots, z_L\}$ be a finite set (a set of committee members in our case) and let $g_i = g(\{z_i\})$. The values $g_i$ are called the densities of the measure. The value of $\lambda$ is found from the equation $g(Z) = 1$, which is equivalent to solving the following equation:

$\lambda + 1 = \prod_{i=1}^{L} (1 + \lambda g_i).$  (20)

When $g$ is the $\lambda$-fuzzy measure, the values of $g(A_i)$ can be computed recursively as follows:

$g(A_L) = g_L,$  (21)

$g(A_i) = g_i + g(A_{i+1}) + \lambda g_i g(A_{i+1}), \quad 1 \le i < L,$  (22)

since $A_i = \{z_i\} \cup A_{i+1}$ and the two sets are disjoint. The discrete Choquet integral of a function $h: Z \to [0, 1]$ with respect to $g$ is then

$C_g = \sum_{i=1}^{L} [h(z_i) - h(z_{i-1})] g(A_i),$  (23)

where indices $i$ have been permuted so that $0 \le h(z_1) \le \cdots \le h(z_L) \le 1$, $A_i = \{z_i, \ldots, z_L\}$ and $h(z_0) = 0$.

3.6.2. Combination by Choquet integral

We adopted Sugeno's $\lambda$-fuzzy measure and assigned the fuzzy densities $g_i$, the degrees of importance of each network, based on the performance of the networks on validation data. We computed the densities as follows:

$g_i = \frac{p_i}{\sum_{j=1}^{L} p_j}\, d_S,$  (24)

where $p_i$ is the performance of the ith network and $d_S$ is the desired sum of fuzzy densities. We assumed that committee members have $Q$ outputs representing $Q$ classes, and data point $\mathbf{x}$ needs to be assigned to one of the classes. The class label $c$ for the data point $\mathbf{x}$ is then determined as follows:

$c = \arg\max_{q=1,\ldots,Q} C_g(q),$  (25)

where $C_g(q)$ is the Choquet integral for the class $q$. The values of the function $h(z)$ that appear in the Choquet integral are given by the output values of the members of the committee (the evidence provided by the members).

3.7. Combination by fuzzy integral with data-dependent densities

We use Sugeno's $\lambda$-fuzzy measure and the Choquet integral. This approach assumes that the data space is partitioned into $K$ regions with
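Eqs. (20) and (23)–(25) translate directly into code. A sketch with illustrative performance values; the $g(A_i)$ recursion is the standard one implied by Eq. (19), since $A_i = \{z_i\} \cup A_{i+1}$:

```python
import numpy as np

def solve_lambda(g, n_iter=200):
    """Solve Eq. (20), prod(1 + lam*g_i) = 1 + lam, for the Sugeno lambda."""
    g = np.asarray(g, float)
    f = lambda lam: np.prod(1.0 + lam * g) - (1.0 + lam)
    if abs(g.sum() - 1.0) < 1e-12:
        return 0.0                          # additive case
    # The non-zero root is positive if sum(g) < 1, and in (-1, 0) otherwise.
    lo, hi = (1e-9, 1e6) if g.sum() < 1.0 else (-1.0 + 1e-6, -1e-9)
    for _ in range(n_iter):                 # plain bisection
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def choquet(h, g):
    """Discrete Choquet integral, Eq. (23), under the lambda-fuzzy measure."""
    h, g = np.asarray(h, float), np.asarray(g, float)
    lam = solve_lambda(g)
    order = np.argsort(h)                   # permute so h(z_1) <= ... <= h(z_L)
    h, g = h[order], g[order]
    gA = np.empty_like(g)                   # g(A_i) for A_i = {z_i, ..., z_L}
    gA[-1] = g[-1]
    for i in range(len(g) - 2, -1, -1):     # recursion implied by Eq. (19)
        gA[i] = g[i] + gA[i + 1] + lam * g[i] * gA[i + 1]
    h_prev = np.concatenate(([0.0], h[:-1]))
    return float(np.sum((h - h_prev) * gA))

# Densities as in Eq. (24): performances p_i scaled to a desired sum d_S.
p, d_S = np.array([0.88, 0.92, 0.85]), 1.2
g = p / p.sum() * d_S
print(choquet([0.2, 0.5, 0.9], g))          # fused support for one class
```

Per Eq. (25), this fused support would be computed once per class and the class with the largest value chosen.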
where $C_g^k(q)$ is the Choquet integral for class $q$ calculated in region $k$. The index of the region is determined in the following way:

$k = \arg\min_{i=1,\ldots,K} d(\mathbf{x}, \mathbf{v}_i),$  (28)

with $d(\mathbf{x}, \mathbf{v}_i)$ being the Euclidean distance between the data point $\mathbf{x}$ and the vector $\mathbf{v}_i$ representing the ith region.

3.7.1. Frequency-sensitive competitive learning

The learning algorithm is encapsulated in the following six steps.
1. Initialise the weight vectors $\mathbf{v}_i(0)$ of all the $K$ nodes with small random values.
2. For a given data point $\mathbf{x}$ find the weight vector $\mathbf{v}_j(t)$ yielding a minimum distance:

$d(\mathbf{x}, \mathbf{v}_j) = \min_{i} \|\mathbf{v}_i(t) - \mathbf{x}\|^2, \quad i = 1, 2, \ldots, K.$  (29)

3. Calculate

The data-dependent weights are determined in the same way as the data-dependent densities for the Choquet integral. The output of the combiner is given by

$q = \arg\max_{j=1,\ldots,Q} \sum_{i=1}^{L} w_{ik} y_{ij},$  (34)

where $y_{ij}$ is the jth output of the ith network and the weights $w_{ik}$ are calculated in the same way as $g_{ij}$ in Eq. (26). The region index $k$ is given by Eq. (28).

3.9. Combination by the BADD defuzzification strategy

This approach assumes that the data space is partitioned into $L$ overlapping regions and each network is associated with one of the regions. Decisions from the networks are now combined according to the following rule:
$\bar{y}(\mathbf{x}) = \frac{\sum_{i=1}^{L} \mu_i(\mathbf{x})^{\delta}\, y_i(\mathbf{x})}{\sum_{i=1}^{L} \mu_i(\mathbf{x})^{\delta}},$  (35)

where $y_i(\mathbf{x})$ is the output of the ith network, $\delta$ is a parameter, and $\mu_i(\mathbf{x})$ is the membership degree of a given $\mathbf{x}$ in the ith region of the data space. The membership degrees are calculated in the following way:

$\mu_i(\mathbf{x}) = \frac{1}{(1 + d(\mathbf{x}, \mathbf{w}_i))^p},$  (36)

where $\mathbf{w}_i$ is the weight vector representing the ith region of the data space, $p$ is a constant, and $d(\mathbf{x}, \mathbf{w}_i)$ is the Euclidean distance between the vectors $\mathbf{x}$ and $\mathbf{w}_i$.

This combination scheme actually is the BADD defuzzification strategy. The parameter $\delta$ determines the type of defuzzification (Yager and Filev, 1994). The arithmetic mean, the centre of area and the mean of maximum combination schemes are obtained by changing the parameter $\delta$. After some experiments, the parameter $\delta$ has been chosen to be $\delta = 3$.

3.10. Combination by Zimmermann's compensatory operator

Fuzzy integrals are weighted aggregation operators whose weights are defined not only on the different elements being aggregated, but also on all subsets of them. This thus allows the representation of redundancy and support among the elements. It is interesting, therefore, to compare aggregation by fuzzy integrals with aggregation by fuzzy operators of another type. We have chosen Zimmermann's compensatory operator for this purpose (Zimmermann and Zysno, 1980, 1984):

$\bar{y}(\mathbf{x}) = \left(\prod_{i=1}^{L} y_i(\mathbf{x})^{w_i}\right)^{1-\gamma} \left(1 - \prod_{i=1}^{L} (1 - y_i(\mathbf{x}))^{w_i}\right)^{\gamma},$  (37)

where $0 \le \gamma \le 1$ and $\sum_{i=1}^{L} w_i = L$; the parameters are estimated by fitting the combined output to the desired outputs. The constraints on $\gamma$ and $w_i$ can be eliminated by modifying the definition of the parameters as follows (Krishnapuram and Lee, 1992):

$\gamma = \frac{a^2}{a^2 + b^2},$  (38)

$w_i = \frac{L d_i^2}{\sum_{k=1}^{L} d_k^2}.$  (39)

Now $a$, $b$ and $d_i$ can be chosen without any constraints.

3.11. Optimising the fuzzy measure

The parameters of the $\lambda$-fuzzy measure we used in Sections 3.6 and 3.7 were subjectively estimated from the performance of the networks without further optimisation. Such an approach to the estimation of the parameters of the $\lambda$-fuzzy measure is often used in practice (Cho and Kim, 1995; Chi et al., 1996). As pointed out by one of the reviewers of the manuscript, the comparison between the weighted averaging and the Choquet integral would be more level if the Choquet integral were optimised in a fashion similar to that used for the weighted averaging. It is interesting to compare the weighted averaging and the fuzzy integral on an equal footing, since the weighted averaging is in fact the fuzzy integral with a special kind of fuzzy measure. The discrete Choquet integral can be written as

$C_g = \sum_{i=1}^{L} h(z_i)\left\{g(A_i) - g(A_{i+1})\right\},$  (40)

where $g(A_{L+1}) = 0$. The fuzzy measure can be chosen so that the measure of a set depends only on the size of the set (Chen et al., 1997):

$g(A_i) = 1 - \sum_{j=1}^{i-1} w_j.$  (41)
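With a size-dependent measure, the increments $g(A_i) - g(A_{i+1})$ collapse to single weights, so the Choquet integral of Eq. (40) becomes a weighted sum of the sorted inputs. A small sketch with illustrative weights:

```python
import numpy as np

def los(z, w):
    """Linear combination of order statistics: sort z ascending, weight by w."""
    return float(np.dot(np.sort(z), w))

def choquet_size_measure(h, w):
    """Choquet integral, Eq. (40), with a size-dependent measure
    g(A_i) = 1 - sum_{j<i} w_j; with weights summing to one,
    g(A_i) - g(A_{i+1}) reduces to w_i."""
    h = np.sort(h)                       # permute so h(z_1) <= ... <= h(z_L)
    gA = 1.0 - np.concatenate(([0.0], np.cumsum(w)[:-1]))
    gA_next = np.concatenate((gA[1:], [0.0]))   # g(A_{L+1}) = 0
    return float(np.sum(h * (gA - gA_next)))

w = np.array([0.1, 0.2, 0.3, 0.4])       # weights summing to one
h = np.array([0.9, 0.2, 0.6, 0.4])       # committee outputs for one class
print(los(h, w), choquet_size_measure(h, w))   # the two coincide
```

The coincidence of the two values is the point made in the text: a Choquet integral whose measure depends only on set size is exactly a weighted combination of order statistics.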
Let $\mathbf{z} = (z_1, z_2, \ldots, z_L)$ be a vector. The ith order statistic $z_{(i)}$ of $\mathbf{z}$ is the ith smallest element of $\mathbf{z}$, where $z_{(1)} \le z_{(2)} \le \cdots \le z_{(L)}$. Let $\mathbf{w} = (w_1, w_2, \ldots, w_L)$ be a weight vector constrained so that $\sum_{i=1}^{L} w_i = 1$ and $0 \le w_i \le 1, \ \forall i = 1, 2, \ldots, L$. The linear combination of order statistics of $\mathbf{z} = (z_1, \ldots, z_L)$ with the weight vector $\mathbf{w} = (w_1, \ldots, w_L)$ is defined as (Chen et al., 1997)

$\mathrm{LOS}(\mathbf{z}; \mathbf{w}) = \sum_{i=1}^{L} w_i z_{(i)}.$  (44)

Thus, the weighted averaging and the LOS operators are in fact Choquet integral operators. Therefore, we included the optimised version of the Choquet integral with the $\lambda$-fuzzy measure and the LOS operator in our comparisons. The optimisation of the Choquet integral has been done in the same manner as that of the weighted averaging approach. The optimal weights for the LOS operator are found by minimising the following quantity:

$\sum_{n=1}^{N} \sum_{j=1}^{Q} \left\{\sum_{i=1}^{L} w_i z_{j(i)}(\mathbf{x}_n) - d_j(\mathbf{x}_n)\right\}^2,$  (45)

where $N$ is the number of data samples, $Q$ is the number of classes, and $z$ and $d$ stand for the neural network output value and the target value, respectively. This is a quadratic objective function with linear constraints. The minimisation is done using quadratic programming (Grabisch and Nicolas, 1994).

4. Experimental testing

All comparisons between different classifiers in the ELENA project have been done using the Holdout (HO) method with equal training and testing parts of the data. To make the comparisons feasible, we have also used the same method to estimate the classification accuracy.

In all the tests presented here, we train five one-hidden-layer MLPs with 10 sigmoidal hidden units using the error-backpropagation algorithm. We run each experiment eight times, and the min, mean and max errors presented are calculated from these eight trials. We divide the learning set into five parts when training members of the committee. We use one data part, which is different for each member of the committee, for ``early stopping''. The other parts are used for training. Therefore, the members are trained on partially disjoint learning sets. We use the entire HO training part of the data to estimate the combination weights or fuzzy densities. Ten reference points are used to compute data-dependent densities and weights. We present the combination results for two types of training: with early stopping (ES), the middle three columns of the tables given below; and saturated learning (SL), the three columns on the right-hand side of the tables.

4.1. The Clouds data

For the Clouds data, training of each MLP was repeated 10 times with different initialisations. The best outcome of the 10 trials was used when testing the combination schemes. Therefore, the result presented in Tables 4 and 5 as ``The best single NN'' is actually the best result selected from the 50 trials (10 trials for each NN × 5). Due to such a careful selection, the members of the committee were very even in their performances. Table 4 presents performance evaluation results of various schemes of combination for the Clouds data, when 2000 and 500 data points were included in the learning and early stopping sets (if used), respectively.

As can be seen from Table 4, there is no improvement in the classification accuracy when combining regularised (by early stopping) networks. Only a combination by the BADD defuzzification strategy gives a little rise in classification accuracy. Encountering such behaviour of the combined network, one can expect the outputs of the separate networks to be highly correlated. The computed mean value of the correlation coefficients between the outputs of the different networks substantiated that this is indeed the case. The mean value of the correlation coefficient was found to be $\rho = 0.787$. When combining networks by the BADD defuzzification strategy, we train separate networks in the different regions of the input space. This leads to more diverse outputs of the networks. Therefore, a slight
Table 4
Performance of various schemes of combination for the Clouds data when training sets consist of 2000 instances (left three error columns: training with early stopping; right three: saturated learning)
Classification schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 10.7 11.0 11.2 11.0 11.2 11.3
Majority rule 11.0 11.1 11.3 10.9 11.0 11.3
Borda count 11.0 11.1 11.3 10.9 11.0 11.3
Averaging 11.0 11.1 11.2 10.9 11.0 11.5
Bayesian 10.6 11.0 11.3 10.8 10.9 11.1
Zimmermann's operator 10.5 11.1 11.2 10.9 11.1 11.2
Weighted averaging (WA) 10.6 10.9 11.1 10.8 11.0 11.2
WA with dd weights 10.6 11.0 11.2 10.7 11.2 11.4
Choquet integral (CI) 10.8 11.1 11.2 10.9 11.0 11.2
Optimised Choquet integral 10.7 11.0 11.1 10.8 11.0 11.2
LOS 10.7 11.0 11.1 10.8 11.0 11.4
CI with dd densities 10.5 11.0 11.2 10.9 11.1 11.4
BADD 10.7 10.8 10.9 10.8 11.2 11.5
The abbreviation ``dd'' used in the table stands for ``data dependent''.
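The ``Zimmermann's operator'' rows above aggregate the five network outputs per class according to Eqs. (37)–(39). A minimal sketch with illustrative values (not the experimental parameters):

```python
import numpy as np

def zimmermann(y, a, b, d):
    """Compensatory aggregation, Eq. (37); gamma and w_i are built from the
    unconstrained parameters a, b, d via Eqs. (38)-(39)."""
    y = np.clip(np.asarray(y, float), 1e-12, 1.0 - 1e-12)
    d = np.asarray(d, float)
    gamma = a**2 / (a**2 + b**2)                 # Eq. (38), always in [0, 1]
    w = len(d) * d**2 / np.sum(d**2)             # Eq. (39), sums to L
    conj = np.prod(y ** w)                       # conjunctive (and-like) part
    disj = 1.0 - np.prod((1.0 - y) ** w)         # disjunctive (or-like) part
    return conj ** (1.0 - gamma) * disj ** gamma

# Five network outputs for one class; a = b gives gamma = 0.5,
# i.e. equal compensation between the two parts.
print(zimmermann([0.6, 0.7, 0.8, 0.55, 0.9], a=1.0, b=1.0, d=np.ones(5)))
```

Setting $a = 0$ recovers the pure product (fully conjunctive) aggregation, and $b = 0$ the pure algebraic sum, which is how the operator interpolates between and-like and or-like behaviour.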
Table 5
Performance of various combination schemes for the Clouds data when training sets consist of 150 instances (left three error columns: training with early stopping; right three: saturated learning)
Classification schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 12.8 13.1 13.8 12.5 13.7 15.9
Majority rule 12.0 12.3 12.6 12.0 12.7 13.5
Borda count 12.0 12.3 12.6 12.0 12.7 13.5
Averaging 11.8 12.0 12.9 11.8 12.5 13.0
Bayesian 12.0 12.3 12.6 11.9 12.7 13.3
Zimmermann's operator 11.5 12.0 12.5 12.1 13.0 14.0
Weighted averaging (WA) 11.6 12.1 12.6 12.2 12.7 13.3
WA with dd weights 11.3 11.5 11.8 11.4 12.0 12.5
Choquet integral (CI) 11.4 11.8 12.4 11.9 12.5 13.4
Optimised Choquet integral 11.4 11.8 12.3 11.9 12.5 13.3
LOS 11.6 12.2 12.9 12.0 12.4 13.0
CI with dd densities 10.9 11.4 12.1 11.5 11.9 12.2
BADD 11.7 12.1 12.4 13.0 14.1 15.6
The abbreviation ``dd'' used in the table stands for ``data dependent''.
improvement in classification accuracy is obtained. On the other hand, the obtained classification error is quite close to the theoretical one. Therefore, any improvements are hardly achievable. The comparison of the middle and the right-hand parts of Table 4 shows that there is only a small difference between the results of the early-stopped and the saturated learning. This is not surprising, since the number of training samples exceeds the number of weights nearly 40 times. Amari et al. (1996) point out that, for large networks, if the number of training samples exceeds the number of weights more than 30 times, early stopping is not effective. The mean value of the correlation coefficient for the saturated learning was $\rho = 0.938$. Thus, in the SL case, the networks converge almost to the same solution.

In practice, learning sets are usually very limited. Therefore, in the next test, we reduced the learning sets to 150 samples. Table 5 summarises the results of the test. As can be seen from the table, all the combination schemes improve classification performance when early stopping is applied. In the saturated learning case, the usefulness of the BADD combination scheme can be a point of contention. The BADD training scheme divides the input space into several regions. It seems that for the small training sets, reduction of
the size of the training regions may deteriorate the ability of the networks to generalise.

The mean values of the correlation coefficient between the outputs of the different networks were $\rho = 0.839$ and $\rho = 0.674$ for the ES and SL cases, respectively. If the values of the correlation coefficients for the ES and SL cases are compared, it is possible to see that, for the large training set, SL increases the correlation between the networks. For the small training set, by contrast, SL decreases the correlation. Combining less correlated networks may increase the performance of the committee. For this data, however, this increase does not compensate for the performance decrease that occurs due to over-training.

Table 5 shows that the proposed approaches for combining multiple networks (the shadowed lines in the table) yielded the best performances. The performances achieved can be compared with the 12.3% error rate obtained in the ELENA project. Note that the network used in the ELENA project had 292 weights and was trained on 2500 data points. By contrast, each network in this study was trained on 150 data points and had 52 weights (260 weights in total for the five networks).

4.2. The Concentric data

For the Concentric data, by contrast, only one trial was performed when training each MLP. Therefore, members of the committee were much more diverse in their performances for the Concentric data than for the Clouds data. The best member of the committee performed much better than the others. Tables 6 and 7 summarise the results of the tests for the Concentric data when learning sets contained 1000 and 100 training samples, respectively. As can be seen from Tables 6 and 7, a committee of diverse members combined by the Majority rule or the Borda count performs much worse than the best member of the committee. For the large training sets, a very low classification error was obtained from the best single network. Therefore, we can expect that combination schemes that use optimisation techniques to estimate the combination weights will perform better than those without any optimisation. Table 6 shows that this is the case. The optimised Choquet integral, the LOS operator and the weighted averaging approach yielded the best performance. The BADD combination scheme, again, performs well for the regularised networks trained on the large learning sets. For the small training sets, the proposed combination schemes with data-dependent weights are, once again, among the best approaches for network fusion (the values in bold in the tables). The obtained performance of the committee can be compared with the 2.8 ± 0.7% error rate achieved in the ELENA project using an MLP of the above-mentioned structure. For the Concentric data, the saturated learning altered the correlation between the outputs of the different networks in the same way as for the Clouds data. The saturated learning increased the correlation coefficient for the large training sets and decreased the coefficient for the small training sets.

4.3. The Satimage data

There are six classes of data with significantly differing numbers of instances in the Satimage database. The members of the committee trained on the large data sets were also rather diverse in their performances. The accuracy of the leader was much higher than the accuracy of the other members of the committee. We included an equal number of instances from the different classes (25) when constructing the small learning sets. For the small training sets, the performances of the committee members were less diverse than for the large training sets.

The performance evaluation results for the Satimage data are given in Tables 8 and 9. As can be seen from Table 8, early stopping is again not effective. Over-training is hardly observed on the best single network. Moreover, except for the Majority rule, all the combination schemes perform better when combining the ``over-trained'' networks. Note that the saturated learning has raised the correlation between the different networks for the large training sets and has lowered the correlation for the small training sets. The results obtained once again point out that the Majority rule and the Borda count are
Table 6
Performance of various combination schemes for the Concentric data when training sets consist of 1000 samples (left three error columns: training with early stopping; right three: saturated learning)
Combination schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 0.3 0.5 0.7 0.3 0.6 0.8
Majority rule 0.3 1.9 8.0 0.3 2.2 9.3
Borda count 0.3 1.9 8.0 0.3 2.2 9.3
Averaging 0.2 0.4 0.7 0.2 0.5 0.8
Bayesian 0.3 1.7 7.0 0.2 1.8 6.9
Zimmermann's operator 0.3 0.4 0.5 0.3 0.4 0.8
Weighted averaging (WA) 0.3 0.4 0.5 0.3 0.4 0.5
WA with dd weights 0.3 0.6 0.9 0.3 0.5 0.9
Choquet integral (CI) 0.3 0.5 0.7 0.4 0.5 0.8
Optimised Choquet integral 0.2 0.4 0.5 0.3 0.4 0.5
LOS 0.2 0.4 0.5 0.2 0.4 0.6
CI with dd densities 0.3 0.4 0.6 0.3 0.5 0.7
BADD 0.2 0.4 0.6 0.3 0.6 0.9
The abbreviation ``dd'' used in the table stands for ``data dependent''.
Table 7
Performance of various combination schemes for the Concentric data when training sets consist of 100 samples (left three error columns: training with early stopping; right three: saturated learning)
Combination schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 2.2 3.1 3.3 3.0 4.1 5.2
Majority rule 2.0 9.7 36.3 2.0 7.0 23.7
Borda count 2.0 9.7 36.3 2.0 7.0 23.7
Averaging 2.1 2.7 3.3 2.1 2.4 3.2
Bayesian 2.0 3.7 5.8 2.2 8.0 23.4
Zimmermann's operator 1.3 2.3 3.0 1.9 2.4 2.8
Weighted averaging (WA) 1.3 2.1 2.8 1.8 2.3 2.6
WA with dd weights 1.0 1.5 2.0 1.6 2.0 2.2
Choquet integral (CI) 1.6 2.5 3.1 2.0 2.6 4.0
Optimised Choquet integral 1.4 2.1 2.9 1.8 2.3 2.6
LOS 1.9 2.6 3.2 2.2 2.5 3.1
CI with dd densities 1.2 2.1 2.5 1.5 2.0 2.4
BADD 3.2 3.5 4.4 3.7 4.5 5.4
The abbreviation ``dd'' used in the table stands for ``data dependent''.
Table 8
Performance for the Satimage data when training sets consist of 2500 samples (left three error columns: training with early stopping; right three: saturated learning)
Classification schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 12.2 14.1 20.0 12.5 14.0 20.3
Majority rule 13.3 15.3 18.6 13.7 18.4 30.1
Borda count 15.6 19.2 27.5 15.4 18.5 24.8
Averaging 12.0 13.1 15.1 11.7 12.5 13.9
Bayesian 12.3 14.2 19.0 12.5 12.7 12.9
Zimmermann's operator 11.9 13.0 14.2 11.6 11.8 12.1
Weighted averaging (WA) 12.2 12.7 13.8 11.6 12.1 12.7
WA with dd weights 11.8 12.1 12.4 11.6 11.8 12.0
Choquet integral (CI) 11.9 12.2 12.8 11.6 11.7 12.0
Optimised Choquet integral 11.9 12.2 12.8 11.4 11.7 12.0
LOS 11.9 12.5 14.0 11.5 11.9 12.6
CI with dd densities 11.9 12.1 12.8 11.3 11.7 12.1
BADD 12.1 12.5 13.2 11.3 11.6 12.0
The abbreviation ``dd'' used in the table stands for ``data dependent''.
Table 9
Performance of various combination schemes for the Satimage data when training sets consist of 150 samples (left three error columns: training with early stopping; right three: saturated learning)
Classification schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 14.1 14.4 15.0 16.0 16.5 17.0
Majority rule 12.9 13.9 15.3 13.7 14.1 14.4
Borda count 13.0 14.8 18.0 14.0 14.4 14.8
Averaging 12.4 13.4 14.8 13.1 13.5 13.7
Bayesian 13.5 14.8 15.7 14.5 15.4 15.8
Zimmermann's operator 12.3 13.1 14.6 13.0 13.3 13.9
Weighted averaging (WA) 12.6 13.3 14.8 12.8 13.4 13.8
WA with dd weights 12.2 12.8 13.8 12.5 13.0 13.5
Choquet integral (CI) 12.8 13.8 15.2 13.4 13.8 13.9
Optimised Choquet integral 12.6 13.3 14.8 12.8 13.3 13.8
LOS 12.6 13.4 14.5 12.9 13.4 13.7
CI with dd densities 12.3 13.3 14.1 13.4 13.6 13.7
BADD 13.5 14.0 15.0 13.4 14.1 14.8
The abbreviation ``dd'' used in the table stands for ``data dependent''.
bad choices for combining the diverse members of the committee.

Though there is no significant difference between the results obtained from the several combination schemes, the proposed approaches are among the best. It should be mentioned, however, that the BADD combination scheme is not useful if only a small number of training samples is available. The maximum 12.0% error rate obtained can be compared with the 12.9% error rate achieved in the ELENA project when using a single network of the above-mentioned architecture. The committee shows a quite impressive improvement in classification accuracy if compared with the best single network.

4.4. The Phoneme data

The performance evaluation results for the Phoneme data are given in Table 10. Though the different combination schemes show very similar performances on the Phoneme data, the lowest classification error was achieved by using the weighted averaging with data-dependent weights. We recall that the error rate achieved in the ELENA project was equal to 16.4 ± 1.3%. Again,
Table 10
Performance of various combination schemes for the Phoneme data when training sets consist of 2500 samples (left three error columns: training with early stopping; right three: saturated learning)
Classification schemes\Errors %    Min    Mean    Max    Min    Mean    Max
The best single NN 15.4 15.8 16.2 15.6 16.0 16.5
Majority rule 14.5 15.3 16.5 15.0 15.5 16.2
Borda count 14.5 15.3 16.5 15.0 15.5 16.2
Averaging 14.8 15.3 15.9 14.7 15.2 16.0
Bayesian 14.9 15.6 16.5 15.3 15.6 16.0
Zimmermann's operator 14.8 15.2 15.5 14.8 15.2 16.0
Weighted averaging (WA) 14.8 15.2 15.4 14.8 15.2 15.8
WA with dd weights 14.6 15.1 15.5 13.0 14.8 15.0
Choquet integral (CI) 14.6 15.3 15.6 14.6 15.3 16.2
Optimised Choquet integral 14.7 15.2 15.3 14.6 15.2 15.8
LOS 14.7 15.2 15.6 14.9 15.3 16.0
CI with dd densities 14.8 15.2 15.6 14.3 15.1 16.0
BADD 14.7 15.2 15.5 14.2 15.0 15.3
The abbreviation ``dd'' used in the table stands for ``data dependent''.
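The lowest error in Table 10 comes from weighted averaging with data-dependent weights. The paper's own weight-estimation procedure is not reproduced in this excerpt; the sketch below shows one plausible variant in which each classifier is weighted by its accuracy over the k validation samples nearest to the input, in the spirit of the local accuracy estimates of Woods et al. (1997). All names and the nearest-neighbour weighting are illustrative assumptions:

```python
import numpy as np

def combine_weighted(outputs, weights):
    """Soft fusion: weighted average of classifier outputs.
    outputs: (n_classifiers, n_classes) supports for one sample,
    weights: (n_classifiers,) nonnegative data-dependent weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                   # renormalise at each input
    return w @ np.asarray(outputs, dtype=float)

def local_accuracy_weights(x, classifiers, X_val, y_val, k=10):
    """Weight each classifier by its accuracy on the k validation
    samples nearest to x, so the weights depend on the input."""
    dist = np.linalg.norm(X_val - x, axis=1)
    nn = np.argsort(dist)[:k]
    weights = []
    for clf in classifiers:
        preds = clf(X_val[nn])        # clf maps samples -> class labels
        weights.append((preds == y_val[nn]).mean() + 1e-6)  # avoid all-zero
    return np.array(weights)
```

A classifier that is accurate near the current input thus dominates the average there, while a globally strong but locally weak classifier is down-weighted.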
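Zimmermann's compensatory operator, listed in Table 10 and used as the fourth fusion scheme, blends a conjunctive product term with a disjunctive algebraic-sum term. A minimal sketch of the gamma-operator of Zimmermann and Zysno (1980) follows; the parameter values and the optional per-input exponents delta are illustrative, not those estimated in the paper:

```python
def zimmermann_gamma(memberships, gamma=0.5, delta=None):
    """Compensatory gamma-operator of Zimmermann and Zysno:
    (product term)^(1 - gamma) * (algebraic-sum term)^gamma.
    gamma = 0 gives the pure product (intersection-like), gamma = 1
    the pure algebraic sum (union-like); intermediate gamma compensates."""
    if delta is None:
        delta = [1.0] * len(memberships)
    prod_and, prod_not = 1.0, 1.0
    for mu, d in zip(memberships, delta):
        prod_and *= mu ** d            # conjunctive part
        prod_not *= (1.0 - mu) ** d    # complement for the disjunctive part
    return (prod_and ** (1.0 - gamma)) * ((1.0 - prod_not) ** gamma)
```

For two memberships of 0.5 the operator returns 0.25 at gamma = 0, 0.75 at gamma = 1, and a compensated value between the two for intermediate gamma.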
Again, the BADD combination scheme performed well for the large training sets.

5. Discussion and conclusions

We have presented four schemes for soft fusion of the outputs of multiple neural classifiers. The weights assigned to the classifiers or groups of them are data dependent in the first three approaches. The first approach involves the calculation of fuzzy integrals. The second scheme performs weighted averaging with data-dependent weights. The third scheme performs linear combination of the outputs of classifiers via the BADD defuzzification strategy. In the last scheme, the outputs of multiple classifiers are combined using Zimmermann's compensatory operator. An empirical evaluation using widely accessible data sets substantiates the validity of the approaches with data-dependent weights compared to various existing combination schemes of multiple classifiers. The majority rule, the Borda count, combination by averaging, the weighted averaging, the Bayesian combination, the linear combination of order statistics and combination by fuzzy integral have been used for the comparison.

Combination by the weighted averaging with data-dependent weights has shown the best overall performance. Though the Choquet fuzzy integral with data-dependent densities was among the best approaches for network fusion, the overall classification accuracy achieved by using this approach was slightly lower than the accuracy obtained from the weighted averaging with data-dependent weights. For the large training sets, the BADD combination scheme was also among the best approaches. For the small training sets, however, this combination scheme was not a good choice.

Optimisation techniques have been used to estimate the combination weights for the following four approaches: Zimmermann's compensatory operator, the weighted averaging, the optimised Choquet integral and the LOS. The optimised Choquet integral has shown the best overall performance amongst these four approaches. The correct recognition rate obtained from the weighted averaging approach was very close to that obtained from the optimised Choquet integral. On average, the regularisation applied improved the performance of the committees.

Zimmermann's compensatory operator and the LOS operator have shown a slightly lower performance than the optimised Choquet integral and the weighted averaging approach. The optimised Choquet integral yielded a lower error rate than the one with the subjectively determined fuzzy densities. Therefore, we can expect to further improve the approaches with data-dependent weights and fuzzy densities by using optimisation techniques to estimate the combination weights.

Our conclusion about the usefulness of the majority rule, the Borda count, and the Bayesian approaches for fusing networks that are correlated and diverse in performance is negative, especially if there is a clear leader among the networks being combined.

A higher performance of the committee of networks could probably be achieved by using bootstrapping techniques for constructing the training sets. Bootstrapping was not used, however, since the goal was to compare various combining techniques, not to achieve a classification error as small as possible.

Acknowledgements

We gratefully acknowledge the support we have received from The Swedish National Board for Industrial and Technical Development. We also thank two anonymous reviewers for their valuable comments.

References

Amari, S., Murata, N., Muller, K.-R., Finke, M., Yang, H., 1996. Statistical theory of overtraining – Is cross-validation asymptotically effective? In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (Eds.), Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, pp. 176–182.
Battiti, R., Colla, M., 1994. Democracy in neural nets: Voting schemes for classification. Neural Networks 7 (4), 691–707.
Ceccarelli, M., Petrosino, A., 1997. Multi-feature adaptive classifiers for SAR image segmentation. Neurocomputing 14, 345–363.
Chen, W., Gader, P.D., Shi, H., 1997. Improved dynamic programming-based handwritten word recognition using optimal order statistics. In: Proc. of the Internat. Conf. Statistical and Stochastic Methods in Image Processing II. San Diego, pp. 246–256.
Chi, Z., Yan, H., Pham, T.D., 1996. Fuzzy Algorithms: With Applications to Image Processing and Pattern Recognition. World Scientific, Singapore.
Cho, S.B., Kim, J.H., 1995. Combining multiple neural networks by fuzzy integral for robust classification. IEEE Transactions on Systems, Man, and Cybernetics 25 (2), 380–384.
Denoeux, T., 1995. A k-nearest neighbor classification rule based on Dempster–Shafer theory. IEEE Transactions on Systems, Man and Cybernetics – Part B 25 (5), 804–813.
Efron, B., Tibshirani, R., 1993. An Introduction to the Bootstrap. Chapman and Hall, London.
Filev, D., Yager, R.R., 1991. A generalized defuzzification method under BADD distributions. International Journal of Intelligent Systems 6, 687–697.
Gader, P.D., Mohamed, M.A., Keller, J.M., 1996. Fusion of handwritten word classifiers. Pattern Recognition Letters 17, 577–584.
Grabisch, M., 1995. Fuzzy integral in multicriteria decision making. Fuzzy Sets and Systems 69, 279–298.
Grabisch, M., 1996. The representation of importance and interaction of features by fuzzy measures. Pattern Recognition Letters 17, 567–575.
Grabisch, M., Nicolas, J.-M., 1994. Classification by fuzzy integral: Performance and tests. Fuzzy Sets and Systems 65, 255–271.
Hashem, S., 1997. Optimal linear combinations of neural networks. Neural Networks 10 (4), 599–614.
Heskes, T., 1997. Balancing between bagging and bumping. In: Mozer, M.C., Jordan, M.I., Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9. MIT Press, Cambridge, pp. 466–472.
Ho, T.K., Hull, J.J., Srihari, S.N., 1994. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (1), 66–75.
Hocaoglu, A.K., Gader, P.D., 1998. Choquet integral representations of nonlinear filters with applications to LADAR image processing. In: Proceedings of the SPIE Conference Nonlinear Image Processing IX, San Jose, CA, pp. 66–72.
Jacobs, R.A., 1995. Methods for combining experts' probability assessments. Neural Computation 7 (5), 867–888.
Ji, C., Ma, S., 1997. Combined weak classifiers. In: Mozer, M.C., Jordan, M.I., Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9. MIT Press, Cambridge, pp. 494–500.
Jutten, C., Guerin-Dugue, A., Aviles-Cruz, C., Voz, J.L., Van Cappel, D., 1995. ESPRIT basic research project number 6891 ELENA, Enhanced learning for evolutive neural architecture.
Kang, H.-J., Kim, K., Kim, J.H., 1997. Optimal approximation of discrete probability distribution with kth-order dependency and its application to combining multiple classifiers. Pattern Recognition Letters 18, 515–523.
Kittler, J., Hojjatoleslami, A., Windeatt, T., 1997a. Strategies for combining classifiers employing shared and distinct pattern representations. Pattern Recognition Letters 18, 1373–1377.
Kittler, J., Matas, J., Jonsson, K., Ramos Sánchez, M.U., 1997b. Combining evidence in personal identity verification systems. Pattern Recognition Letters 18, 845–852.
Kittler, J., Hatef, M., Duin, R.P.W., Matas, J., 1998. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (3), 226–239.
Krishnapuram, R., Lee, J., 1992. Fuzzy-set-based hierarchical networks for information fusion in computer vision. Neural Networks 4, 335–350.
Krogh, A., Vedelsby, J., 1995. Neural network ensembles, cross validation, and active learning. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (Eds.), Advances in Neural Information Processing Systems 7. MIT Press, Cambridge, pp. 231–238.
Kuncheva, L., 1997. An application of OWA operators to the aggregation of multiple classification decisions. In: Yager, R., Kacprzyk, J. (Eds.), The Ordered Weighted Averaging Operators. Theory and Applications. Kluwer Academic Publishers, Dordrecht, pp. 330–343.
Kuncheva, L., Bezdek, J., Sutton, M., 1998. On combining multiple classifiers by fuzzy templates. In: Proceedings of the 1998 Annual Meeting of the North American Fuzzy Information Processing Society, NAFIPS'98, Pensacola FL, pp. 193–197.
Lam, L., Suen, C.Y., 1995. Optimal combination of pattern classifiers. Pattern Recognition Letters 16, 945–954.
Le Hegarat-Mascle, S., Bloch, I., Vidal-Madjar, D., 1998. Introduction of neighborhood information in evidence theory and application to data fusion of radar and optical images with partial cloud cover. Pattern Recognition 31 (11), 1811–1823.
Merz, C.J., Pazzani, M.J., 1997. Combining neural network regression estimates with regularized linear weights. In: Mozer, M.C., Jordan, M.I., Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9. MIT Press, Cambridge, pp. 564–570.
Mirhosseini, A.R., Yan, H., Lam, K.-M., Pham, T., 1998. Human face recognition: an evidence aggregation approach. Computer Vision and Image Understanding 71 (2), 213–230.
Munro, P.W., Parmanto, B., 1997. Competition among networks improves committee performance. In: Mozer, M.C., Jordan, M.I., Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9. MIT Press, Cambridge, pp. 592–598.
Perrone, M.P., Cooper, L.N., 1993. When networks disagree: Ensemble method for neural networks. In: Mammone, R.J. (Ed.), Neural Networks for Speech and Image Processing. Chapman and Hall, London.
Pham, T., Yan, H., 1997. Fusion of handwritten numeral classifiers based on fuzzy and genetic algorithms. In: Proceedings of the 1997 Annual Meeting of the North American Fuzzy Information Processing Society, NAFIPS'97, New York, pp. 257–262.
Rogova, G., 1994. Combining the results of several neural network classifiers. Neural Networks 7 (5), 777–781.
Shi, H., Gader, P.D., Chen, W., 1998. Fuzzy integral filters: Properties and parallel implementation. Real-Time Imaging 4 (4), 233–241.
Sollich, P., Krogh, A., 1996. Learning with ensembles: How over-fitting can be useful. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (Eds.), Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, pp. 190–197.
Sugeno, M., 1977. Fuzzy measures and fuzzy integrals: A survey. In: Automata and Decision Making. North Holland, Amsterdam, pp. 89–102.
Tahani, H., Keller, J.M., 1990. Information fusion in computer vision using the fuzzy integral. IEEE Transactions on Systems, Man and Cybernetics 20 (3), 733–741.
Taniguchi, M., Tresp, V., 1997. Averaging regularized estimators. Neural Computation 9, 1163–1178.
Tresp, V., Taniguchi, M., 1995. Combining estimators using non-constant weighting functions. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (Eds.), Advances in Neural Information Processing Systems 7. MIT Press, Cambridge.
Verikas, A., Signahl, M., Malmqvist, K., Bacauskiene, M., 1997. Fuzzy committee of experts for segmentation of colour images. In: Proceedings of the Fifth European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany, Vol. 3, pp. 1902–1906.
Verikas, A., Malmqvist, K., Bergman, L., Signahl, M., 1998. Colour classification by neural networks in graphic arts. Neural Computing and Applications 7, 52–64.
Wang, L., Chen, K., Chi, H., 1997. Methods of linear combination based on different features. In: Proceedings of the International Conference on Neural Information Processing, Dunedin, New Zealand, Vol. 2, pp. 1088–1091.
Waterhouse, S., Cook, G., 1997. Ensemble methods for phoneme classification. In: Mozer, M.C., Jordan, M.I., Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9. MIT Press, Cambridge, pp. 800–806.
Woods, K., Kegelmeyer, W.P., Bowyer, K., 1997. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (4), 405–410.
Xu, L., Krzyzak, A., Suen, C.Y., 1992. Methods for combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man, and Cybernetics 22 (3), 418–435.
Yager, R.R., Filev, D.P., 1994. On a flexible structure for fuzzy systems models. In: Yager, R.R., Zadeh, L.A. (Eds.), Fuzzy Sets, Neural Networks, and Soft Computing. Van Nostrand Reinhold, New York.
Zimmermann, H.J., Zysno, P., 1980. Latent connectives in human decision making. Fuzzy Sets and Systems 4 (1), 37–51.
Zimmermann, H.J., Zysno, P., 1984. Decisions and evaluations by hierarchical aggregation of information. Fuzzy Sets and Systems 10 (3), 243–260.