American Society for Quality
A Fast Algorithm for the Minimum Covariance Determinant Estimator
Author(s): Peter J. Rousseeuw and Katrien Van Driessen
Source: Technometrics, Vol. 41, No. 3 (Aug., 1999), pp. 212-223
Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association and the American Society for Quality
Stable URL: http://www.jstor.org/stable/1270566
Accessed: 01-11-2015 04:37 UTC
A Fast Algorithm for the Minimum Covariance Determinant Estimator

Peter J. ROUSSEEUW
Department of Mathematics and Computer Science
Universitaire Instelling Antwerpen
Universiteitsplein 1
B-2610 Wilrijk, Belgium
(peter.rousseeuw@uia.ua.ac.be)

Katrien VAN DRIESSEN
Faculty of Applied Economics
Universitaire Faculteiten Sint-Ignatius
Prinsstraat 13
B-2000 Antwerp, Belgium
(katrien.vandriessen@ufsia.ac.be)
The minimum covariance determinant (MCD) method of Rousseeuw is a highly robust estimator of multivariate location and scatter. Its objective is to find h observations (out of n) whose covariance matrix has the lowest determinant. Until now, applications of the MCD were hampered by the computation time of existing algorithms, which were limited to a few hundred objects in a few dimensions. We discuss two important applications of larger size, one about a production process at Philips with n = 677 objects and p = 9 variables, and a dataset from astronomy with n = 137,256 objects and p = 27 variables. To deal with such problems we have developed a new algorithm for the MCD, called FAST-MCD. The basic ideas are an inequality involving order statistics and determinants, and techniques which we call "selective iteration" and "nested extensions." For small datasets, FAST-MCD typically finds the exact MCD, whereas for larger datasets it gives more accurate results than existing algorithms and is faster by orders of magnitude. Moreover, FAST-MCD is able to detect an exact fit, that is, a hyperplane containing h or more observations. The new algorithm makes the MCD method available as a routine tool for analyzing multivariate data. We also propose the distance-distance plot (D-D plot), which displays MCD-based robust distances versus Mahalanobis distances, and illustrate it with some examples.

KEY WORDS: Breakdown value; Multivariate location and scatter; Outlier detection; Regression; Robust estimation.
It is difficult to detect outliers in p-variate data when p > 2 because one can no longer rely on visual inspection. Although it is still quite easy to detect a single outlier by means of the Mahalanobis distances, this approach no longer suffices for multiple outliers because of the masking effect, by which multiple outliers do not necessarily have large Mahalanobis distances. It is better to use distances based on robust estimators of multivariate location and scatter (Rousseeuw and Leroy 1987, pp. 265-269). In regression analysis, robust distances computed from the explanatory variables allow us to detect leverage points. Moreover, robust estimation of multivariate location and scatter is the key tool to robustify other multivariate techniques such as principal component analysis and discriminant analysis.

Many methods for estimating multivariate location and scatter break down in the presence of n/(p + 1) outliers, where n is the number of observations and p is the number of variables, as was pointed out by Donoho (1982). For the breakdown value of the multivariate M-estimators of Maronna (1976), see Hampel, Ronchetti, Rousseeuw, and Stahel (1986, p. 296). In the meantime, several positive-breakdown estimators of multivariate location and scatter have been proposed. One of these is the minimum volume ellipsoid (MVE) method of Rousseeuw (1984, p. 877; 1985). This approach looks for the ellipsoid with smallest volume that covers h data points, where n/2 < h < n. Its breakdown value is essentially (n - h)/n.

Positive-breakdown methods such as the MVE and least trimmed squares regression (Rousseeuw 1984) are increasingly being used in practice, for example, in finance, chemistry, electrical engineering, process control, and computer vision (Meer, Mintz, Rosenfeld, and Kim 1991). For a survey of positive-breakdown methods and some substantive applications, see Rousseeuw (1997).

The basic resampling algorithm for approximating the MVE, called MINVOL, was proposed by Rousseeuw and Leroy (1987). This algorithm considers a trial subset of p + 1 observations and calculates its mean and covariance matrix. The corresponding ellipsoid is then inflated or deflated to contain exactly h observations. This procedure is repeated many times, and the ellipsoid with the lowest volume is retained. For small datasets it is possible to consider all subsets of size p + 1, whereas for larger datasets the trial subsets are drawn at random.

Several other algorithms have been proposed to approximate the MVE. Woodruff and Rocke (1993) constructed algorithms combining the resampling principle with three heuristic search techniques: simulated annealing, genetic algorithms, and tabu search. Other people developed algorithms to compute the MVE exactly. This work started with the algorithm of Cook, Hawkins, and Weisberg (1992), which carries out an ingenious but still exhaustive search of all possible subsets of size h.

(c) 1999 American Statistical Association and the American Society for Quality

TECHNOMETRICS, AUGUST 1999, VOL. 41, NO. 3
In practice, this can be done for n up to about 30. Recently, Agulló (1996) developed an exact algorithm for the MVE that is based on a branch and bound procedure that selects the optimal subset without requiring the inspection of all subsets of size h. This is substantially faster and can be applied up to (roughly) n <= 100 and p <= 5. Because for most datasets the exact algorithms would take too long, the MVE is typically computed by versions of MINVOL, for example, in S-PLUS (see the function cov.mve).

Presently there are several reasons for replacing the MVE by the minimum covariance determinant (MCD) estimator, which was also proposed by Rousseeuw (1984, p. 877; 1985). The MCD objective is to find h observations (out of n) whose classical covariance matrix has the lowest determinant. The MCD estimate of location is then the average of these h points, and the MCD estimate of scatter is their covariance matrix. The resulting breakdown value equals that of the MVE, but the MCD has several advantages over the MVE. Its statistical efficiency is better because the MCD is asymptotically normal (Butler, Davies, and Jhun 1993), whereas the MVE has a lower convergence rate (Davies 1992). As an example, the asymptotic efficiency of the MCD scatter matrix with the typical coverage h = .75n is 44% in 10 dimensions, and the reweighted covariance matrix with weights obtained from the MCD attains 83% of efficiency (Croux and Haesbroeck in press), whereas the MVE attains 0%. The MCD's better accuracy makes it very useful as an initial estimate for one-step regression estimators (Simpson, Ruppert, and Carroll 1992; Coakley and Hettmansperger 1993). Robust distances based on the MCD are more precise than those based on the MVE and hence better suited to expose multivariate outliers, for example, in the diagnostic plot of Rousseeuw and van Zomeren (1990), which displays robust residuals versus robust distances. Moreover, the MCD is a key component of the hybrid estimators of Woodruff and Rocke (1994) and Rocke and Woodruff (1996) and of high-breakdown linear discriminant analysis (Hawkins and McLachlan 1997).

In spite of all these advantages, until now the MCD has rarely been applied because it was harder to compute. In this article, however, we construct a new MCD algorithm that is actually much faster than any existing MVE algorithm. The new MCD algorithm can deal with a sample size n in the tens of thousands. As far as we know, none of the existing MVE algorithms can cope with such large sample sizes. Because the MCD now greatly outperforms the MVE in terms of both statistical efficiency and computation speed, we recommend the MCD method.

1. MOTIVATING PROBLEMS

Two recent problems will be shown to illustrate the need for a fast, robust method that can deal with many objects (n) and/or many variables (p) while maintaining a reasonable statistical efficiency.

Problem 1 (Engineering). We are grateful to Gertjan Otten for providing the following problem. Philips Mecoma (The Netherlands) is producing diaphragm parts for TV sets. These are thin metal plates, molded by a press. Recently a new production line was started, and for each of n = 677 parts, nine characteristics were measured. The aim of the multivariate analysis is to gain insight into the production process and the interrelations between the nine measurements, and to find out whether deformations or abnormalities have occurred and why. Afterward, the estimated location and scatter matrix can be used for multivariate statistical process control.

Due to the support of Herman Veraa and Frans Van Dommelen (at Philips PMF/Mecoma, Product Engineering, P.O. Box 218, 5600 MD Eindhoven, The Netherlands), we obtained permission to analyze these data and to publish the results.

Figure 1 shows the classical Mahalanobis distance

    MD(x_i) = sqrt( (x_i - T_0)' S_0^{-1} (x_i - T_0) )    (1.1)

versus the index i, which corresponds to the production sequence. Here x_i is nine-dimensional, T_0 is the arithmetic mean, and S_0 is the classical covariance matrix. The horizontal line is at the usual cutoff value sqrt(chi^2_{9,.975}) = 4.36.

In Figure 1 it seems that most observations are consistent with the classical multivariate normal model, except for a few isolated outliers. This should not surprise us, even in the first experimental run of a new production line, because the Mahalanobis distances are known to suffer from masking. That is, even if there were a group of outliers (here, deformed diaphragm parts), they would affect T_0 and S_0 in such a way as to become invisible in Figure 1. To further investigate these data, we need robust estimators T and S, preferably with a substantial statistical efficiency so that we can be confident that any effects that may become visible are real and not due to the estimator's inefficiency. After developing the FAST-MCD algorithm, we will return to these data in Section 7.

Figure 1. Plot of Mahalanobis Distances for the Philips Data.
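As a concrete illustration of (1.1), the classical distances and the cutoff used in Figure 1 can be computed along these lines (a minimal numpy sketch; the function name and synthetic data are ours, not part of the paper):

```python
import numpy as np

def mahalanobis_distances(X):
    """Classical Mahalanobis distances MD(x_i) of eq. (1.1): each row of X
    is measured relative to the arithmetic mean T0 and the classical
    covariance matrix S0 of the sample."""
    T0 = X.mean(axis=0)
    S0 = np.cov(X, rowvar=False)
    diff = X - T0
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S0), diff))

# Cutoff used for Figure 1 (p = 9): sqrt(chi^2_{9,.975}) = 4.36; points
# above this line would be flagged as outliers by the classical rule.
CUTOFF_P9 = 4.36
```

Under the normal model only about 2.5% of the points should exceed this cutoff, which is why masked outliers are so easy to miss in such a plot.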
Problem 2 (Physical Sciences). A group of astronomers at the California Institute of Technology are working on the Digitized Palomar Sky Survey (DPOSS); for a full description, see their report (Odewahn, Djorgovsky, Brunner, and Gal 1998). In essence, they make a survey of celestial objects (light sources) for which they record nine characteristics (such as magnitude, area, image moments) in each of three bands: blue, red, and near-infrared. They seek collaboration with statisticians to analyze their data, and gave us access to a part of their database, containing 137,256 celestial objects with all 27 variables.

We started by using quantile-quantile plots, Box-Cox transforms, selecting one variable out of three variables with near-perfect linear correlation, and other tools of data analysis. One of these avenues led us to study six variables (two from each band). Figure 2 plots the Mahalanobis distances (1.1) for these data (to avoid overplotting, Fig. 2 shows only 10,000 randomly drawn points from the entire plot). The cutoff is sqrt(chi^2_{6,.975}) = 3.82. In Figure 2 we see two groups of outliers with MD(x_i) around 9 and MD(x_i) around 12, plus some outliers still further away.

Returning to the data and their astronomical meaning, it turned out that these were all objects for which one or more variables fell outside the range of what is physically possible. So, the MD(x_i) did help us to find outliers at this stage. We then cleaned the data by removing all objects with a physically impossible measurement, which reduced our sample size to 132,402. To these data we then again applied the classical mean T_0 and covariance S_0, yielding the plot of Mahalanobis distances in Figure 3.

Figure 3 looks innocent, like observations from the square root of a chi-squared distribution, as if the data would form a homogeneous population (which is doubtful because we know that the database contains stars as well as galaxies). To proceed further, we need high-breakdown estimates T and S and an algorithm that can compute them for n = 132,402. Such an algorithm will be constructed in the next sections.

Figure 3. Digitized Palomar Data: Plot of Mahalanobis Distances of Celestial Objects as in Figure 2 After Removal of Physically Impossible Measurements.
Figure 2. Digitized Palomar Data: Plot of Mahalanobis Distances of Celestial Objects, Based on Six Variables Concerning Magnitude and Image Moments.

2. BASIC THEOREM AND THE C-STEP

A key step of the new algorithm is the fact that, starting from any approximation to the MCD, it is possible to compute another approximation with an even lower determinant.

Theorem 1. Consider a dataset X_n = {x_1, ..., x_n} of p-variate observations. Let H_1 be a subset of {1, ..., n} with |H_1| = h, and put T_1 := (1/h) sum_{i in H_1} x_i and S_1 := (1/h) sum_{i in H_1} (x_i - T_1)(x_i - T_1)'. If det(S_1) != 0, define the relative distances

    d_1(i) := sqrt( (x_i - T_1)' S_1^{-1} (x_i - T_1) )    for i = 1, ..., n.

Now take H_2 such that {d_1(i); i in H_2} := {(d_1)_{1:n}, ..., (d_1)_{h:n}}, where (d_1)_{1:n} <= (d_1)_{2:n} <= ... <= (d_1)_{n:n} are the ordered distances, and compute T_2 and S_2 based on H_2. Then

    det(S_2) <= det(S_1)

with equality if and only if T_2 = T_1 and S_2 = S_1.

The proof is given in the Appendix. Although this theorem appears to be quite basic, we have been unable to find it in the literature.

The theorem requires that det(S_1) != 0, which is no real restriction because if det(S_1) = 0 we already have the minimal objective value. Section 5 will explain how to interpret the MCD in such a singular situation.

If det(S_1) > 0, applying the theorem yields S_2 with det(S_2) <= det(S_1). In our algorithm we will refer to the construction in Theorem 1 as a C-step, where C stands for "concentration" because we concentrate on the h observations with smallest distances and S_2 is more concentrated (has a lower determinant) than S_1. In algorithmic terms, the C-step can be described as follows.

Given the h-subset H_old or the pair (T_old, S_old), perform the following:

1. Compute the distances d_old(i) for i = 1, ..., n.
2. Sort these distances, which yields a permutation pi for which d_old(pi(1)) <= d_old(pi(2)) <= ... <= d_old(pi(n)).
3. Put H_new := {pi(1), pi(2), ..., pi(h)}.
4. Compute T_new := ave(H_new) and S_new := cov(H_new).
For a fixed number of dimensions p, the C-step takes only O(n) time [because H_new can be determined in O(n) operations without sorting all the d_old(i) distances].
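The four steps of the C-step can be sketched in a few lines of numpy (a hedged illustration; `c_step` and its argument conventions are our own, and the O(n) selection is done with `argpartition` rather than a full sort):

```python
import numpy as np

def c_step(X, H, h):
    """One C-step from Theorem 1: given an h-subset H (array of row indices
    into X), compute (T_old, S_old), rank all n points by their relative
    distance, and return the indices of the h points with smallest distances."""
    T = X[H].mean(axis=0)
    S = np.cov(X[H], rowvar=False, bias=True)   # (1/h)-normalized, as in the theorem
    diff = X - T
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
    # argpartition finds the h smallest distances in O(n), without a full sort
    return np.argpartition(d2, h - 1)[:h]
```

By Theorem 1, iterating `c_step` can never increase the covariance determinant of the retained subset.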
Repeating C-steps yields an iteration process. If det(S_2) = 0 or det(S_2) = det(S_1), we stop; otherwise, we run another C-step yielding det(S_3), and so on. The sequence det(S_1) >= det(S_2) >= det(S_3) >= ... is nonnegative and hence must converge. In fact, because there are only finitely many h-subsets, there must be an index m such that det(S_m) = 0 or det(S_m) = det(S_{m-1}), hence convergence is reached. (In practice, m is often below 10.) Afterward, running the C-step on (T_m, S_m) no longer reduces the determinant. This is not sufficient for det(S_m) to be the global minimum of the MCD objective function, but it is a necessary condition.

Theorem 1 thus provides a partial idea for an algorithm:

    Take many initial choices of H_1 and apply C-steps to each until convergence, and keep the solution with lowest determinant.    (2.1)

Of course, several questions must be answered to make (2.1) operational: How do we generate sets H_1 to begin with? How many H_1 are needed? How do we avoid duplication of work because several H_1 may yield the same solution? Can we do with fewer C-steps? What about large sample sizes? These matters will be discussed in the next sections.

Corollary 1. The MCD subset H of X_n is separated from X_n \ H by an ellipsoid.

Proof: For the MCD subset H, and in fact any limit of a C-step sequence, applying the C-step to H yields H itself. This means that all x_i in H satisfy (x_i - T)' S^{-1} (x_i - T) <= q := {(x - T)' S^{-1} (x - T)}_{h:n}, whereas all x_j not in H satisfy (x_j - T)' S^{-1} (x_j - T) >= q. Take the ellipsoid E = {x; (x - T)' S^{-1} (x - T) <= q}. Then H is contained in E and X_n \ H is contained in closure(E^c). Note that there is at least one point x_i in H on the boundary of E, whereas there may or may not be a point x_j not in H on the boundary of E.

The same result was proved by Butler et al. (1993) under the extra condition that a density exists. Note that the ellipsoid in Corollary 1 contains h observations but is not necessarily the smallest ellipsoid to do so, which would yield the MVE. We know of no technique like the C-step for the MVE estimator; hence, the latter estimator cannot be computed faster in this way.

Independently of our work, Hawkins and Olive (1999) discovered a version of Corollary 1 in the following form: "A necessary condition for the MCD optimum is that, if we calculate the distance of each case from the location vector using the scatter matrix, each covered case must have smaller distance than any uncovered case." This necessary condition could perhaps be called the "C-condition," as opposed to the C-step of Theorem 1, where we proved that a C-step always decreases det(S). In the absence of Theorem 1, Hawkins and Olive (1999) used the C-condition as a preliminary screen, followed by case swapping as a technique for decreasing det(S), as in the feasible solution approach (Hawkins 1994), which will be described in Section 6. The C-condition did not reduce the time complexity of this approach, but it did reduce the actual computation time in experiments with fixed n.

3. CONSTRUCTION OF THE NEW ALGORITHM

3.1 Creating Initial Subsets H_1

To apply the algorithmic concept (2.1), we first have to decide how to construct the initial subsets H_1. Let us consider the following two possibilities:

1. Draw a random h-subset H_1.
2. Draw a random (p + 1)-subset J, and then compute T_0 := ave(J) and S_0 := cov(J). [If det(S_0) = 0, then extend J by adding another random observation, and continue adding observations until det(S_0) > 0.] Then compute the distances d_0(i) := (x_i - T_0)' S_0^{-1} (x_i - T_0) for i = 1, ..., n. Sort them into d_0(pi(1)) <= ... <= d_0(pi(n)) and put H_1 := {pi(1), ..., pi(h)}.

Option 1 is the simplest, whereas 2 starts like the MINVOL algorithm (Rousseeuw and Leroy 1987, pp. 259-260). It would be useless to draw fewer than p + 1 points, for then S_0 is always singular.

When the dataset does not contain outliers or deviating groups of points, it makes little difference whether (2.1) is applied with 1 or 2. But because the MCD is a very robust estimator, we have to consider contaminated datasets in particular. For instance, we generated a dataset with n = 400 observations and p = 2 variables, in which 205 observations were drawn from the bivariate normal distribution

    N_2( (0, 0)' , ( 1  0 ; 0  2 ) )

and the other 195 observations were drawn from a second bivariate normal distribution centered at (10, 0)'. The MCD has its highest possible breakdown value when h = [(n + p + 1)/2] (see Lopuhaä and Rousseeuw 1991), which becomes h = 201 here. We now apply (2.1) with 500 starting sets H_1. Using option 1 yields a resulting (T, S) whose 97.5% tolerance ellipse is shown in Figure 4(a). Clearly, this result has broken down due to the contaminated data. On the other hand, option 2 yields the result in Figure 4(b), which concentrates on the majority (51.25%) of the data.

The situation in Figure 4 is extreme, but it is useful for illustrative purposes. (The same effect also occurs for smaller amounts of contamination, especially in higher dimensions.) Approach 1 has failed because each random subset H_1 contains a sizable number of points from the majority group as well as from the minority group, which follows from the law of large numbers. When starting from a bad subset H_1, the iterations will not converge to the major solution. On the other hand, the probability of a (p + 1)-subset without outliers is much higher, which explains why 2 yields many subsets H_1 consisting of points from the majority and hence a robust result. From now on, we will always use 2.
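Option 2 for constructing H_1 can be sketched as follows (our own helper, assuming `X` is an n x p numpy array and `rng` a numpy Generator):

```python
import numpy as np

def initial_h_subset(X, h, rng):
    """Construct H_1 by option 2 of Subsection 3.1: draw a random
    (p + 1)-subset J, enlarge it until cov(J) is nonsingular, then keep
    the h points with the smallest distances relative to (ave(J), cov(J))."""
    n, p = X.shape
    order = rng.permutation(n)
    m = p + 1
    while np.linalg.det(np.cov(X[order[:m]], rowvar=False)) <= 0:
        m += 1                       # extend J by one more random observation
    T0 = X[order[:m]].mean(axis=0)
    S0 = np.cov(X[order[:m]], rowvar=False)
    diff = X - T0
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S0), diff)
    return np.argpartition(d2, h - 1)[:h]
```

Because only p + 1 points seed the subset, the chance that all of them come from the uncontaminated majority stays workable even for sizable contamination, which is the reason option 2 is preferred.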
Figure 4. Results of Iterating C-Steps Starting From 500 Random Subsets H_1 of (a) Size h = 201 and (b) Size p + 1 = 3.

Remark. For increasing n, the probability of having at least one "clean" (p + 1)-subset among m random (p + 1)-subsets tends to

    1 - (1 - (1 - eps)^{p+1})^m > 0,    (3.1)

where eps is the percentage of outliers. In contrast, the probability of having at least one clean h-subset among m random h-subsets tends to 0 because h increases with n.

3.2 Selective Iteration

Each C-step calculates a covariance matrix, its determinant, and all relative distances. Therefore, reducing the number of C-steps would improve the speed. But is this possible without losing the effectiveness of the algorithm? It turns out that often the distinction between robust solutions and nonrobust solutions already becomes visible after two or three C-steps. For instance, consider the data of Figure 4 again. The inner workings of the algorithm (2.1) are traced in Figure 5. For each starting subsample H_1, the determinant of the covariance matrix S_j based on h = 201 observations is plotted versus the step number j. The runs yielding a robust solution are shown as solid lines, whereas the dashed lines correspond to nonrobust results. To get a clear picture, Figure 5 only shows the first 100 starts. After two C-steps (i.e., for j = 3), many subsamples H_3 that will lead to the global optimum already have a rather small determinant. The global optimum is a solution that contains none of the 195 "bad" points. By contrast, the determinants of the subsets H_3 leading to a false classification are considerably larger. For that reason, we can save much computation time and still obtain the same result by taking just two C-steps and retaining only the (say, 10) best H_3 subsets to iterate further. Other datasets, also in more dimensions, confirm these conclusions. Therefore, from now on we will take only two C-steps from each initial subsample H_1, select the 10 different subsets H_3 with the lowest determinants, and for only these 10 we continue taking C-steps until convergence.

Figure 5. Covariance Determinant of Subsequent C-Steps in the Dataset of Figure 4. Each sequence stops when no further reduction is obtained.

3.3 Nested Extensions

For a small sample size n, the preceding algorithm does not take much time. But when n grows, the computation time increases, mainly due to the n distances that need to be calculated each time. To avoid doing all the computations in the entire dataset, we will consider a special structure. When n > 1,500, the algorithm generates a nested system of subsets that looks like Figure 6, where the arrows mean "is a subset of." The five subsets of size 300 do not overlap, and together they form the merged set of size 1,500, which in turn is a proper subset of the dataset of size n.

    ( 300 )  ( 300 )  ( 300 )  ( 300 )  ( 300 )  ->  ( 1500 )  ->  ( n )

Figure 6. Nested System of Subsets Generated by the FAST-MCD Algorithm.
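The limit (3.1) is easy to evaluate numerically, which shows why 500 random (p + 1)-subsets suffice even under heavy contamination (a small illustration; the values of eps, p, and m are chosen by us):

```python
def prob_clean_start(eps, p, m):
    """Limit (3.1): the probability of at least one outlier-free
    (p + 1)-subset among m random (p + 1)-subsets, when a fraction eps
    of the data are outliers (each point drawn independently)."""
    return 1.0 - (1.0 - (1.0 - eps) ** (p + 1)) ** m

# With 20% contamination in p = 2 dimensions, 500 starts make a clean
# starting subset virtually certain, while a single start often fails.
p_many = prob_clean_start(0.20, 2, 500)
p_one = prob_clean_start(0.20, 2, 1)
```

Note that the same computation with h-subsets in place of (p + 1)-subsets drives the exponent (1 - eps)^h to 0 as h grows with n, which is exactly the failure mode of option 1.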
[Already the algorithm of Woodruff and Rocke (1994) made use of partitioning for this purpose. The only difference with the nested extensions in Figure 6 is that we work with two stages, hence our use of the word "nested," whereas Woodruff and Rocke partitioned the entire dataset, which yields more and/or larger subsets.] To construct Figure 6, the algorithm draws 1,500 observations, one by one, without replacement. The first 300 observations it encounters are put in the first subset, and so on. Because of this mechanism, each subset of size 300 is roughly representative of the dataset, and the merged set with 1,500 cases is even more representative.

When n <= 600, we will keep the algorithm as in the previous section, while for n >= 1,500 we will use Figure 6. When 600 < n < 1,500, we will partition the data into at most four subsets of 300 or more observations so that each observation belongs to a subset and such that the subsets have roughly the same size. For instance, 601 will be split as 300 + 301 and 900 as 450 + 450. For n = 901, we use 300 + 300 + 301, and we continue until 1,499 = 375 + 375 + 375 + 374. By splitting 601 as 300 + 301 we do not mean that the first subset contains the observations with case numbers 1, ..., 300 but that its 300 case numbers were drawn randomly from 1, ..., 601.

Whenever n > 600 (and whether n < 1,500 or not), our new algorithm for the MCD will take two C-steps from several starting subsamples H_1 within each subset, with a total of 500 starts for all subsets together. For every subset the best 10 solutions are stored. Then the subsets are pooled, yielding a merged set with at most 1,500 observations. Each of these (at most 50) available solutions (T_sub, S_sub) is then extended to the merged set. That is, starting from each (T_sub, S_sub), we continue taking C-steps, which now use all 1,500 observations in the merged set. Only the best 10 solutions (T_merged, S_merged) will be considered further. Finally, each of these 10 solutions is extended to the full dataset in the same way, and the best solution (T_full, S_full) is reported.

Because the final computations are carried out in the entire dataset, they take more time when n increases. In the interest of speed we can limit the number of initial solutions (T_merged, S_merged) and/or the number of C-steps in the full dataset as n becomes large.

The main idea of this subsection was to carry out C-steps in several nested random subsets, starting with small subsets of around 300 observations and ending with the entire dataset of n observations. Throughout this subsection, we have chosen several numbers such as five subsets of 300 observations, 500 starts, 10 best solutions, and so on. These choices were based on various empirical trials (not reported here). We implemented our choices as defaults so the user does not have to choose anything, but of course the user may change the defaults.

4. THE RESULTING ALGORITHM FAST-MCD

Combining all the components of the preceding sections yields the new algorithm, which we will call FAST-MCD. Its pseudocode looks as follows:

1. The default h is [(n + p + 1)/2], but the user may choose any integer h with [(n + p + 1)/2] <= h <= n. The program then reports the MCD's breakdown value (n - h + 1)/n. If you are sure that the dataset contains less than 25% contamination, which is usually the case, a good compromise between breakdown value and statistical efficiency is obtained by putting h = [.75n].
2. If h = n, then the MCD location estimate T is the average of the whole dataset, and the MCD scatter estimate S is its covariance matrix. Report these and stop.
3. If p = 1 (univariate data), compute the MCD estimate (T, S) by the exact algorithm of Rousseeuw and Leroy (1987, pp. 171-172) in O(n log n) time; then stop.
4. From here on, h < n and p >= 2. If n is small (say, n <= 600), then

   * repeat (say) 500 times:
      * construct an initial h-subset H_1 using method 2 in Subsection 3.1, that is, starting from a random (p + 1)-subset;
      * carry out two C-steps (described in Sec. 2);
   * for the 10 results with lowest det(S_3):
      * carry out C-steps until convergence;
   * report the solution (T, S) with lowest det(S).

5. If n is larger (say, n > 600), then

   * construct up to five disjoint random subsets of size n_sub according to Section 3.3 (say, five subsets of size n_sub = 300);
   * inside each subset, repeat 500/5 = 100 times:
      * construct an initial subset H_1 of size h_sub = [n_sub(h/n)];
      * carry out two C-steps, using n_sub and h_sub;
      * keep the 10 best results (T_sub, S_sub);
   * pool the subsets, yielding the merged set (say, of size n_merged = 1,500);
   * in the merged set, repeat for each of the 50 solutions (T_sub, S_sub):
      * carry out two C-steps, using n_merged and h_merged = [n_merged(h/n)];
      * keep the 10 best results (T_merged, S_merged);
   * in the full dataset, repeat for the m_full best results:
      * take several C-steps, using n and h;
      * keep the best final result (T_full, S_full).

   Here, m_full and the number of C-steps (preferably, until convergence) depend on how large the dataset is.

We will refer to the preceding as the FAST-MCD algorithm. Note that it is affine equivariant: When the data are translated or subjected to a linear transformation, the resulting (T_full, S_full) will transform accordingly. The computer program contains two further steps, for consistency and reweighting.
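For the small-n branch (step 4), the pipeline of 500 starts, two C-steps each, and full iteration of the 10 best subsets can be sketched compactly (an illustrative implementation under our own naming, assuming data in general position so that no singular covariance matrix occurs):

```python
import numpy as np

def fast_mcd_small(X, h, n_starts=500, n_keep=10, rng=None):
    """Sketch of step 4 of the pseudocode (small n): many random starts,
    two C-steps each, then full C-step iteration of the n_keep most
    promising subsets. Returns the best h-subset and its determinant."""
    rng = np.random.default_rng(rng)
    n, p = X.shape

    def mean_cov(H):
        return X[H].mean(axis=0), np.cov(X[H], rowvar=False, bias=True)

    def c_step(H):
        T, S = mean_cov(H)
        diff = X - T
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
        return np.argpartition(d2, h - 1)[:h]

    def initial_subset():
        order = rng.permutation(n)
        m = p + 1
        while np.linalg.det(np.cov(X[order[:m]], rowvar=False)) <= 0:
            m += 1                  # enlarge the seed until cov is nonsingular
        T, S = X[order[:m]].mean(axis=0), np.cov(X[order[:m]], rowvar=False)
        diff = X - T
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
        return np.argpartition(d2, h - 1)[:h]

    # Phase 1: two C-steps from every start (selective iteration, Sec. 3.2).
    trials = []
    for _ in range(n_starts):
        H = c_step(c_step(initial_subset()))
        trials.append((np.linalg.det(mean_cov(H)[1]), H))
    trials.sort(key=lambda t: t[0])

    # Phase 2: iterate the n_keep best subsets until convergence.
    best_det, best_H = np.inf, None
    for det_cur, H in trials[:n_keep]:
        while True:
            H_new = c_step(H)
            det_new = np.linalg.det(mean_cov(H_new)[1])
            if det_new >= det_cur - 1e-14:   # no further reduction: converged
                break
            H, det_cur = H_new, det_new
        if det_cur < best_det:
            best_det, best_H = det_cur, H
    return best_H, best_det
```

This sketch omits the exact-fit handling of Section 5 and the nested extensions of step 5; it is meant only to make the control flow of the pseudocode concrete.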
6. To obtainconsistency
whenthedatacome froma multivariatenormaldistribution,
we put
TMCD=
Tfull
and
(i)
medid2
SMCD
(Tfull,Sfui)
Xp,.5
7. A onestepreweightedestimateis obtainedby
Sfu
(0
0
*
~~~ ~ ~ ~~~S.
~~~ ~ ~ ~ ~ ~
S
cm
T1
(=l
WiXi
(
Wi
0
0
a_
0~~~~
0~~~~~
C 
S1 =
Wi(Xix
 Tl)
)/
(E Win
I i=l
where
where
wi
4
 1 if
d(TMcD,SMcD)(i)
<
= 0 otherwise.
2
0
X1
Figure7. Exact FitSituation(n = 100, p = 2).
Xp,.975
There
are
55
observations
in
the
entire
dataset
of
that lie
on the line with the equaThe programFASTMCD has been thoroughlytested 100 observations
tion
and can be obtainedfromour Web site http://winwww.
uia.ac.be/u/statis/index.html. It has been incorporated into S-PLUS 4.5 (as the function "cov.mcd"), and it is also in SAS/IML 7 (as the function "MCD").

5. EXACT FIT SITUATIONS

An important advantage of the FAST-MCD algorithm is that it allows for exact fit situations, that is, when h or more observations lie on a hyperplane. In such a situation the algorithm still yields the MCD location T and scatter matrix S, the latter being singular as it should be. From (T, S) the program then computes the equation of the hyperplane.

When n is larger than (say) 600, the algorithm performs many calculations on subsets of the data. To deal with the combination of large n and exact fits, we added a few steps to the algorithm. Suppose that, during the calculations in a subset, we encounter some (T_sub, S_sub) with det(S_sub) = 0. Then we know that there are h_sub or more observations on the corresponding hyperplane. First we check whether h or more points of the full dataset lie on this hyperplane. If so, we compute (T_full, S_full) as the mean and covariance matrix of all points on the hyperplane, report this final result, and stop; if not, we continue. Because det(S_sub) = 0 is the best solution for that subset, we know that (T_sub, S_sub) will be among the 10 best solutions that are passed on. In the merged set we take the set H' of the h_merged observations with smallest orthogonal distances to the hyperplane, and start the next C-step from H'. Again, it is possible that during the calculations in the merged set we encounter some (T_merged, S_merged) with det(S_merged) = 0, in which case we repeat the preceding procedure.

As an illustration, the dataset in Figure 7 consists of 45 observations generated from a bivariate normal distribution, plus 55 observations that were generated on a straight line (using a univariate normal distribution). The FAST-MCD program (with default value h = 51) finds this line within .3 seconds. A part of the output follows:

  .000000 (x_i1 - m1) + 1.000000 (x_i2 - m2) = 0,

where the mean (m1, m2) of these observations is the MCD location

  (m1, m2) = (.10817, 5.00000)

and their covariance matrix is the MCD scatter matrix

  [ 1.40297   .00000 ]
  [  .00000   .00000 ].

Therefore, the MCD scatter matrix has determinant 0, and its tolerance ellipse becomes the line of exact fit.

If the original data were in p dimensions and it turns out that most of the data lie on a hyperplane, it is possible to apply FAST-MCD again to the data in this (p - 1)-dimensional space.

6. PERFORMANCE OF FAST-MCD

To get an idea of the performance of the overall algorithm, we start by applying FAST-MCD to some small datasets taken from Rousseeuw and Leroy (1987). To be precise, these were all regression datasets, but we ran FAST-MCD only on the explanatory variables, that is, not using the response variable. The first column of Table 1 lists the name of each dataset, followed by n and p. We used the default value of h = [(n + p + 1)/2]. The next column shows the number of starting (p + 1)-subsets used in FAST-MCD, which is usually 500 except for two datasets in which the number of possible (p + 1)-subsets out of n was fairly small, namely C(12, 3) = 220 and C(18, 3) = 816, so we used all of them.

The next entry in Table 1 is the result of FAST-MCD, given here as the final h-subset. Comparing these with the exact MCD algorithm of Agullo (personal communication, 1997), it turns out that these h-subsets do yield the exact global minimum of the objective function. The next column shows the running time of FAST-MCD in seconds on a Sun Ultra 2170. These times are much shorter than those of our MINVOL program for computing the MVE estimator.
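The default coverage h = [(n + p + 1)/2] and the start counts quoted above are easy to reproduce. A quick check using only Python's standard library (`default_h` is our own helper name, not from the original program):

```python
from math import comb

def default_h(n: int, p: int) -> int:
    """Default MCD coverage: h = floor((n + p + 1) / 2)."""
    return (n + p + 1) // 2

# Number of possible starting (p + 1)-subsets for the two smallest datasets:
print(comb(12, 3))       # Heart (n = 12, p = 2): 220 subsets, so all are used
print(comb(18, 3))       # Phosphor (n = 18, p = 2): 816 subsets, so all are used

print(default_h(12, 2))  # h = 7, matching the 7-point h-subset in Table 1
print(default_h(75, 3))  # HBK: h = 39
```

These values agree with the subset sizes listed in Table 1 (for instance, the Heart h-subset has exactly 7 indices and the HBK h-subset exactly 39).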
TECHNOMETRICS, AUGUST 1999, VOL. 41, NO. 3
A FAST ALGORITHM FOR THE MINIMUM COVARIANCE DETERMINANT ESTIMATOR
Table 1. Performance of the FAST-MCD and FSA Algorithms on Some Small Datasets

                                                                         Time (seconds)
Dataset     n   p  Starts  Best h-subset found                          FAST-MCD    FSA
Heart      12   2     220  1 3 4 5 7 9 11                                     .6     .6
Phosphor   18   2     816  3 5 8 9 11 12 13 14 15 17                         1.8    3.7
Stackloss  21   3     500  4 5 6 7 8 9 10 11 12 13 14 20                     2.1    4.6
Coleman    20   5     500  2 3 4 5 7 8 12 13 14 16 17 19 20                  4.2    8.9
Wood       20   5     500  1 2 3 5 9 10 12 13 14 15 17 18 20                 4.3    8.2
Salinity   28   3     500  1 2 6 7 8 12 13 14 18 20 21 22 25 26 27 28        2.4    8.6
HBK        75   3     500  15 16 17 18 19 20 21 22 23 24 26 27 31 32         5.0   71.5
                           33 35 36 37 38 40 43 49 50 51 54 55 56 58
                           59 61 63 64 66 67 70 71 72 73 74
We may conclude that for these small datasets FAST-MCD gives very accurate results in little time.

Let us now try the algorithm on larger datasets, with n > 100. In each dataset, we generated over 50% of the points from the standard multivariate normal distribution Np(0, Ip), and the remaining points from Np(mu, Ip), where mu = (b, b, ..., b)' with b = 10. This is the model of "shift outliers." For each dataset, Table 2 lists n, p, the percentage of majority points, and the percentage of contamination. The algorithm always used 500 starts and the default value of h = [(n + p + 1)/2].

The results of FAST-MCD are given in the next column, under "robust." Here "yes" means that the correct result is obtained, that is, corresponding to the first distribution [as in Fig. 4(b)], whereas "no" stands for the nonrobust result, in which the estimates describe the entire dataset [as in Fig. 4(a)]. Table 2 lists the data situations with the highest percentage of outlying observations still yielding the clean result with FAST-MCD, as was suggested by a referee. That is, the table says which percentage of outliers the algorithm can handle for given n and p. Increasing the number of starts only slightly improves this percentage. The computation times were quite low for the given values of n and p. Even for a sample size as high as 50,000, a few minutes suffice, whereas no previous algorithm we know of could handle such large datasets.

The currently most well-known algorithm for approximating the MCD estimator is the feasible subset algorithm (FSA) of Hawkins (1994). Instead of C-steps, it uses a different kind of steps, which for convenience we will baptize "I-steps," where the I stands for "interchanging points." An I-step proceeds as follows. Given the h-subset H_old with its average T_old and its covariance matrix S_old:

* repeat for each i in H_old and each j not in H_old:
  * put H_ij = (H_old \ {i}) U {j} (i.e., remove point i and add point j);
  * compute Delta_ij = det(S_old) - det(S(H_ij));
* keep the i' and j' with largest Delta_i'j';
* if Delta_i'j' <= 0, put H_new = H_old and stop; if Delta_i'j' > 0, put H_new = H_i'j'.

An I-step takes O(h(n - h)) = O(n^2) time because all pairs (i, j) are considered. If we would compute each S(H_ij) from scratch, the complexity would even become O(n^3), but Hawkins (1994, p. 203) used an update formula for det(S(H_ij)).
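The difference between the two kinds of steps can be made concrete in code. The sketch below is illustrative Python with NumPy, not the authors' implementation: `c_step` keeps the h points closest to the current estimates in one pass, while `i_step` tries every single-point interchange and applies the best improving one, which is where the O(h(n - h)) cost comes from (here each S(H_ij) is recomputed from scratch rather than updated).

```python
import numpy as np

def fit(X, H):
    """Mean and ML covariance of the subset indexed by H."""
    sub = X[H]
    T = sub.mean(axis=0)
    S = np.cov(sub, rowvar=False, bias=True)  # divide by h, as in the MCD objective
    return T, S

def c_step(X, H, h):
    """C-step: keep the h points with smallest distance to the current (T, S)."""
    T, S = fit(X, H)
    d2 = np.einsum('ij,jk,ik->i', X - T, np.linalg.inv(S), X - T)
    return np.sort(np.argsort(d2)[:h])  # may replace many points at once

def i_step(X, H):
    """I-step: try every swap of one inside point with one outside point."""
    n = len(X)
    best_det, best_H = np.linalg.det(fit(X, H)[1]), H
    outside = np.setdiff1d(np.arange(n), H)
    for i in H:                           # O(h * (n - h)) candidate pairs
        for j in outside:
            Hij = np.sort(np.append(np.setdiff1d(H, i), j))
            det_ij = np.linalg.det(fit(X, Hij)[1])
            if det_ij < best_det:         # keep the swap with the largest decrease
                best_det, best_H = det_ij, Hij
    return best_H                         # unchanged if no swap lowers det(S)
```

In both cases the determinant of the new subset's covariance matrix is no larger than before; the C-step, however, moves many points per iteration at the cost of only n distance evaluations plus a partial sort.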
Table 2. Performance of the FAST-MCD and FSA Algorithms on Larger Datasets, With Time in Seconds

                                           FAST-MCD              FSA
     n    p  % Np(0, Ip)  % Np(mu, Ip)  Robust    Time    Robust      Time
   100    2           51            49     yes       2       yes        50
   100    5           53            47     yes       5        no        80
   100   10           63            37     yes      40        no       110
   100   20           77            23     yes      70        no       350
   500    2           51            49     yes       7        no     2,800
   500    5           51            49     yes      25        no     3,800
   500   10           64            36     yes      84        no     4,100
   500   30           77            23     yes     695        no     8,300
 1,000    2           51            49     yes       8        no    20,000
 1,000    5           51            49     yes      20         -         -
 1,000   10           60            40     yes      75         -         -
 1,000   30           76            24     yes     600         -         -
10,000    2           51            49     yes       9         -         -
10,000    5           51            49     yes      25         -         -
10,000   10           63            37     yes      85         -         -
10,000   30           76            24     yes     700         -         -
50,000    2           51            49     yes      15         -         -
50,000    5           58            42     yes     140         -         -
50,000   10           75            25     yes     890         -         -
50,000   30           51            49     yes      45         -         -
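The contaminated samples behind Table 2 are straightforward to reproduce. A minimal NumPy sketch of the "shift outliers" model (the function name, seed, and the 51/49 split below are our own choices for illustration):

```python
import numpy as np

def shift_outlier_data(n, p, pct_clean, b=10.0, seed=0):
    """Mix of Np(0, Ip) (majority) and Np(mu, Ip) with mu = (b, ..., b)'."""
    rng = np.random.default_rng(seed)
    n_clean = round(n * pct_clean / 100)
    clean = rng.standard_normal((n_clean, p))
    shifted = rng.standard_normal((n - n_clean, p)) + b  # shift by b in every coordinate
    return np.vstack([clean, shifted])

# First row of Table 2: n = 100, p = 2, 51% majority, 49% contamination.
X = shift_outlier_data(n=100, p=2, pct_clean=51)
```

Feeding such samples to any MCD implementation reproduces the "robust: yes/no" experiment: a robust fit recovers the Np(0, Ip) majority, a nonrobust one describes the whole mixture.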
PETER J. ROUSSEEUW AND KATRIEN VAN DRIESSEN
The I-step can be iterated: If det(S_new) < det(S_old), we can take another I-step with H_new; otherwise, we stop. The resulting sequence det(S1) >= det(S2) >= ... must converge after a finite number of steps; that is, det(Sm) = 0 or det(Sm) = det(Sm-1), so det(Sm) can no longer be reduced by an I-step. This is again a necessary (but not sufficient) condition for (Tm, Sm) to be the global minimum of the MCD objective function. In our terminology, Hawkins's FSA algorithm can be written as follows:

* repeat many times:
  * draw an initial h-subset H1 at random;
  * carry out I-steps until convergence, yielding Hm;
* keep the Hm with lowest det(Sm);
* report this set Hm as well as (Tm, Sm).

In Tables 1 and 2 we have applied the FSA algorithm to the same datasets as FAST-MCD, using the same number of starts. For the small datasets in Table 1, the FSA and FAST-MCD yielded identical results. This is no longer true in Table 2, where the FSA begins to find nonrobust solutions. This is because of the following:

1. The FSA starts from randomly drawn h-subsets H1. Hence, for sufficiently large n all of the FSA starts are nonrobust, and subsequent iterations do not get away from the corresponding local minimum. We saw the same effect in Section 3.1, which also explained why it is better to start from random (p + 1)-subsets as in MINVOL and in FAST-MCD.

The tables also indicate that the FSA needs more time than FAST-MCD. In fact, time(FSA)/time(FAST-MCD) increases from 1 to 14 for n going from 12 to 75. In Table 2, the timing ratio goes from 25 (for n = 100) to 2,500 (for n = 1,000), after which we could no longer time the FSA algorithm. The FSA algorithm is more time-consuming than FAST-MCD because of the following:

2. An I-step takes O(n^2) time, compared to O(n) for the C-step of FAST-MCD.

3. Each I-step swaps only one point of H_old with one point outside H_old. In contrast, each C-step swaps h - |H_old intersect H_new| points inside H_old with the same number outside of H_old. Therefore, more I-steps are needed, especially for increasing n.

4. The FSA iterates I-steps until convergence, starting from each H1. On the other hand, FAST-MCD reduces the number of C-steps by the selective iteration technique of Section 3.2. The latter would not work for I-steps because of 3.

5. The FSA carries out all its I-steps in the full dataset of size n, even for large n. In the same situation, FAST-MCD applies the nested extensions method of Section 3.3, so most C-steps are carried out for n_sub = 300, some for n_merged = 1,500, and only a few for the actual n.

While this article was under review, Hawkins and Olive (1999) proposed an improved version of the FSA algorithm, as described at the end of our Section 2. To avoid confusion, we would like to clarify that the timings in Tables 1 and 2 were made with the original FSA algorithm described by Hawkins (1994), whereas the new version of the FSA is substantially faster (although it retains the same computational complexity as the original FSA, due to 2 in the preceding list).

In conclusion, we personally prefer the FAST-MCD algorithm because it is both robust and fast, even for large n.

7. APPLICATIONS

Let us now look at some applications to compare the FAST-MCD results with the classical mean and covariance matrix. At the same time we will illustrate a new tool, the distance-distance plot.

Example 1. We start with the coho salmon dataset (see Nickelson 1986) with n = 22 and p = 2, as shown in Figure 8(a). Each data point corresponds to one year. For 22 years the production of coho salmon in the wild was measured, in the Oregon Production Area. The x-coordinate is the logarithm of millions of smolts, and the y-coordinate is the logarithm of millions of adult coho salmon. We see that in most years the production of smolts lies between 2.2 and 2.4 on a logarithmic scale, whereas the production of adults lies between .0 and 1.0. The MCD tolerance ellipse excludes the years with a lower smolts production, thereby marking them as outliers. In contrast, the classical tolerance ellipse nearly contains the whole dataset and thus does not detect the existence of far outliers.

Figure 8. Coho Salmon Data: (a) Scatterplot With 97.5% Tolerance Ellipses Describing the MCD and the Classical Method; (b) Distance-Distance Plot.
Let us now introduce the distance-distance plot (DD plot), which plots the robust distances (based on the MCD) versus the classical Mahalanobis distances. On both axes in Figure 8(b) we have indicated the cutoff value sqrt(chi^2_{p,.975}) (here p = 2, yielding sqrt(chi^2_{2,.975}) = 2.72). If the data were not contaminated (say, if all the data came from a single bivariate normal distribution), then all points in the DD plot would lie near the dotted line. In this example many points lie in the rectangle where both distances are regular, whereas the outlying points lie higher. This happened because the MCD ellipse and the classical ellipse have a different orientation.

Naturally, the DD plot becomes more useful in higher dimensions, where it is not so easy to visualize the dataset and the ellipsoids.
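A DD plot is easy to produce with off-the-shelf tools. The sketch below uses scikit-learn's MinCovDet (which implements an MCD fit) and the chi-squared cutoff; it is an illustration on simulated data, not the authors' original code:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(1)
p = 2
X = np.vstack([rng.standard_normal((80, p)),          # majority
               rng.standard_normal((20, p)) + 6.0])   # shifted outliers

# Classical Mahalanobis distances and MCD-based robust distances
# (.mahalanobis() returns squared distances, hence the square roots).
md = np.sqrt(EmpiricalCovariance().fit(X).mahalanobis(X))
rd = np.sqrt(MinCovDet(random_state=0).fit(X).mahalanobis(X))

cutoff = np.sqrt(chi2.ppf(0.975, df=p))   # = 2.72 for p = 2, as in Figure 8(b)

# The DD plot is simply rd versus md, with the cutoff drawn on both axes:
# plt.scatter(md, rd); plt.axhline(cutoff); plt.axvline(cutoff)
flagged = rd > cutoff                      # points above the horizontal cutoff
```

Points far above the diagonal are those that only the robust distances expose, exactly the pattern described for Figure 8(b).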
Problem 1 (continued). Next we consider Problem 1 in Section 1. The Philips data represent 677 measurements of metal sheets with nine components each, and the Mahalanobis distances in Figure 1 indicated no groups of outliers. The MCD-based robust distances RD(x_i) in Figure 9(a) tell a different story. We now see a strongly deviating group of outliers, ranging from index 491 to index 565. Something happened in the production process that was not visible from the classical distances shown in Figure 1. Figure 9(a) also shows a remarkable change after the first 100 measurements. These phenomena were investigated and interpreted by the engineers at Philips. Note that the DD plot in Figure 9(b) again contrasts the classical and robust analyses. In Figure 9, (a) and (b), one can in fact see three groups: the first 100 points, those with index 491 to 565, and the majority.

Problem 2 (continued). We now apply FAST-MCD to the same n = 132,402 celestial objects with p = 6 variables as in Figure 3, which took only 2.5 minutes. (In fact, running the program on the same objects in all 27 dimensions took only 18 minutes!) Figure 10(a) plots the resulting MCD-based robust distances. In contrast to the homogeneous-looking Mahalanobis distances in Figure 3, the robust distances in Figure 10(a) clearly show that there is a majority with RD(x_i) <= sqrt(chi^2_{6,.975}) as well as a second group with RD(x_i) between 8 and 16. By exchanging our findings with the astronomers at the California Institute of Technology, we learned that the lower group consists mainly of stars and the upper group mainly of galaxies. Our main point is that the robust distances separate the data in two parts and thus provide more information than
Figure 9. Philips Data: (a) Plot of Robust Distances; (b) Distance-Distance Plot.

Figure 10. Digitized Palomar Data: (a) Plot of Robust Distances of Celestial Objects; (b) Their Distance-Distance Plot.
the Mahalanobis distances. That robust distances and Mahalanobis distances behave differently is illustrated in Figure 10(b), where we see the stars near the diagonal line and the galaxies above it.

Of course our analysis of these data was much more extensive and also used other data-analytic techniques not described here, but the ability to compute robust estimates of location and scatter for such large datasets was a key tool. Based on our work, these astronomers are thinking about modifying their classification of objects into stars and galaxies, especially for the faint light sources that are difficult to classify.

Example 2. We end this section by combining robust location/scatter with robust regression. The fire data (Andrews and Herzberg 1985) reported the incidences of fires in 47 residential areas of Chicago. One wants to explain the incidence of fire by the age of the houses, the income of the families living in them, and the incidence of theft. For this we apply the least trimmed squares (LTS) method of robust regression, with the usual value of h = [(3/4)n] = 35. In S-PLUS 4.5, the function "ltsreg" now automatically calls the function "cov.mcd," which runs the FAST-MCD algorithm, to obtain robust distances in x-space based on the MCD with the same h. Moreover, S-PLUS automatically provides the diagnostic plot of Rousseeuw and van Zomeren (1990), which plots the robust residuals versus the robust distances. For the fire data, this yields Figure 11, which shows the presence of one vertical outlier, that is, an observation with a small robust distance and a large LTS residual. We also see two bad leverage points, that is, observations (x, y) with outlying x and such that (x, y) does not follow the linear trend of the majority of the data. The other observations with robust distances to the right of the vertical cutoff line are good leverage points because they have small LTS residuals and hence follow the same linear pattern as the main group. In Figure 11 we see that most of these points are merely boundary cases, except for the two leverage points that are really far out in x-space.

Figure 11. Diagnostic Plot of the Fire Dataset.

8. CONCLUSIONS

The algorithm FAST-MCD proposed in this article is specifically tailored to the properties of the MCD estimator. The basic ideas are the C-step (Theorem 1 in Sec. 2), the procedure for generating initial estimates (Sec. 3.1), selective iteration techniques (Sec. 3.2), and nested extensions (Sec. 3.3). By exploiting the special structure of the problem, the new algorithm is faster and more effective than general-purpose techniques such as reducing the objective function by successively interchanging points. Simulations have shown that FAST-MCD is able to deal with large datasets while outrunning existing algorithms for the MVE and MCD by orders of magnitude. Another advantage of FAST-MCD is its ability to detect exact fit situations.

Due to the FAST-MCD algorithm, the MCD becomes accessible as a routine tool for analyzing multivariate data. Without extra cost we also obtain the DD plot, a new data display that plots the MCD-based robust distances versus the classical Mahalanobis distances. This is a useful tool to explore structure(s) in the data. Other possibilities include an MCD-based PCA and robustified versions of other multivariate analysis methods.

ACKNOWLEDGMENTS

We thank Doug Hawkins and Jose Agullo for making their programs available to us. We also dedicate special thanks to Gertjan Otten, Frans Van Dommelen, and Herman Veraa for giving us access to the Philips data, and to S. C. Odewahn and his research group at the California Institute of Technology for allowing us to analyze their digitized Palomar data. We are grateful to the referees and Technometrics editors Max Morris and Karen Kafadar for helpful comments improving the presentation.

APPENDIX: PROOF OF THEOREM 1

Proof. Assume that det(S2) > 0; otherwise the result is already satisfied. We can thus compute d2(i) = d_(T2,S2)(i) for all i = 1, ..., n. Using |H2| = h and the definition of (T2, S2), we find

  (1/hp) Sum_{i in H2} d2^2(i)
    = (1/hp) Sum_{i in H2} (x_i - T2)' S2^{-1} (x_i - T2)
    = (1/hp) tr( S2^{-1} Sum_{i in H2} (x_i - T2)(x_i - T2)' )
    = (1/p) tr(S2^{-1} S2) = (1/p) tr(I) = 1.                     (A.1)

Moreover, put

  lambda = (1/hp) Sum_{i in H2} d1^2(i) = (1/hp) Sum_{i=1}^{h} (d1^2)_{i:n}
         <= (1/hp) Sum_{j in H1} d1^2(j) = 1,                     (A.2)

where lambda > 0 because otherwise det(S2) = 0. Combining
(A.1) and (A.2) yields

  (1/hp) Sum_{i in H2} d^2_(T1, lambda S1)(i) = (1/lambda) (1/hp) Sum_{i in H2} d1^2(i) = lambda/lambda = 1.

Grubel (1988) proved that (T2, S2) is the unique minimizer of det(S) among all (T, S) for which (1/hp) Sum_{i in H2} d^2_(T,S)(i) = 1. This implies that det(S2) <= det(lambda S1). On the other hand, it follows from the inequality in (A.2) that det(lambda S1) = lambda^p det(S1) <= det(S1); hence

  det(S2) <= det(lambda S1) <= det(S1).                           (A.3)

Moreover, note that det(S2) = det(S1) if and only if both inequalities in (A.3) are equalities. For the first, we know from Grubel's result that det(S2) = det(lambda S1) if and only if (T2, S2) = (T1, lambda S1). For the second, det(lambda S1) = det(S1) if and only if lambda = 1, that is, lambda S1 = S1. Combining both yields (T2, S2) = (T1, S1).
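The inequality det(S2) <= det(S1) of Theorem 1 is easy to confirm numerically. A small NumPy check (our own simulation setup, with arbitrary dimensions and seed):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, h = 40, 3, 22
X = rng.standard_normal((n, p))

# An arbitrary h-subset H1 and its mean T1 and ML covariance S1.
H1 = rng.choice(n, size=h, replace=False)
T1 = X[H1].mean(axis=0)
S1 = np.cov(X[H1], rowvar=False, bias=True)

# C-step: H2 holds the h points with smallest d_(T1,S1)(i); Theorem 1
# then guarantees det(S2) <= det(S1), with equality only at a fixed point.
d1 = np.einsum('ij,jk,ik->i', X - T1, np.linalg.inv(S1), X - T1)
H2 = np.argsort(d1)[:h]
S2 = np.cov(X[H2], rowvar=False, bias=True)

assert np.linalg.det(S2) <= np.linalg.det(S1) + 1e-12
```

The assertion holds for any dataset and any starting subset with nonsingular S1, which is exactly what makes repeated C-steps a descent method for the MCD objective.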
[Received December 1997. Revised March 1999.]
REFERENCES
Agullo, J. (1996), "Exact Iterative Computation of the Multivariate Minimum Volume Ellipsoid Estimator With a Branch and Bound Algorithm," in Proceedings in Computational Statistics, ed. A. Prat, Heidelberg: Physica-Verlag, pp. 175-180.

Andrews, D. F., and Herzberg, A. M. (1985), Data, New York: Springer-Verlag.

Butler, R. W., Davies, P. L., and Jhun, M. (1993), "Asymptotics for the Minimum Covariance Determinant Estimator," The Annals of Statistics, 21, 1385-1400.

Coakley, C. W., and Hettmansperger, T. P. (1993), "A Bounded Influence, High Breakdown, Efficient Regression Estimator," Journal of the American Statistical Association, 88, 872-880.

Cook, R. D., Hawkins, D. M., and Weisberg, S. (1992), "Exact Iterative Computation of the Robust Multivariate Minimum Volume Ellipsoid Estimator," Statistics and Probability Letters, 16, 213-218.

Croux, C., and Haesbroeck, G. (in press), "Influence Function and Efficiency of the Minimum Covariance Determinant Scatter Matrix Estimator," Journal of Multivariate Analysis.

Davies, L. (1992), "The Asymptotics of Rousseeuw's Minimum Volume Ellipsoid Estimator," The Annals of Statistics, 20, 1828-1843.

Donoho, D. L. (1982), "Breakdown Properties of Multivariate Location Estimators," unpublished Ph.D. qualifying paper, Harvard University, Dept. of Statistics.

Grubel, R. (1988), "A Minimal Characterization of the Covariance Matrix," Metrika, 35, 49-52.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.

Hawkins, D. M. (1994), "The Feasible Solution Algorithm for the Minimum Covariance Determinant Estimator in Multivariate Data," Computational Statistics and Data Analysis, 17, 197-210.

Hawkins, D. M., and McLachlan, G. J. (1997), "High-Breakdown Linear Discriminant Analysis," Journal of the American Statistical Association, 92, 136-143.

Hawkins, D. M., and Olive, D. J. (1999), "Improved Feasible Solution Algorithms for High Breakdown Estimation," Computational Statistics and Data Analysis, 30, 1-11.

Lopuhaa, H. P., and Rousseeuw, P. J. (1991), "Breakdown Points of Affine Equivariant Estimators of Multivariate Location and Covariance Matrices," The Annals of Statistics, 19, 229-248.

Maronna, R. A. (1976), "Robust M-estimators of Multivariate Location and Scatter," The Annals of Statistics, 4, 51-67.

Meer, P., Mintz, D., Rosenfeld, A., and Kim, D. (1991), "Robust Regression Methods in Computer Vision: A Review," International Journal of Computer Vision, 6, 59-70.

Nickelson, T. E. (1986), "Influence of Upwelling, Ocean Temperature, and Smolt Abundance on Marine Survival of Coho Salmon (Oncorhynchus kisutch) in the Oregon Production Area," Canadian Journal of Fisheries and Aquatic Sciences, 43, 527-535.

Odewahn, S. C., Djorgovski, S. G., Brunner, R. J., and Gal, R. (1998), "Data From the Digitized Palomar Sky Survey," technical report, California Institute of Technology.

Rocke, D. M., and Woodruff, D. L. (1996), "Identification of Outliers in Multivariate Data," Journal of the American Statistical Association, 91, 1047-1061.

Rousseeuw, P. J. (1984), "Least Median of Squares Regression," Journal of the American Statistical Association, 79, 871-880.

--- (1985), "Multivariate Estimation With High Breakdown Point," in Mathematical Statistics and Applications, Vol. B, eds. W. Grossmann, G. Pflug, I. Vincze, and W. Wertz, Dordrecht: Reidel, pp. 283-297.

--- (1997), "Introduction to Positive-Breakdown Methods," in Handbook of Statistics, Vol. 15: Robust Inference, eds. G. S. Maddala and C. R. Rao, Amsterdam: Elsevier, pp. 101-121.

Rousseeuw, P. J., and Leroy, A. M. (1987), Robust Regression and Outlier Detection, New York: Wiley.

Rousseeuw, P. J., and van Zomeren, B. C. (1990), "Unmasking Multivariate Outliers and Leverage Points," Journal of the American Statistical Association, 85, 633-639.

Simpson, D. G., Ruppert, D., and Carroll, R. J. (1992), "On One-Step GM-estimates and Stability of Inferences in Linear Regression," Journal of the American Statistical Association, 87, 439-450.

Woodruff, D. L., and Rocke, D. M. (1993), "Heuristic Search Algorithms for the Minimum Volume Ellipsoid," Journal of Computational and Graphical Statistics, 2, 69-95.

--- (1994), "Computable Robust Estimation of Multivariate Location and Shape in High Dimension Using Compound Estimators," Journal of the American Statistical Association, 89, 888-896.