A Fast Algorithm for the Minimum Covariance Determinant Estimator

Author(s): Peter J. Rousseeuw and Katrien Van Driessen
Source: Technometrics, Vol. 41, No. 3 (Aug., 1999), pp. 212-223
Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association and the American Society for Quality
Stable URL: http://www.jstor.org/stable/1270566

A Fast Algorithm for the Minimum Covariance Determinant Estimator

Peter J. ROUSSEEUW
Department of Mathematics and Computer Science, Universitaire Instelling Antwerpen, Universiteitsplein 1, B-2610 Wilrijk, Belgium (peter.rousseeuw@uia.ua.ac.be)

Katrien VAN DRIESSEN
Faculty of Applied Economics, Universitaire Faculteiten Sint Ignatius, Prinsstraat 13, B-2000 Antwerp, Belgium (katrien.vandriessen@ufsia.ac.be)

The minimum covariance determinant (MCD) method of Rousseeuw is a highly robust estimator of multivariate location and scatter. Its objective is to find h observations (out of n) whose covariance matrix has the lowest determinant. Until now, applications of the MCD were hampered by the computation time of existing algorithms, which were limited to a few hundred objects in a few dimensions. We discuss two important applications of larger size, one about a production process at Philips with n = 677 objects and p = 9 variables, and a dataset from astronomy with n = 137,256 objects and p = 27 variables. To deal with such problems we have developed a new algorithm for the MCD, called FAST-MCD. The basic ideas are an inequality involving order statistics and determinants, and techniques which we call "selective iteration" and "nested extensions." For small datasets, FAST-MCD typically finds the exact MCD, whereas for larger datasets it gives more accurate results than existing algorithms and is faster by orders of magnitude. Moreover, FAST-MCD is able to detect an exact fit, that is, a hyperplane containing h or more observations. The new algorithm makes the MCD method available as a routine tool for analyzing multivariate data. We also propose the distance-distance plot (D-D plot), which displays MCD-based robust distances versus Mahalanobis distances, and illustrate it with some examples.

KEY WORDS: Breakdown value; Multivariate location and scatter; Outlier detection; Regression; Robust estimation.

It is difficult to detect outliers in p-variate data when p > 2 because one can no longer rely on visual inspection. Although it is still quite easy to detect a single outlier by means of the Mahalanobis distances, this approach no longer suffices for multiple outliers because of the masking effect, by which multiple outliers do not necessarily have large Mahalanobis distances. It is better to use distances based on robust estimators of multivariate location and scatter (Rousseeuw and Leroy 1987, pp. 265-269). In regression analysis, robust distances computed from the explanatory variables allow us to detect leverage points. Moreover, robust estimation of multivariate location and scatter is the key tool to robustify other multivariate techniques such as principal-component analysis and discriminant analysis.

Many methods for estimating multivariate location and scatter break down in the presence of n/(p + 1) outliers, where n is the number of observations and p is the number of variables, as was pointed out by Donoho (1982). For the breakdown value of the multivariate M-estimators of Maronna (1976), see Hampel, Ronchetti, Rousseeuw, and Stahel (1986, p. 296). In the meantime, several positive-breakdown estimators of multivariate location and scatter have been proposed. One of these is the minimum volume ellipsoid (MVE) method of Rousseeuw (1984, p. 877; 1985). This approach looks for the ellipsoid with smallest volume that covers h data points, where n/2 < h < n. Its breakdown value is essentially (n - h)/n.

Positive-breakdown methods such as the MVE and least trimmed squares regression (Rousseeuw 1984) are increasingly being used in practice, for example in finance, chemistry, electrical engineering, process control, and computer vision (Meer, Mintz, Rosenfeld, and Kim 1991). For a survey of positive-breakdown methods and some substantive applications, see Rousseeuw (1997).

The basic resampling algorithm for approximating the MVE, called MINVOL, was proposed by Rousseeuw and Leroy (1987). This algorithm considers a trial subset of p + 1 observations and calculates its mean and covariance matrix. The corresponding ellipsoid is then inflated or deflated to contain exactly h observations. This procedure is repeated many times, and the ellipsoid with the lowest volume is retained. For small datasets it is possible to consider all subsets of size p + 1, whereas for larger datasets the trial subsets are drawn at random.

Several other algorithms have been proposed to approximate the MVE. Woodruff and Rocke (1993) constructed algorithms combining the resampling principle with three heuristic search techniques: simulated annealing, genetic algorithms, and tabu search. Other people developed algorithms to compute the MVE exactly. This work started with the algorithm of Cook, Hawkins, and Weisberg (1992), which carries out an ingenious but still exhaustive search of all possible subsets of size h. In practice, this can be done for n up to about 30. Recently, Agullo (1996) developed an exact algorithm for the MVE that is based on a branch and bound procedure that selects the optimal subset without requiring the inspection of all subsets of size h. This is substantially faster and can be applied up to (roughly) n < 100 and p < 5. Because for most datasets the exact algorithms would take too long, the MVE is typically computed by versions of MINVOL, for example in S-PLUS (see the function cov.mve).

Presently there are several reasons for replacing the MVE by the minimum covariance determinant (MCD) estimator, which was also proposed by Rousseeuw (1984, p. 877; 1985). The MCD objective is to find h observations (out of n) whose classical covariance matrix has the lowest determinant. The MCD estimate of location is then the average of these h points, and the MCD estimate of scatter is their covariance matrix. The resulting breakdown value equals that of the MVE, but the MCD has several advantages over the MVE. Its statistical efficiency is better because the MCD is asymptotically normal (Butler, Davies, and Jhun 1993), whereas the MVE has a lower convergence rate (Davies 1992). As an example, the asymptotic efficiency of the MCD scatter matrix with the typical coverage h = .75n is 44% in 10 dimensions, and the reweighted covariance matrix with weights obtained from the MCD attains 83% of efficiency (Croux and Haesbroeck in press), whereas the MVE attains 0%. The MCD's better accuracy makes it very useful as an initial estimate for one-step regression estimators (Simpson, Ruppert, and Carroll 1992; Coakley and Hettmansperger 1993). Robust distances based on the MCD are more precise than those based on the MVE and hence better suited to expose multivariate outliers, for example in the diagnostic plot of Rousseeuw and van Zomeren (1990), which displays robust residuals versus robust distances. Moreover, the MCD is a key component of the hybrid estimators of Woodruff and Rocke (1994) and Rocke and Woodruff (1996) and of high-breakdown linear discriminant analysis (Hawkins and McLachlan 1997).

In spite of all these advantages, until now the MCD has rarely been applied because it was harder to compute. In this article, however, we construct a new MCD algorithm that is actually much faster than any existing MVE algorithm. The new MCD algorithm can deal with a sample size n in the tens of thousands. As far as we know, none of the existing MVE algorithms can cope with such large sample sizes. Because the MCD now greatly outperforms the MVE in terms of both statistical efficiency and computation speed, we recommend the MCD method.

© 1999 American Statistical Association and the American Society for Quality. TECHNOMETRICS, AUGUST 1999, VOL. 41, NO. 3

1. MOTIVATING PROBLEMS

Two recent problems will be shown to illustrate the need for a fast, robust method that can deal with many objects (n) and/or many variables (p) while maintaining a reasonable statistical efficiency.

Problem 1 (Engineering). We are grateful to Gertjan Otten for providing the following problem. Philips Mecoma (The Netherlands) is producing diaphragm parts for TV sets. These are thin metal plates, molded by a press. Recently a new production line was started, and for each of n = 677 parts, nine characteristics were measured. The aim of the multivariate analysis is to gain insight into the production process and the interrelations between the nine measurements, and to find out whether deformations or abnormalities have occurred and why. Afterward, the estimated location and scatter matrix can be used for multivariate statistical process control.

Due to the support of Herman Veraa and Frans Van Dommelen (at Philips PMF/Mecoma, Product Engineering, P.O. Box 218, 5600 MD Eindhoven, The Netherlands), we obtained permission to analyze these data and to publish the results.

Figure 1 shows the classical Mahalanobis distance

MD(xi) = sqrt( (xi - T0)' S0^{-1} (xi - T0) )    (1.1)

versus the index i, which corresponds to the production sequence. Here xi is nine-dimensional, T0 is the arithmetic mean, and S0 is the classical covariance matrix. The horizontal line is at the usual cutoff value sqrt(χ²_{9,.975}) = 4.36.

In Figure 1 it seems that most observations are consistent with the classical multivariate normal model, except for a few isolated outliers. This should not surprise us, even in the first experimental run of a new production line, because the Mahalanobis distances are known to suffer from masking. That is, even if there were a group of outliers (here, deformed diaphragm parts), they would affect T0 and S0 in such a way as to become invisible in Figure 1. To further investigate these data, we need robust estimators T and S, preferably with a substantial statistical efficiency so that we can be confident that any effects that may become visible are real and not due to the estimator's inefficiency. After developing the FAST-MCD algorithm, we will return to these data in Section 7.

Figure 1. Plot of Mahalanobis Distances for the Philips Data.
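The classical distances in (1.1) are straightforward to compute; the sketch below (Python with NumPy, not from the original program, with random data standing in for the non-public Philips measurements) reproduces the cutoff value 4.36 for p = 9. The quantile χ²_{9,.975} ≈ 19.023 is hardcoded rather than taken from a stats library.

```python
import numpy as np

def mahalanobis_distances(X):
    """Classical Mahalanobis distances MD(xi) of (1.1): T0 is the
    arithmetic mean and S0 the classical covariance matrix of X."""
    T0 = X.mean(axis=0)
    S0 = np.cov(X, rowvar=False)
    diff = X - T0
    # squared form (xi - T0)' S0^{-1} (xi - T0), one value per row
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S0), diff)
    return np.sqrt(d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(677, 9))   # stand-in for the Philips data
md = mahalanobis_distances(X)

# usual cutoff sqrt(chi^2_{9,.975}); the quantile 19.023 is hardcoded
cutoff = np.sqrt(19.023)
print(round(float(cutoff), 2))  # 4.36
```

Points with md above the cutoff would be flagged, but, as the text explains, masking can hide a whole group of outliers from these classical distances.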

Problem 2 (Physical Sciences). A group of astronomers at the California Institute of Technology are working on the Digitized Palomar Sky Survey (DPOSS); for a full description, see their report (Odewahn, Djorgovsky, Brunner, and Gal 1998). In essence, they make a survey of celestial objects (light sources) for which they record nine characteristics (such as magnitude, area, image moments) in each of three bands: blue, red, and near-infrared. They seek collaboration with statisticians to analyze their data, and gave us access to a part of their database, containing 137,256 celestial objects with all 27 variables.

We started by using quantile-quantile plots, Box-Cox transforms, selecting one variable out of three variables with near-perfect linear correlation, and other tools of data analysis. One of these avenues led us to study six variables (two from each band). Figure 2 plots the Mahalanobis distances (1.1) for these data (to avoid overplotting, Fig. 2 shows only 10,000 randomly drawn points from the entire plot). The cutoff is sqrt(χ²_{6,.975}) = 3.82. In Figure 2 we see two groups of outliers with MD(xi) ≈ 9 and MD(xi) ≈ 12, plus some outliers still further away.

Returning to the data and their astronomical meaning, it turned out that these were all objects for which one or more variables fell outside the range of what is physically possible. So, the MD(xi) did help us to find outliers at this stage. We then cleaned the data by removing all objects with a physically impossible measurement, which reduced our sample size to 132,402. To these data we then again applied the classical mean T0 and covariance S0, yielding the plot of Mahalanobis distances in Figure 3.

Figure 3 looks innocent, like observations from the χ6 distribution, as if the data would form a homogeneous population (which is doubtful because we know that the database contains stars as well as galaxies). To proceed further, we need high-breakdown estimates T and S and an algorithm that can compute them for n = 132,402. Such an algorithm will be constructed in the next sections.
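The cleaning step described above amounts to simple range filtering. A sketch (Python with NumPy; the bounds here are placeholders, not the survey's actual physical limits, and random data stand in for the DPOSS variables):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(10000, 6))        # stand-in for the six DPOSS variables

# hypothetical physical ranges, one (low, high) pair per variable
bounds = np.array([[-5.0, 5.0]] * 6)
ok = np.all((X >= bounds[:, 0]) & (X <= bounds[:, 1]), axis=1)
X_clean = X[ok]                        # keep rows with no impossible value
```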

Figure 2. Digitized Palomar Data: Plot of Mahalanobis Distances of Celestial Objects, Based on Six Variables Concerning Magnitude and Image Moments.

Figure 3. Digitized Palomar Data: Plot of Mahalanobis Distances of Celestial Objects as in Figure 2 After Removal of Physically Impossible Measurements.

2. BASIC THEOREM AND THE C-STEP

A key step of the new algorithm is the fact that, starting from any approximation to the MCD, it is possible to compute another approximation with an even lower determinant.

Theorem 1. Consider a dataset Xn = {x1, ..., xn} of p-variate observations. Let H1 ⊂ {1, ..., n} with |H1| = h, and put T1 := (1/h) Σ_{i∈H1} xi and S1 := (1/h) Σ_{i∈H1} (xi - T1)(xi - T1)'. If det(S1) ≠ 0, define the relative distances

d1(i) := sqrt( (xi - T1)' S1^{-1} (xi - T1) )    for i = 1, ..., n.

Now take H2 such that {d1(i); i ∈ H2} := {(d1)1:n, ..., (d1)h:n}, where (d1)1:n ≤ (d1)2:n ≤ ... ≤ (d1)n:n are the ordered distances, and compute T2 and S2 based on H2. Then

det(S2) ≤ det(S1)

with equality if and only if T2 = T1 and S2 = S1.

The proof is given in the Appendix. Although this theorem appears to be quite basic, we have been unable to find it in the literature.

The theorem requires that det(S1) ≠ 0, which is no real restriction because if det(S1) = 0 we already have the minimal objective value. Section 5 will explain how to interpret the MCD in such a singular situation.

If det(S1) > 0, applying the theorem yields S2 with det(S2) ≤ det(S1). In our algorithm we will refer to the construction in Theorem 1 as a C-step, where C stands for "concentration" because we concentrate on the h observations with smallest distances and S2 is more concentrated (has a lower determinant) than S1. In algorithmic terms, the C-step can be described as follows.

Given the h-subset Hold or the pair (Told, Sold), perform the following:

1. Compute the distances dold(i) for i = 1, ..., n.
2. Sort these distances, which yields a permutation π for which dold(π(1)) ≤ dold(π(2)) ≤ ... ≤ dold(π(n)).
3. Put Hnew := {π(1), π(2), ..., π(h)}.
4. Compute Tnew := ave(Hnew) and Snew := cov(Hnew).
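In code, one C-step is only a few lines. The sketch below (Python with NumPy; the function names are ours, not from the authors' program) also iterates C-steps until the determinant stops decreasing, anticipating the discussion that follows; it assumes the covariance matrices stay nonsingular (cf. Section 5).

```python
import numpy as np

def c_step(X, H_old, h):
    """One C-step: compute (T, S) from the current h-subset, then keep
    the h observations with the smallest relative distances."""
    T = X[H_old].mean(axis=0)
    S = np.cov(X[H_old], rowvar=False, bias=True)  # 1/h normalization, as in Theorem 1
    diff = X - T
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
    return np.argsort(d2)[:h]

def iterate_c_steps(X, H, h, max_steps=100):
    """Repeat C-steps until det(S) stops decreasing; by Theorem 1 the
    determinants are nonincreasing, so this terminates."""
    prev_det = np.inf
    for _ in range(max_steps):
        H = c_step(X, H, h)
        S = np.cov(X[H], rowvar=False, bias=True)
        det = np.linalg.det(S)
        if det == 0.0 or det >= prev_det:
            break
        prev_det = det
    return H, X[H].mean(axis=0), S

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
h = (400 + 2 + 1) // 2                       # default h = [(n+p+1)/2] = 201
H0 = rng.choice(400, size=h, replace=False)  # an arbitrary starting h-subset
H_final, T_final, S_final = iterate_c_steps(X, H0, h)
```

Each call to `c_step` is O(n) apart from the sort, which a production version would replace by an O(n) partial selection, as the text notes next.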


For a fixed number of dimensions p, the C-step takes only O(n) time [because Hnew can be determined in O(n) operations without sorting all the dold(i) distances].

Repeating C-steps yields an iteration process. If det(S2) = 0 or det(S2) = det(S1), we stop; otherwise, we run another C-step yielding det(S3), and so on. The sequence det(S1) ≥ det(S2) ≥ det(S3) ≥ ... is nonnegative and hence must converge. In fact, because there are only finitely many h-subsets, there must be an index m such that det(Sm) = 0 or det(Sm) = det(S_{m-1}); hence convergence is reached. (In practice, m is often below 10.) Afterward, running the C-step on (Tm, Sm) no longer reduces the determinant. This is not sufficient for det(Sm) to be the global minimum of the MCD objective function, but it is a necessary condition.

Theorem 1 thus provides a partial idea for an algorithm:

Take many initial choices of H1 and apply C-steps to each until convergence, and keep the solution with lowest determinant.    (2.1)

Of course, several questions must be answered to make (2.1) operational: How do we generate sets H1 to begin with? How many H1 are needed? How do we avoid duplication of work because several H1 may yield the same solution? Can we do with fewer C-steps? What about large sample sizes? These matters will be discussed in the next sections.

Corollary 1. The MCD subset H of Xn is separated from Xn \ H by an ellipsoid.

Proof: For the MCD subset H, and in fact any limit of a C-step sequence, applying the C-step to H yields H itself. This means that all xi ∈ H satisfy (xi - T)'S^{-1}(xi - T) ≤ q := {(x - T)'S^{-1}(x - T)}_{h:n}, whereas all xj ∉ H satisfy (xj - T)'S^{-1}(xj - T) ≥ q. Take the ellipsoid E = {x; (x - T)'S^{-1}(x - T) ≤ q}. Then H ⊂ E and Xn \ H ⊂ closure(E^c). Note that there is at least one point xi ∈ H on the boundary of E, whereas there may or may not be a point xj ∉ H on the boundary of E.

The same result was proved by Butler et al. (1993) under the extra condition that a density exists. Note that the ellipsoid in Corollary 1 contains h observations but is not necessarily the smallest ellipsoid to do so, which would yield the MVE. We know of no technique like the C-step for the MVE estimator; hence, the latter estimator cannot be computed faster in this way.

Independently of our work, Hawkins and Olive (1999) discovered a version of Corollary 1 in the following form: "A necessary condition for the MCD optimum is that, if we calculate the distance of each case from the location vector using the scatter matrix, each covered case must have smaller distance than any uncovered case." This necessary condition could perhaps be called the "C-condition," as opposed to the C-step of Theorem 1, where we proved that a C-step always decreases det(S). In the absence of Theorem 1, Hawkins and Olive (1999) used the C-condition as a preliminary screen, followed by case swapping as a technique for decreasing det(S), as in the feasible solution approach (Hawkins 1994), which will be described in Section 6. The C-condition did not reduce the time complexity of this approach, but it did reduce the actual computation time in experiments with fixed n.

3. CONSTRUCTION OF THE NEW ALGORITHM

3.1 Creating Initial Subsets H1

To apply the algorithmic concept (2.1), we first have to decide how to construct the initial subsets H1. Let us consider the following two possibilities:

1. Draw a random h-subset H1.
2. Draw a random (p + 1)-subset J, and then compute T0 := ave(J) and S0 := cov(J). [If det(S0) = 0, then extend J by adding another random observation, and continue adding observations until det(S0) > 0.] Then compute the distances d0²(i) := (xi - T0)'S0^{-1}(xi - T0) for i = 1, ..., n. Sort them into d0(π(1)) ≤ ... ≤ d0(π(n)) and put H1 := {π(1), ..., π(h)}.

Option 1 is the simplest, whereas 2 starts like the MINVOL algorithm (Rousseeuw and Leroy 1987, pp. 259-260). It would be useless to draw fewer than p + 1 points, for then S0 is always singular.

When the dataset does not contain outliers or deviating groups of points, it makes little difference whether (2.1) is applied with 1 or 2. But because the MCD is a very robust estimator, we have to consider contaminated datasets in particular. For instance, we generated a dataset with n = 400 observations and p = 2 variables, in which 205 observations were drawn from the bivariate normal distribution with mean (0, 0)' and covariance matrix diag(1, 2), and the other 195 observations were drawn from a bivariate normal distribution with mean (10, 0)'. The MCD has its highest possible breakdown value when h = [(n + p + 1)/2] (see Lopuhaa and Rousseeuw 1991), which becomes h = 201 here. We now apply (2.1) with 500 starting sets H1. Using option 1 yields a resulting (T, S) whose 97.5% tolerance ellipse is shown in Figure 4(a). Clearly, this result has broken down due to the contaminated data. On the other hand, option 2 yields the result in Figure 4(b), which concentrates on the majority (51.25%) of the data.

The situation in Figure 4 is extreme, but it is useful for illustrative purposes. (The same effect also occurs for smaller amounts of contamination, especially in higher dimensions.) Approach 1 has failed because each random subset H1 contains a sizable number of points from the majority group as well as from the minority group, which follows from the law of large numbers. When starting from a bad subset H1, the iterations will not converge to the major solution. On the other hand, the probability of a (p + 1)-subset without outliers is much higher, which explains why 2 yields many subsets H1 consisting of points from the majority and hence a robust result. From now on, we will always use 2.
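Construction 2 above can be sketched as follows (Python with NumPy; the function name and the illustrative figures for the outlier fraction and the number of starts in the probability check are ours):

```python
import numpy as np

def initial_h_subset(X, h, rng):
    """Option 2 of Subsection 3.1: draw a random (p+1)-subset J, extend
    it while cov(J) is singular, then keep the h points closest to
    its mean T0 and covariance S0."""
    n, p = X.shape
    J = list(rng.choice(n, size=p + 1, replace=False))
    while True:
        S0 = np.cov(X[J], rowvar=False)
        if np.linalg.det(S0) > 0:
            break
        rest = [i for i in range(n) if i not in J]
        J.append(int(rng.choice(rest)))   # extend J by one more observation
    T0 = X[J].mean(axis=0)
    diff = X - T0
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S0), diff)
    return np.argsort(d2)[:h]

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
H1 = initial_h_subset(X, h=201, rng=rng)

# the point of option 2, quantified by (3.1): with, say, eps = 40%
# outliers, p = 2, and m = 500 starts, at least one clean (p+1)-subset
# is almost certain
eps, m, p = 0.40, 500, 2
prob = 1 - (1 - (1 - eps) ** (p + 1)) ** m
```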

Figure 4. Results of Iterating C-Steps Starting From 500 Random Subsets H1 of (a) Size h = 201 and (b) Size p + 1 = 3. Both panels show the 97.5% tolerance ellipse.

Remark. For increasing n, the probability of having at least one "clean" (p + 1)-subset among m random (p + 1)-subsets tends to

1 - (1 - (1 - ε)^{p+1})^m > 0,    (3.1)

where ε is the percentage of outliers. In contrast, the probability of having at least one clean h-subset among m random h-subsets tends to 0 because h increases with n.

3.2 Selective Iteration

Each C-step calculates a covariance matrix, its determinant, and all relative distances. Therefore, reducing the number of C-steps would improve the speed. But is this possible without losing the effectiveness of the algorithm? It turns out that often the distinction between robust solutions and nonrobust solutions already becomes visible after two or three C-steps. For instance, consider the data of Figure 4 again. The inner workings of the algorithm (2.1) are traced in Figure 5. For each starting subsample H1, the determinant of the covariance matrix Sj based on h = 201 observations is plotted versus the step number j. The runs yielding a robust solution are shown as solid lines, whereas the dashed lines correspond to nonrobust results. To get a clear picture, Figure 5 only shows the first 100 starts. After two C-steps (i.e., for j = 3), many subsamples H3 that will lead to the global optimum already have a rather small determinant. The global optimum is a solution that contains none of the 195 "bad" points. By contrast, the determinants of the subsets H3 leading to a false classification are considerably larger. For that reason, we can save much computation time and still obtain the same result by taking just two C-steps and retaining only the (say, 10) best H3 subsets to iterate further. Other datasets, also in more dimensions, confirm these conclusions. Therefore, from now on we will take only two C-steps from each initial subsample H1, select the 10 different subsets H3 with the lowest determinants, and for these 10 only we will continue taking C-steps until convergence.

Figure 5. Covariance Determinant of Subsequent C-Steps in the Dataset of Figure 4. Each sequence stops when no further reduction is obtained.

3.3 Nested Extensions

For a small sample size n, the preceding algorithm does not take much time. But when n grows, the computation time increases, mainly due to the n distances that need to be calculated each time. To avoid doing all the computations in the entire dataset, we will consider a special structure. When n > 1,500, the algorithm generates a nested system of subsets that looks like Figure 6, where the arrows mean "is a subset of." The five subsets of size 300 do not overlap, and together they form the merged set of size 1,500, which in turn is a proper subset of the dataset of size n. [Already the algorithm of Woodruff and Rocke (1994) made use of partitioning for this purpose. The only difference with the nested extensions in Fig. 6 is that we work with two stages, hence our use of the word "nested," whereas Woodruff and Rocke partitioned the entire dataset, which yields more and/or larger subsets.] To construct Figure 6, the algorithm draws 1,500 observations, one by one, without replacement. The first 300 observations it encounters are put in the first subset, and so on. Because of this mechanism, each subset of size 300 is roughly representative for the dataset, and the merged set with 1,500 cases is even more representative.

Figure 6. Nested System of Subsets Generated by the FAST-MCD Algorithm.

When n ≤ 600, we will keep the algorithm as in the previous section, while for n ≥ 1,500 we will use Figure 6. When 600 < n < 1,500, we will partition the data into at most four subsets of 300 or more observations so that each observation belongs to a subset and such that the subsets have roughly the same size. For instance, 601 will be split as 300 + 301 and 900 as 450 + 450. For n = 901, we use 300 + 300 + 301, and we continue until 1,499 = 375 + 375 + 375 + 374. By splitting 601 as 300 + 301 we do not mean that the first subset contains the observations with case numbers 1, ..., 300 but that its 300 case numbers were drawn randomly from 1, ..., 601.

Whenever n > 600 (and whether n < 1,500 or not), our new algorithm for the MCD will take two C-steps from several starting subsamples H1 within each subset, with a total of 500 starts for all subsets together. For every subset the best 10 solutions are stored. Then the subsets are pooled, yielding a merged set with at most 1,500 observations. Each of these (at most 50) available solutions (Tsub, Ssub) is then extended to the merged set. That is, starting from each (Tsub, Ssub), we continue taking C-steps, which now use all 1,500 observations in the merged set. Only the best 10 solutions (Tmerged, Smerged) will be considered further. Finally, each of these 10 solutions is extended to the full dataset in the same way, and the best solution (Tfull, Sfull) is reported.

Because the final computations are carried out in the entire dataset, they take more time when n increases. In the interest of speed we can limit the number of initial solutions (Tmerged, Smerged) and/or the number of C-steps in the full dataset as n becomes large.

The main idea of this subsection was to carry out C-steps in several nested random subsets, starting with small subsets of around 300 observations and ending with the entire dataset of n observations. Throughout this subsection, we have chosen several numbers such as five subsets of 300 observations, 500 starts, 10 best solutions, and so on. These choices were based on various empirical trials (not reported here). We implemented our choices as defaults so the user does not have to choose anything, but of course the user may change the defaults.

4. THE RESULTING ALGORITHM FAST-MCD

Combining all the components of the preceding sections yields the new algorithm, which we will call FAST-MCD. Its pseudocode looks as follows:

1. The default h is [(n + p + 1)/2], but the user may choose any integer h with [(n + p + 1)/2] ≤ h ≤ n. The program then reports the MCD's breakdown value (n - h + 1)/n. If you are sure that the dataset contains less than 25% contamination, which is usually the case, a good compromise between breakdown value and statistical efficiency is obtained by putting h = [.75n].

2. If h = n, then the MCD location estimate T is the average of the whole dataset, and the MCD scatter estimate S is its covariance matrix. Report these and stop.

3. If p = 1 (univariate data), compute the MCD estimate (T, S) by the exact algorithm of Rousseeuw and Leroy (1987, pp. 171-172) in O(n log n) time; then stop.

4. From here on, h < n and p ≥ 2. If n is small (say, n ≤ 600), then

* repeat (say) 500 times:
  * construct an initial h-subset H1 using method 2 in Subsection 3.1, that is, starting from a random (p + 1)-subset;
  * carry out two C-steps (described in Sec. 2);
* for the 10 results with lowest det(S3):
  * carry out C-steps until convergence;
* report the solution (T, S) with lowest det(S).

5. If n is larger (say, n > 600), then

* construct up to five disjoint random subsets of size nsub according to Section 3.3 (say, five subsets of size nsub = 300);
* inside each subset, repeat 500/5 = 100 times:
  * construct an initial subset H1 of size hsub = [nsub(h/n)];
  * carry out two C-steps, using nsub and hsub;
  * keep the 10 best results (Tsub, Ssub);
* pool the subsets, yielding the merged set (say, of size nmerged = 1,500);
* in the merged set, repeat for each of the 50 solutions (Tsub, Ssub):
  * carry out two C-steps, using nmerged and hmerged = [nmerged(h/n)];
  * keep the 10 best results (Tmerged, Smerged);
* in the full dataset, repeat for the mfull best results:
  * take several C-steps, using n and h;
  * keep the best final result (Tfull, Sfull).

Here, mfull and the number of C-steps (preferably, until convergence) depend on how large the dataset is.

We will refer to the preceding as the FAST-MCD algorithm. Note that it is affine equivariant: When the data are translated or subjected to a linear transformation, the resulting (Tfull, Sfull) will transform accordingly. The computer program contains two more steps:

6. To obtain consistency when the data come from a multivariate normal distribution, we put

TMCD = Tfull

SMCD = ( med_i d²_{(Tfull, Sfull)}(i) / χ²_{p,.50} ) Sfull.

7. A one-step reweighted estimate is obtained by

T1 = ( Σ_{i=1}^{n} wi xi ) / ( Σ_{i=1}^{n} wi )

S1 = ( Σ_{i=1}^{n} wi (xi - T1)(xi - T1)' ) / ( Σ_{i=1}^{n} wi - 1 ),

where wi = 1 if d_{(TMCD, SMCD)}(i) ≤ sqrt(χ²_{p,.975}) and wi = 0 otherwise.

Figure 7. Exact Fit Situation (n = 100, p = 2).
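Steps 6 and 7 above can be sketched as follows (Python with NumPy; the χ² quantiles for p = 2, namely χ²_{2,.50} ≈ 1.386 and χ²_{2,.975} ≈ 7.378, are hardcoded rather than taken from a stats library, and the function name is ours):

```python
import numpy as np

CHI2_P2_50 = 1.386    # chi^2_{2,.50}
CHI2_P2_975 = 7.378   # chi^2_{2,.975}

def consistency_and_reweight(X, T_full, S_full):
    """Step 6: rescale S_full for consistency at the normal model.
    Step 7: one-step reweighted mean and covariance."""
    diff = X - T_full
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S_full), diff)
    T_mcd = T_full
    S_mcd = (np.median(d2) / CHI2_P2_50) * S_full

    # weights wi = 1 iff the robust distance is below the .975 cutoff
    diff = X - T_mcd
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S_mcd), diff)
    w = d2 <= CHI2_P2_975
    T1 = X[w].mean(axis=0)
    S1 = np.cov(X[w], rowvar=False)   # denominator sum(wi) - 1
    return T_mcd, S_mcd, T1, S1

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
# for this clean sample the classical estimates stand in for (Tfull, Sfull)
T_mcd, S_mcd, T1, S1 = consistency_and_reweight(
    X, X.mean(axis=0), np.cov(X, rowvar=False, bias=True))
```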

Xp,.975
The program FAST-MCD has been thoroughly tested and can be obtained from our Web site http://win-www.uia.ac.be/u/statis/index.html. It has been incorporated into S-PLUS 4.5 (as the function "cov.mcd") and it is also in SAS/IML 7 (as the function "MCD").

5. EXACT FIT SITUATIONS

An important advantage of the FAST-MCD algorithm is that it allows for exact fit situations, that is, when h or more observations lie on a hyperplane. Then the algorithm still yields the MCD location T and scatter matrix S, the latter being singular as it should be. From (T, S) the program then computes the equation of the hyperplane.

When n is larger than (say) 600, the algorithm performs many calculations on subsets of the data. To deal with the combination of large n and exact fits, we added a few steps to the algorithm. Suppose that, during the calculations in a subset, we encounter some (T_sub, S_sub) with det(S_sub) = 0. Then we know that there are h_sub or more observations on the corresponding hyperplane. First we check whether h or more points of the full dataset lie on this hyperplane. If so, we compute (T_full, S_full) as the mean and covariance matrix of all points on the hyperplane, report this final result, and stop; if not, we continue. Because det(S_sub) = 0 is the best solution for that subset, we know that (T_sub, S_sub) will be among the 10 best solutions that are passed on. In the merged set we take the set H' of the h_merged observations with smallest orthogonal distances to the hyperplane, and start the next C-step from H'. Again, it is possible that during the calculations in the merged set we encounter some (T_merged, S_merged) with det(S_merged) = 0, in which case we repeat the preceding procedure.

As an illustration, the dataset in Figure 7 consists of 45 observations generated from a bivariate normal distribution, plus 55 observations that were generated on a straight line (using a univariate normal distribution). The FAST-MCD program (with default value h = 51) finds this line within .3 seconds. A part of the output follows:

There are 55 observations in the entire dataset of 100 observations that lie on the line with the equation

$$.000000\,(x_{i1} - m_1) + 1.000000\,(x_{i2} - m_2) = 0,$$

where the mean (m_1, m_2) = (.10817, 5.00000) of these observations is the MCD location, and their covariance matrix is the MCD scatter matrix

$$\begin{pmatrix} 1.40297 & .00000 \\ .00000 & .00000 \end{pmatrix}.$$

Therefore, the data are in an "exact fit" position. In such a situation the MCD scatter matrix has determinant 0, and its tolerance ellipse becomes the line of exact fit.

If the original data were in p dimensions and it turns out that most of the data lie on a hyperplane, it is possible to apply FAST-MCD again to the data in this (p - 1)-dimensional space.

6. PERFORMANCE OF FAST-MCD

To get an idea of the performance of the overall algorithm, we start by applying FAST-MCD to some small datasets taken from Rousseeuw and Leroy (1987). To be precise, these were all regression datasets, but we ran FAST-MCD only on the explanatory variables, that is, not using the response variable. The first column of Table 1 lists the name of each dataset, followed by n and p. We used the default value of h = [(n + p + 1)/2]. The next column shows the number of starting (p + 1)-subsets used in FAST-MCD, which is usually 500 except for two datasets in which the number of possible (p + 1)-subsets out of n was fairly small, namely $\binom{12}{3} = 220$ and $\binom{18}{3} = 816$, so we used all of them.

The next entry in Table 1 is the result of FAST-MCD, given here as the final h-subset. Comparing these with the exact MCD algorithm of Agulló (personal communication, 1997), it turns out that these h-subsets do yield the exact global minimum of the objective function. The next column shows the running time of FAST-MCD in seconds on a Sun Ultra 2170. These times are much shorter than those of our MINVOL program for computing the MVE estimator.
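Returning for a moment to the exact-fit situations of Section 5: one way to recover the reported hyperplane from a singular scatter matrix is through its null eigenvector (a sketch under our own naming; the paper does not spell out this particular computation):

```python
import numpy as np

def hyperplane_of_exact_fit(s, tol=1e-12):
    """If the scatter matrix s is singular, return a unit coefficient
    vector a of the hyperplane a'(x - T) = 0; otherwise return None."""
    eigvals, eigvecs = np.linalg.eigh(s)   # eigenvalues in ascending order
    if eigvals[0] > tol:                   # smallest eigenvalue > 0: no exact fit
        return None
    return eigvecs[:, 0]                   # null direction of s

# The MCD scatter matrix of the Figure 7 example (55 points on a
# horizontal line): all variance is in the first coordinate.
s = np.array([[1.40297, 0.0], [0.0, 0.0]])
a = hyperplane_of_exact_fit(s)             # a is (0, 1) up to sign
```

The coefficients (0, 1) match the equation printed in the output above the table: the line of exact fit is orthogonal to the null direction of the scatter matrix.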
TECHNOMETRICS, AUGUST 1999, VOL. 41, NO. 3

This content downloaded from 157.253.50.50 on Sun, 01 Nov 2015 04:37:12 UTC
All use subject to JSTOR Terms and Conditions


Table 1. Performance of the FAST-MCD and FSA Algorithms on Some Small Datasets

                                                                            Time (seconds)
Dataset     n   p  Starts  Best h-subset found                           FAST-MCD     FSA
Heart      12   2    220   1 3 4 5 7 9 11                                     .6       .6
Phosphor   18   2    816   3 5 8 9 11 12 13 14 15 17                         1.8      3.7
Stackloss  21   3    500   4 5 6 7 8 9 10 11 12 13 14 20                     2.1      4.6
Coleman    20   5    500   2 3 4 5 7 8 12 13 14 16 17 19 20                  4.2      8.9
Wood       20   5    500   1 2 3 5 9 10 12 13 14 15 17 18 20                 4.3      8.2
Salinity   28   3    500   1 2 6 7 8 12 13 14 18 20 21 22 25 26 27 28        2.4      8.6
HBK        75   3    500   15 16 17 18 19 20 21 22 23 24 26 27 31 32         5.0     71.5
                           33 35 36 37 38 40 43 49 50 51 54 55 56 58
                           59 61 63 64 66 67 70 71 72 73 74

We may conclude that for these small datasets FAST-MCD gives very accurate results in little time.

Let us now try the algorithm on larger datasets, with n > 100. In each dataset, we generated over 50% of the points from the standard multivariate normal distribution N_p(0, I_p), and the remaining points from N_p(μ, I_p), where μ = (b, b, ..., b)' with b = 10. This is the model of "shift outliers." For each dataset, Table 2 lists n, p, the percentage of majority points, and the percentage of contamination. The algorithm always used 500 starts and the default value of h = [(n + p + 1)/2].

The results of FAST-MCD are given in the next column, under "robust." Here "yes" means that the correct result is obtained, that is, corresponding to the first distribution [as in Fig. 4(b)], whereas "no" stands for the nonrobust result, in which the estimates describe the entire dataset [as in Fig. 4(a)]. Table 2 lists the data situations with the highest percentage of outlying observations still yielding the clean result with FAST-MCD, as was suggested by a referee. That is, the table says which percentage of outliers the algorithm can handle for given n and p. Increasing the number of starts only slightly improves this percentage. The computation times were quite low for the given values of n and p. Even for a sample size as high as 50,000, a few minutes suffice, whereas no previous algorithm we know of could handle such large datasets.

The currently most well-known algorithm for approximating the MCD estimator is the feasible subset algorithm (FSA) of Hawkins (1994). Instead of C-steps, it uses a different kind of steps, which for convenience we will baptize "I-steps," where the I stands for "interchanging points." An I-step proceeds as follows. Given the h-subset H_old with its average T_old and its covariance matrix S_old:

* repeat for each i ∈ H_old and each j ∉ H_old:
  * put H_{i,j} = (H_old \ {i}) ∪ {j} (i.e., remove point i and add point j);
  * compute Δ_{i,j} = det(S_old) - det(S(H_{i,j}));
* keep the i' and j' with largest Δ_{i',j'};
* if Δ_{i',j'} ≤ 0, put H_new = H_old and stop; if Δ_{i',j'} > 0, put H_new = H_{i',j'}.

An I-step takes O(h(n - h)) = O(n²) time because all pairs (i, j) are considered. If we would compute each S(H_{i,j}) from scratch, the complexity would even become O(n³), but Hawkins (1994, p. 203) used an update formula for det(S(H_{i,j})).
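To make the contrast between the two kinds of steps concrete, here is a schematic Python version of one C-step and one I-step (our own minimal implementation; the real FAST-MCD and FSA add the refinements and update formulas discussed in the text):

```python
import numpy as np

def mcd_objective(x, idx):
    """Determinant of the covariance matrix of the observations in idx."""
    return np.linalg.det(np.cov(x[idx], rowvar=False))

def c_step(x, h_old):
    """C-step: keep the h points closest to (T_old, S_old); by Theorem 1
    this cannot increase the determinant."""
    sub = x[h_old]
    t, s = sub.mean(axis=0), np.cov(sub, rowvar=False)
    diff = x - t
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(s), diff)  # n distances
    return np.sort(np.argsort(d2)[:len(h_old)])

def i_step(x, h_old):
    """I-step (FSA): best single swap of an inside point with an outside
    point; all h(n - h) pairs are tried, and here each determinant is
    recomputed from scratch (Hawkins uses an update formula instead)."""
    h_old = list(h_old)
    outside = [j for j in range(len(x)) if j not in h_old]
    best, best_det = h_old, mcd_objective(x, h_old)
    for i in h_old:
        for j in outside:
            cand = [k for k in h_old if k != i] + [j]
            d = mcd_objective(x, cand)
            if d < best_det:
                best, best_det = sorted(cand), d
    return best
```

Iterating `c_step` until the h-subset no longer changes produces the nonincreasing sequence of determinants guaranteed by Theorem 1.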

Table 2. Performance of the FAST-MCD and FSA Algorithms on Larger Datasets, With Time in Seconds

                                           FAST-MCD              FSA
      n    p  % Np(0,Ip)  % Np(μ,Ip)   Robust    Time     Robust     Time
    100    2      51          49         yes        2      yes         50
    100    5      53          47         yes        5      no          80
    100   10      63          37         yes       40      no         110
    100   20      77          23         yes       70      no         350
    500    2      51          49         yes        7      no       2,800
    500    5      51          49         yes       25      no       3,800
    500   10      64          36         yes       84      no       4,100
    500   30      77          23         yes      695      no       8,300
  1,000    2      51          49         yes        8      no      20,000
  1,000    5      51          49         yes       20      -           -
  1,000   10      60          40         yes       75      -           -
  1,000   30      76          24         yes      600      -           -
 10,000    2      51          49         yes        9      -           -
 10,000    5      51          49         yes       25      -           -
 10,000   10      63          37         yes       85      -           -
 10,000   30      76          24         yes      700      -           -
 50,000    2      51          49         yes       15      -           -
 50,000    5      58          42         yes      140      -           -
 50,000   10      75          25         yes      890      -           -
 50,000   30      51          49         yes       -       -           -
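The "shift outliers" model behind Table 2 is straightforward to reproduce (a sketch; the function name, seed, and generator are our own choices):

```python
import numpy as np

def shift_outlier_data(n, p, pct_clean, b=10.0, seed=0):
    """pct_clean% of the points from N_p(0, I_p), the rest from
    N_p(mu, I_p) with mu = (b, ..., b)'."""
    rng = np.random.default_rng(seed)
    n_clean = int(round(n * pct_clean / 100))
    clean = rng.normal(size=(n_clean, p))
    shifted = rng.normal(size=(n - n_clean, p)) + b
    return np.vstack([clean, shifted])

x = shift_outlier_data(100, 2, 51)   # the first row of Table 2: 51% vs. 49%
```

Feeding such datasets to an MCD routine and checking whether the fitted location stays near the origin reproduces the "robust" column of the table.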


The I-step can be iterated: If det(S_new) < det(S_old), we can take another I-step with H_new; otherwise, we stop. The resulting sequence det(S_1) ≥ det(S_2) ≥ ... must converge after a finite number of steps; that is, det(S_m) = 0 or det(S_m) = det(S_{m-1}), so det(S_m) can no longer be reduced by an I-step. This is again a necessary (but not sufficient) condition for (T_m, S_m) to be the global minimum of the MCD objective function. In our terminology, Hawkins's FSA algorithm can be written as follows:

* repeat many times:
  * draw an initial h-subset H_1 at random;
  * carry out I-steps until convergence, yielding H_m;
* keep the H_m with lowest det(S_m);
* report this set H_m as well as (T_m, S_m).

In Tables 1 and 2 we have applied the FSA algorithm to the same datasets as FAST-MCD, using the same number of starts. For the small datasets in Table 1, the FSA and FAST-MCD yielded identical results. This is no longer true in Table 2, where the FSA begins to find nonrobust solutions. This is because of the following:

1. The FSA starts from randomly drawn h-subsets H_1. Hence, for sufficiently large n all of the FSA starts are nonrobust, and subsequent iterations do not get away from the corresponding local minimum. We saw the same effect in Section 3.1, which also explained why it is better to start from random (p + 1)-subsets as in MINVOL and in FAST-MCD.

The tables also indicate that the FSA needs more time than FAST-MCD. In fact, time(FSA)/time(FAST-MCD) increases from 1 to 14 for n going from 12 to 75. In Table 2, the timing ratio goes from 25 (for n = 100) to 2,500 (for n = 1,000), after which we could no longer time the FSA algorithm. The FSA algorithm is more time-consuming than FAST-MCD because of the following:

2. An I-step takes O(n²) time, compared to O(n) for the C-step of FAST-MCD.

3. Each I-step swaps only one point of H_old with one point outside H_old. In contrast, each C-step swaps h - |H_old ∩ H_new| points inside H_old with the same number outside of H_old. Therefore, more I-steps are needed, especially for increasing n.

4. The FSA iterates I-steps until convergence, starting from each H_1. On the other hand, FAST-MCD reduces the number of C-steps by the selective iteration technique of Section 3.2. The latter would not work for I-steps because of 3.

5. The FSA carries out all its I-steps in the full dataset of size n, even for large n. In the same situation, FAST-MCD applies the nested extensions method of Section 3.3, so most C-steps are carried out for n_sub = 300, some for n_merged = 1,500, and only a few for the actual n.

While this article was under review, Hawkins and Olive (1999) proposed an improved version of the FSA algorithm, as described at the end of our Section 2. To avoid confusion, we would like to clarify that the timings in Tables 1 and 2 were made with the original FSA algorithm described by Hawkins (1994), whereas the new version of FSA is substantially faster (although it retains the same computational complexity as the original FSA, due to 2 in the preceding list). In conclusion, we personally prefer the FAST-MCD algorithm because it is both robust and fast, even for large n.

7. APPLICATIONS

Let us now look at some applications to compare the FAST-MCD results with the classical mean and covariance matrix. At the same time we will illustrate a new tool, the distance-distance plot.

Example 1. We start with the coho salmon dataset (see Nickelson 1986) with n = 22 and p = 2, as shown in Figure 8(a). Each data point corresponds to one year. For 22 years the production of coho salmon in the wild was measured, in the Oregon Production Area. The x-coordinate is the logarithm of millions of smolts, and the y-coordinate is the logarithm of millions of adult coho salmon. We see that in most years the production of smolts lies between 2.2 and 2.4 on a logarithmic scale, whereas the production of adults lies between -1.0 and .0. The MCD tolerance ellipse excludes the years with a lower smolts production, thereby marking them as outliers.

Figure 8. Coho Salmon Data: (a) Scatterplot With 97.5% Tolerance Ellipses Describing the MCD and the Classical Method; (b) Distance-Distance Plot.
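A 97.5% tolerance ellipse such as those in Figure 8(a) is the set of points x with d_(T,S)(x) equal to sqrt(χ²_{2,.975}); its boundary can be traced through a Cholesky factor of S (a sketch, with our own parameterization):

```python
import numpy as np
from scipy.stats import chi2

def tolerance_ellipse(t, s, level=0.975, num=200):
    """Boundary points of the tolerance ellipse of (t, s) for p = 2."""
    r = np.sqrt(chi2.ppf(level, 2))
    angles = np.linspace(0.0, 2.0 * np.pi, num)
    circle = np.column_stack([np.cos(angles), np.sin(angles)])
    chol = np.linalg.cholesky(s)          # s = chol @ chol.T
    return t + r * circle @ chol.T        # image of the circle of radius r
```

Plugging in the classical (mean, covariance) pair and the MCD pair yields the two ellipses of Figure 8(a).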

In contrast, the classical tolerance ellipse nearly contains the whole dataset and thus does not detect the existence of far outliers.

Let us now introduce the distance-distance plot (D-D plot), which plots the robust distances (based on the MCD) versus the classical Mahalanobis distances. On both axes in Figure 8(b) we have indicated the cutoff value $\sqrt{\chi^2_{p,.975}}$ (here p = 2, yielding $\sqrt{\chi^2_{2,.975}} = 2.72$). If the data were not contaminated (say, if all the data would come from a single bivariate normal distribution), then all points in the D-D plot would lie near the dotted line. In this example many points lie in the rectangle where both distances are regular, whereas the outlying points lie higher. This happened because the MCD ellipse and the classical ellipse have a different orientation. Naturally, the D-D plot becomes more useful in higher dimensions, where it is not so easy to visualize the dataset and the ellipsoids.

Problem 1 (continued). Next we consider Problem 1 in Section 1. The Philips data represent 677 measurements of metal sheets with nine components each, and the Mahalanobis distances in Figure 1 indicated no groups of outliers. The MCD-based robust distances RD(x_i) in Figure 9(a) tell a different story. We now see a strongly deviating group of outliers, ranging from index 491 to index 565. Something happened in the production process that was not visible from the classical distances shown in Figure 1. Figure 9(a) also shows a remarkable change after the first 100 measurements. These phenomena were investigated and interpreted by the engineers at Philips. Note that the D-D plot in Figure 9(b) again contrasts the classical and robust analysis. In Figure 9, (a) and (b), one can in fact see three groups: the first 100 points, those with index 491 to 565, and the majority.

Figure 9. Philips Data: (a) Plot of Robust Distances; (b) Distance-Distance Plot.

Problem 2 (continued). We now apply FAST-MCD to the same n = 132,402 celestial objects with p = 6 variables as in Figure 3, which took only 2.5 minutes. (In fact, running the program on the same objects in all 27 dimensions took only 18 minutes!) Figure 10(a) plots the resulting MCD-based robust distances. In contrast to the homogeneous-looking Mahalanobis distances in Figure 3, the robust distances in Figure 10(a) clearly show that there is a majority with RD(x_i) ≤ $\sqrt{\chi^2_{6,.975}}$ as well as a second group with RD(x_i) between 8 and 16. By exchanging our findings with the astronomers at the California Institute of Technology, we learned that the lower group consists mainly of stars and the upper group mainly of galaxies. Our main point is that the robust distances separate the data in two parts and thus provide more information than the Mahalanobis distances.

Figure 10. Digitized Palomar Data: (a) Plot of Robust Distances of Celestial Objects; (b) Their Distance-Distance Plot.
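In code, the two distance sets behind a D-D plot come from the same formula applied with the two pairs of estimates; the points of interest are those flagged by the robust distances but masked in the classical analysis (a sketch with our own function names; the robust pair would come from FAST-MCD):

```python
import numpy as np
from scipy.stats import chi2

def distances(x, t, s):
    """Mahalanobis-type distances d_(t,s)(i) for all observations."""
    diff = x - t
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(s), diff))

def masked_outliers(x, t_mcd, s_mcd, level=0.975):
    """True for points above the horizontal cutoff but left of the
    vertical cutoff in the D-D plot: outliers that the classical
    Mahalanobis distances fail to reveal."""
    cutoff = np.sqrt(chi2.ppf(level, x.shape[1]))
    md = distances(x, x.mean(axis=0), np.cov(x, rowvar=False))
    rd = distances(x, t_mcd, s_mcd)
    return (rd > cutoff) & (md <= cutoff)
```

With heavy contamination the classical estimates inflate so much that the shifted points get small Mahalanobis distances, which is exactly the masking effect the D-D plot exposes.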

That robust distances and Mahalanobis distances behave differently is illustrated in Figure 10(b), where we see the stars near the diagonal line and the galaxies above it.

Of course our analysis of these data was much more extensive and also used other data-analytic techniques not described here, but the ability to compute robust estimates of location and scatter for such large datasets was a key tool. Based on our work, these astronomers are thinking about modifying their classification of objects into stars and galaxies, especially for the faint light sources that are difficult to classify.

Example 2. We end this section by combining robust location/scatter with robust regression. The fire data (Andrews and Herzberg 1985) reported the incidences of fires in 47 residential areas of Chicago. One wants to explain the incidence of fire by the age of the houses, the income of the families living in them, and the incidence of theft. For this we apply the least trimmed squares (LTS) method of robust regression, with the usual value of h = [(3/4)n] = 35. In S-PLUS 4.5, the function "ltsreg" now automatically calls the function "cov.mcd," which runs the FAST-MCD algorithm, to obtain robust distances in x-space based on the MCD with the same h. Moreover, S-PLUS automatically provides the diagnostic plot of Rousseeuw and van Zomeren (1990), which plots the robust residuals versus the robust distances. For the fire data, this yields Figure 11, which shows the presence of one vertical outlier, that is, an observation with a small robust distance and a large LTS residual. We also see two bad leverage points, that is, observations (x, y) with outlying x and such that (x, y) does not follow the linear trend of the majority of the data. The other observations with robust distances to the right of the vertical cutoff line are good leverage points, because they have small LTS residuals and hence follow the same linear pattern as the main group. In Figure 11 we see that most of these points are merely boundary cases, except for the two leverage points that are really far out in x-space.

Figure 11. Diagnostic Plot of the Fire Dataset.

8. CONCLUSIONS

The algorithm FAST-MCD proposed in this article is specifically tailored to the properties of the MCD estimator. The basic ideas are the C-step (Theorem 1 in Sec. 2), the procedure for generating initial estimates (Sec. 3.1), selective iteration techniques (Sec. 3.2), and nested extensions (Sec. 3.3). By exploiting the special structure of the problem, the new algorithm is faster and more effective than general-purpose techniques such as reducing the objective function by successively interchanging points. Simulations have shown that FAST-MCD is able to deal with large datasets while outrunning existing algorithms for MVE and MCD by orders of magnitude. Another advantage of FAST-MCD is its ability to detect exact fit situations.

Due to the FAST-MCD algorithm, the MCD becomes accessible as a routine tool for analyzing multivariate data. Without extra cost we also obtain the D-D plot, a new data display that plots the MCD-based robust distances versus the classical Mahalanobis distances. This is a useful tool to explore structure(s) in the data. Other possibilities include an MCD-based PCA and robustified versions of other multivariate analysis methods.

ACKNOWLEDGMENTS

We thank Doug Hawkins and José Agulló for making their programs available to us. We also dedicate special thanks to Gertjan Otten, Frans Van Dommelen, and Herman Veraa for giving us access to the Philips data, and to S. C. Odewahn and his research group at the California Institute of Technology for allowing us to analyze their digitized Palomar data. We are grateful to the referees and Technometrics editors Max Morris and Karen Kafadar for helpful comments improving the presentation.

APPENDIX: PROOF OF THEOREM 1

Proof. Assume that det(S_2) > 0; otherwise the result is already satisfied. We can thus compute d_2(i) = d_{(T_2,S_2)}(i) for all i = 1, ..., n. Using |H_2| = h and the definition of (T_2, S_2), we find

$$\frac{1}{hp}\sum_{i\in H_2} d_2^2(i) = \frac{1}{hp}\,\mathrm{tr}\!\left(S_2^{-1}\sum_{i\in H_2}(x_i - T_2)(x_i - T_2)'\right) = \frac{h}{hp}\,\mathrm{tr}\!\left(S_2^{-1}S_2\right) = \frac{1}{p}\,\mathrm{tr}(I_p) = 1. \tag{A.1}$$

Moreover, put

$$\lambda := \frac{1}{hp}\sum_{i\in H_2} d_1^2(i) = \frac{1}{hp}\sum_{i=1}^{h}\left(d_1^2\right)_{i:n} \le \frac{1}{hp}\sum_{j\in H_1} d_1^2(j) = 1, \tag{A.2}$$

where λ > 0, because otherwise det(S_2) would be 0.

Combining (A.1) and (A.2) yields

$$\frac{1}{hp}\sum_{i\in H_2} d^2_{(T_1,\lambda S_1)}(i) = \frac{1}{\lambda}\cdot\frac{1}{hp}\sum_{i\in H_2}(x_i - T_1)'\,S_1^{-1}\,(x_i - T_1) = \frac{\lambda}{\lambda} = 1.$$

Grübel (1988) proved that (T_2, S_2) is the unique minimizer of det(S) among all (T, S) for which $(1/hp)\sum_{i\in H_2} d^2_{(T,S)}(i) = 1$. This implies that det(S_2) ≤ det(λS_1). On the other hand, it follows from the inequality (A.2) that det(λS_1) = λ^p det(S_1) ≤ det(S_1); hence

$$\det(S_2) \le \det(\lambda S_1) \le \det(S_1). \tag{A.3}$$

Moreover, note that det(S_2) = det(S_1) if and only if both inequalities in (A.3) are equalities. For the first, we know from Grübel's result that det(S_2) = det(λS_1) if and only if (T_2, S_2) = (T_1, λS_1). For the second, det(λS_1) = det(S_1) if and only if λ = 1, that is, λS_1 = S_1. Combining both yields (T_2, S_2) = (T_1, S_1).
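The identity (A.1) holds for any h-subset and is easy to verify numerically (a sketch; note that S_2 must use the 1/h normalization of the proof, not the 1/(h - 1) of `np.cov`):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(60, 4))               # n = 60, p = 4
sub = x[:31]                               # some h-subset H_2 with h = 31
t2 = sub.mean(axis=0)
diff = sub - t2
s2 = diff.T @ diff / len(sub)              # scatter matrix with divisor h
d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(s2), diff)
avg = d2.sum() / (len(sub) * x.shape[1])   # (1/hp) * sum of d_2^2(i)
# identity (A.1): avg equals 1 exactly, up to rounding
```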
[Received December 1997. Revised March 1999.]

REFERENCES
Agulló, J. (1996), "Exact Iterative Computation of the Multivariate Minimum Volume Ellipsoid Estimator With a Branch and Bound Algorithm," in Proceedings in Computational Statistics, ed. A. Prat, Heidelberg: Physica-Verlag, pp. 175-180.
Andrews, D. F., and Herzberg, A. M. (1985), Data, New York: Springer-Verlag.
Butler, R. W., Davies, P. L., and Jhun, M. (1993), "Asymptotics for the Minimum Covariance Determinant Estimator," The Annals of Statistics, 21, 1385-1400.
Coakley, C. W., and Hettmansperger, T. P. (1993), "A Bounded Influence, High Breakdown, Efficient Regression Estimator," Journal of the American Statistical Association, 88, 872-880.
Cook, R. D., Hawkins, D. M., and Weisberg, S. (1992), "Exact Iterative Computation of the Robust Multivariate Minimum Volume Ellipsoid Estimator," Statistics and Probability Letters, 16, 213-218.
Croux, C., and Haesbroeck, G. (in press), "Influence Function and Efficiency of the Minimum Covariance Determinant Scatter Matrix Estimator," Journal of Multivariate Analysis.
Davies, L. (1992), "The Asymptotics of Rousseeuw's Minimum Volume Ellipsoid Estimator," The Annals of Statistics, 20, 1828-1843.
Donoho, D. L. (1982), "Breakdown Properties of Multivariate Location Estimators," unpublished Ph.D. qualifying paper, Harvard University, Dept. of Statistics.
Grübel, R. (1988), "A Minimal Characterization of the Covariance Matrix," Metrika, 35, 49-52.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.
Hawkins, D. M. (1994), "The Feasible Solution Algorithm for the Minimum Covariance Determinant Estimator in Multivariate Data," Computational Statistics and Data Analysis, 17, 197-210.
Hawkins, D. M., and McLachlan, G. J. (1997), "High-Breakdown Linear Discriminant Analysis," Journal of the American Statistical Association, 92, 136-143.
Hawkins, D. M., and Olive, D. J. (1999), "Improved Feasible Solution Algorithms for High Breakdown Estimation," Computational Statistics and Data Analysis, 30, 1-11.
Lopuhaä, H. P., and Rousseeuw, P. J. (1991), "Breakdown Points of Affine Equivariant Estimators of Multivariate Location and Covariance Matrices," The Annals of Statistics, 19, 229-248.
Maronna, R. A. (1976), "Robust M-estimators of Multivariate Location and Scatter," The Annals of Statistics, 4, 51-67.
Meer, P., Mintz, D., Rosenfeld, A., and Kim, D. (1991), "Robust Regression Methods in Computer Vision: A Review," International Journal of Computer Vision, 6, 59-70.
Nickelson, T. E. (1986), "Influence of Upwelling, Ocean Temperature, and Smolt Abundance on Marine Survival of Coho Salmon (Oncorhynchus kisutch) in the Oregon Production Area," Canadian Journal of Fisheries and Aquatic Sciences, 43, 527-535.
Odewahn, S. C., Djorgovski, S. G., Brunner, R. J., and Gal, R. (1998), "Data From the Digitized Palomar Sky Survey," technical report, California Institute of Technology.
Rocke, D. M., and Woodruff, D. L. (1996), "Identification of Outliers in Multivariate Data," Journal of the American Statistical Association, 91, 1047-1061.
Rousseeuw, P. J. (1984), "Least Median of Squares Regression," Journal of the American Statistical Association, 79, 871-880.
Rousseeuw, P. J. (1985), "Multivariate Estimation With High Breakdown Point," in Mathematical Statistics and Applications, Vol. B, eds. W. Grossmann, G. Pflug, I. Vincze, and W. Wertz, Dordrecht: Reidel, pp. 283-297.
Rousseeuw, P. J. (1997), "Introduction to Positive-Breakdown Methods," in Handbook of Statistics, Vol. 15: Robust Inference, eds. G. S. Maddala and C. R. Rao, Amsterdam: Elsevier, pp. 101-121.
Rousseeuw, P. J., and Leroy, A. M. (1987), Robust Regression and Outlier Detection, New York: Wiley.
Rousseeuw, P. J., and van Zomeren, B. C. (1990), "Unmasking Multivariate Outliers and Leverage Points," Journal of the American Statistical Association, 85, 633-639.
Simpson, D. G., Ruppert, D., and Carroll, R. J. (1992), "On One-Step GM-estimates and Stability of Inferences in Linear Regression," Journal of the American Statistical Association, 87, 439-450.
Woodruff, D. L., and Rocke, D. M. (1993), "Heuristic Search Algorithms for the Minimum Volume Ellipsoid," Journal of Computational and Graphical Statistics, 2, 69-95.
Woodruff, D. L., and Rocke, D. M. (1994), "Computable Robust Estimation of Multivariate Location and Shape in High Dimension Using Compound Estimators," Journal of the American Statistical Association, 89, 888-896.
