Escolar Documentos
Profissional Documentos
Cultura Documentos
We proposed a robust method of estimating covariance matrix in multivariate data set. The goal is to
compare the proposed method with the most widely used robust methods (Minimum Volume Ellipsoid
and Minimum Covariance Determinant) and the classical method (MLE) in detection of outliers at
different levels and magnitude of outliers. The proposed robust method competes favourably well with
both MVE and MCD and performed better than any of the two methods in detection of single or fewer
outliers especially for small sample size and when the magnitude of outliers is relatively small.
Key words: Covariance matrix, minimum volume ellipsoid (MVE), minimum covariance determinant (MCD),
mahalanobis distance, optimality criteria.
INTRODUCTION
Let X = {x1 , x 2 , x3 ,−,−,−, x m } be a set of m points Covariance matrix plays a prominent role in multivariate
data analysis. It measures the spread of the individual
in ℜ , where xi , i = 1, 2, 3, , , m, are independent and
p
variables as well as the level of inter-relationship (inter-
corelation) that may exist between pairs of the variables
identically distributed multivariate normal N P ( µ , Σ) . The
in the multivariate data. The statistic is used in many
usual Maximum Likelihood Estimate (MLE) method of multivariate techniques as a measure of spread and inter-
estimating µ and are the sample mean vector and correlation among variables. Such techniques include
sample covariance matrix x and S, respectively which is multivariate regression model, Principal Component and
given as; Factor Analyses, Canonical Correlation and Linear
Discriminant and even in Multivariate Statistical Process
2
For instance, in Multivariate Control chart like Hotelling T covariance matrix of a particular halfset. For more
chart, where the presence of multiple outliers can affect detailed discussion on MVE see Daves (1987), Lopuhaa
the estimation of the covariance matrix, using Maximum and Rousseeuw (1991), Titterington (1975) and (Agullo,
Likelihood Estimation (MLE) method to the extent that all 1996).
the outliers will go undetected by the chart (Vargas
2003). The same effect can be experienced in other
analyses like Principal Component and Factor analysis Minimum covariance determinant (MCD)
and other multivariate techniques. An alternative high breakdown estimation procedure to
Outliers can heavily influence the estimation of the the MVE is an estimator based on the Minimum
covariance matrix and subsequently the parameters or Covariance Determinant (MCD), which was first proposed
statistics that are needed to be derived from it. Hence, a by Rousseeuw (1984). It is obtained by finding the halfset
robust estimate of the covariance matrix that will not be of multivariate data points that gives the minimum value
affected by outliers is required to obtain valid and reliable of the determinant of the covariance matrix. The resulting
results (Hubert and Engelen, 2007). There have been estimator of location is the sample mean vector of the
many robust methods of estimating the covariance matrix points that is the halfset and the estimator of the
of a multivariate data. Such methods include Minimum dispersion is the sample covariance matrix of the points
Volume Ellipsoid (MVE), Minimum Covariance multiplied by an appropriate constant to ensure
Determinant (MCD), S-Estimator, M-Estimator and consistency just as was done for MVE.
Orthogonalized Gnanadesikan-Kettering (OGK) methods. The MCD estimators are intuitively appealing because
This paper gives a brief overview of the most widely a small value of the determinant corresponds to near
used methods (Minimum Volume Ellipsoid and Minimum linear dependencies of the data in the p-dimensional
Covariance Determinant) and also introduces our robust space that is because a small determinant corresponds
estimation method. We compared our robust method with to a small Eigenvalue which suggests a near linear
the two methods based on both simulated and real life dependency that suggests that there is a group of points
multivariate data in detection of outliers as bases for that are similar to each other (Jensen et al., 2002).
comparison.
Let p < m/2, let X = {x1 , x 2 , x3 ,−,−,−, x p } be a set of
(Titterington, 1975). of {xi }im=1 of size k = p+1 that will satisfy some optimality
The MVE for the data set {x } m criteria. Therefore, we sample without replacement a
i i =1 must go through at
least p+1 and at most h support vectors. Thus, the MVE sample of size k from m, this will give C pm+1 possible
estimates of the location and dispersion do not correspond
to the sample mean vector and sample variance-
subsets of size p+1, x j1, x j 2 ,−,−,−, x jp +1 , {x }
j
p +1
j =1
. If each
Oyeyemi and Ipinyomi 003
subset is denoted by J j , j = 1, 2, , , , C pm+1 . For each J j , 2, 3, 4, , , , h-p-1) distances are selected and their
corresponding sample units (points) are used to compute
we estimate the variance-covariance matrix, CJ;
1
the next x* and S* as follows; x* = xi* and
p+ j
CJ =
1
(x j − x j )(x j − x j )T 1
i =1
Table 1. The signal probability when there is one outlier. Table 4. The signal probability when there are 7 outliers.
Methods Methods
NCP NCP
Classical MVE MVD Proposed Classical MVE MVD Proposed
5 0.1100 0.0450 0.0550 0.0750 5 0.0100 0.0165 0.0170 0.0210
10 0.4100 0.1300 0.1400 0.2500 10 0.0120 0.0450 0.4800 0.0470
15 0.5500 0.3000 0.3100 0.4200
15 0.0070 0.0900 0.0840 0.0840
20 0.7200 0.4200 0.4000 0.5100
20 0.0100 0.1560 0.147 0.1300
25 0.8100 0.5300 0.5500 0.6750
30 0.9400 0.7600 0.7300 0.8250 25 0.0100 0.1980 0.2000 0.1820
30 0.0030 0.2610 0.2340 0.1960
Table 5. Data set and Hotelling´s-T2 statistic using the Classical, MCD, MVE and the
Proposed Robust Method (PRM) when there is only one (1) outlier.
Table 6. Data set and Hotelling´s-T2 statistic using the Classical, MCD, MVE and the
Proposed Robust Method (PRM) when there are three (3) outliers.
Table 7. Data set and Hotelling´s-T2 statistic using the Classical, MCD, MVE and the Proposed
Robust Method (PRM) when there five (5) outliers
were modified to (0.880, 65.230) and (0.980, 66.080), are shown in Table 7. From the table, the two robust
2
respectively. The resulting Hotelling´s-T statistics for the methods, MCD and MVE, identified all the outliers (obser-
four methods together with the data are presented in vations 14, 18, 24 and 28) except the second observation
Table 6. The corresponding multivariate control charts as outliers. The classical chart identified only two points,
are as shown in Figures 2a through 2d for Classical, observations 24 and 28 as outlying points, while the pro-
MVE, MCD and PRM, respectively. From Table 2, while posed robust chart identified all the outlying points as
the Classical, MCD and MVE control charts indicated two outliers. Figures 4b and 4c are MVE and MCD control
points as outliers, the proposed robust method (PRM) charts respectively showing the four observations above
chart signals all the three observations (points 2, 14 and the upper control limits. Figures 4a and 4d are the control
24) as outliers. The classical chart indicated observations charts for Classical and PRM showing two and four
2 and 24 as outlying points while both MCD and MVE points above the upper control limit respectively.
indicated observations 14 and 24 as outliers. Figures 3a- Finally, the number of outlying points was further
3d gave clearer details. increased to 7 with the modification of observations 8 and
The number of outliers was increased to 5 by modifying 20 to (0.400, 50.550) and (0.715, 62.455) respectively.
observations 18 and 28 to (0.350, 53.180) and (0.410, Table 8 gave the multivariate statistics for all the four
50.470), respectively in addition to the three existing control charts. The Classical control chart’s performance,
outlying points in the data set. The multivariate statistics as shown in Figure 5a, was so poor that it can only
obtained from the four methods together with the data set identified only two observations as outlying points out of
008 Afr. J. Math. Comput. Sci. Res.
Table 8. Data set and Hotelling´s-T2 statistic using the Classical, MCD, MVE and the Proposed
Robust Method (PRM) when there are seven (7) outliers
seven outliers in the data. Both MVE and MCD control the number of outliers in a multivariate data set.
charts given in Figures 5b and 5c, performed better by The proposed robust method of estimating the
identifying all the outlying observations except the second variance-covariance matrix of multivariate data combines
observation as outliers. The PRM chart performed as the efficiencies of both classical and existing robust
such by identifying all the seven outliers in the data set methods (MVE and MCD) of estimation. The classical
except the twentieth observation. method of estimation is most efficient in multivariate ana-
lysis when there is no or only a single outlier while on the
other hand the existing robust methods (MVE and MCD)
SUMMARY AND CONCLUSION are more efficient in the presence of multiple outliers in a
multivariate data set.
The proposed robust method empirically compete favour- The proposed robust method (PRM) performed better
ably well with the most widely used robust methods (MVE and more efficient in the two extreme cases outlined
and MCD) in detecting outliers in the presence of multiple above. While existing robust methods are less efficient
outliers. The proposed method is better in detecting where there is no or only one outlier, the proposed robust
outliers than the MVE and MCD, especially when there method is better. Likewise when there are multiple
are fewer or single outliers in the data set. As a result of outliers, the classical method becomes less efficient while
the above points, the proposed method is highly recom- the proposed robust method was found to be efficient.
mended when there is no information as regards the Generally, since the information on whether a multiva-
Oyeyemi and Ipinyomi 009
1.0 0.4
0.8
0.3
S ig n a l p ro b a b ility
S ig n a l p ro b a b ility
0.6
MVE MVE
0.1
0.2 MCD MCD
Proposed Proposed
0.0 0.0
0 5 10 15 20 25 30 0 5 10 15 20 25 30
0.7
0.3
0.6
0.5
S ig n a l p r o b a b ilit y
0.2
S ig n a l p ro b a b ility
0.4
Control Chart Control Chart
0.3
Classical
Classical
0.1
0.2 MVE MVE
MCD MCD
0.1
Proposed Proposed
0.0 0.0
0 5 10 15 20 25 30 0 5 10 15 20 25 30
Figure 1b. Signal prob. when there are 3 outliers. Figure 1d. Signal prob. when there are 7 outliers.
010 Afr. J. Math. Comput. Sci. Res.
14
12
Classical chart
8
UCL
6
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
70
60
Hotelling T- Square statistic
50
40
MVE chart
30
20 UCL
10
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
40
Fig. 4.5c: MCD chart when there is one outlier.
Hotelling T- Square statistic
30
MCD chart
20
UCL
10
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
Figure 2c. MCD chart when there is one outlier.
80 Proposed chart
60
Proposed chart
40
20
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
16
14
12
10
Hotelling T- Square statistic
8 Classical chart
UCL
6
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
Figure 3a. Classical chart when there are 3 outliers.
300
Hotelling T- Square statistic
200
MVE chart
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
200
100
MCD chart
UCL
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
140
120
Hotelling´s T-square statistic
100
Proposed chart
80
UCL for PRM
60
40
20
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
1
167
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
20
Classical chart
10
UCL
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
300
Hotelling´s T-square statistic
200
MVE chart
UCL
100
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
300
200
MCD chart
UCL
100
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
300
Hotelling´s T-square statistic
200
Proposed chart
100 UCL
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
Figure 4d. PRM chart when there are 5 outliers.
016 Afr. J. Math. Comput. Sci. Res.
16
14
Hotelling T- Square statistic
12
10
8 Classical chart
6 UCL
0
1
2
3
4
5
6
7
8
9
1
101
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
300
Hotellings´ T-square statistic
200
MVE chart
100
UCL
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
300
200
MCD chart
100
UCL
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
300
Hotellings´ T-Square statistic
200
UCL
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Observation
Figure 5d. PRM chart when there are 7 outliers.
riate data set contains outliers or not and even the num- It is highly recommended to use the proposed robust
ber of outliers, may not be at the disposal of the analyst. method in estimating the variance-covariance since it will
018 Afr. J. Math. Comput. Sci. Res.
combine both efficiencies of both classical and other Rocke DM, Woodruff DL (1998). Robust Estimation of Multivariate
Location and Shape. J. Stat. Plan. Inference 57: 245 – 255.
robust methods in the presence or otherwise of multiple
Rousseeuw PJ (1984). Least Median of Squares Regression. J. Amer.
outliers. Extension of further research to analytical Stat. Assoc. 79: 871-880.
approach has also been opened. Rousseeuw PJ, Van Zomeren BC (1990). Unmasking Multivariate
Outliers and Leverage Points. J. Am. Stat. Assoc. 85: 633-639.
Rousseeuw PJ, Van Zomeren BC (1991). Unmasking Multivariate
REFERENCES Outliers and Leverage Points. Rejoinder- J. Am. Stat. Assoc. 85: 648-
651.
Agullo J (1996). Exact Iterative Computation of the Multivariate Titterington DM (1975). Optimal Design; Some Geometrical Aspect of
Minimum Volume Ellipsoid with a Branch and Bound Algorithm. In D-Optimality. Biometrika 62(3): 313 – 320.
Proceedings in Computational Statistics, pp.175-180. Vargas JA (2003). Robust Estimation in Multivariate Control Charts for
Daves PL (1987). Asymptotic Behavior of S-Estimates of Multivariate Individual Observations. J. Qual. Technol. 35(4): 367-376.
Location Parameters and Dispersion Matrices. Ann. Stat. 15: 1269-
1292.
Hubert M, Engelen S (2007). Fast Cross-validation of High-breakdown
Re-sampling Methods for PCA. Comput. Stat. Data Anal. 51(10):
5013 – 5024.
Jensen WA, Birch JB, Woodall WH (2002). High Breakdown Estimation
Methods for Phase I Multivariate Control Charts.
www.stat.vt.edu/newsletter/annual_report_statistics_2008.pdf.
Lopuhaa HP, Rousseeuw PJ (1991). Breakdown Point of Affine
Equivariant Estimators of Location and Variance Matrices. Ann. Stat.
27: 125 – 138.
Vandev DL, Neykov NM (2000). Robust Maximum Likelihood in the
Gaussian Case. http://www.fmi.uni-sofia.bg/fmi/statist/Personal/
Vandev/papers/ascona_1992.pdf.
Quesenberry CP (2001). The Multivariate Short-Run Snapshot Q Chart.
Qual. Eng. 13: 679 – 683.