
A FRAMEWORK FOR UNSUPERVISED OUTLIER ENSEMBLE DETECTION WITH APPLICATIONS TO THE NIGERIAN FOOTBALL LEAGUE
Uduak Akpan1, Mary I. Akinyemi2
Department of Mathematics/Statistics, University of Calabar, CRS, Nigeria1
Department of Mathematics, University of Lagos, Lagos, Nigeria2

Abstract
Outlier analysis is an exciting aspect of science: finding something totally new, unique and unexpected can lead to a significant scientific discovery or make one's career. This research work addresses the problem of detecting these unusual instances (outliers) in data, which are either erroneous or represent special/unique cases in the dataset and can therefore be interesting for gaining new insights into the observed domain.
In professional sports, especially in Nigeria, the decision to recruit players/athletes into the national team is made purely by instinct. Instinct is important, but it may not be enough to make good decisions consistently. We demonstrate that outlier detection analysis can supplement instinct with evidence rooted in data, by recognizing players that stand out or have exceptional skills. Hence, an unsupervised ensemble-based outlier detection method is constructed by unifying outputs from three outlier detection methods, namely the local outlier factor (LOF), the angle-based outlier degree (ABOD) and the subspace outlier degree (SOD), via regularization and Gaussian scaling. We also present a heuristic framework for prediction and quantitative performance evaluation of the ensemble. The ensemble is applied to Nigerian football players' performance statistics data; the detected outlier instances were qualitatively evaluated by a sports analyst, confirming the usefulness of the proposed framework in identifying even unexpected instances as well as unusual special cases.

Keywords: Outliers, outlier detection, ensemble techniques, random forests, unsupervised prediction.

1 INTRODUCTION
In many data analysis tasks, a large number of variables are recorded or sampled. One of the first steps towards obtaining a coherent analysis is the detection of outlying observations. Although outliers are often treated as error or noise, they may carry important information. Outliers can arise from several causes: a measurement may be incorrectly observed or recorded, or the observed datum may come from a different population than the normal situation and thus be correctly measured yet represent a rare or special event. Outlier detection methods aim to automatically identify such valuable or disturbing observations in collections of data.
The oldest methods for outlier detection are rooted in probabilistic and statistical models,
and date back to the nineteenth century (Edgeworth, 1887). The most basic form of outlier
detection is extreme univariate analysis. In such cases, it is desirable to determine data
values at the tails of univariate distributions, along with a corresponding level of statistical
significance. In general, outlier detection techniques can be classified into three main categories, namely supervised, unsupervised and semi-supervised techniques, based on whether or not the response variable is labelled. These categories can be further subdivided into statistical approaches, proximity-based approaches, clustering-based approaches, classification-based approaches, and ensemble-based approaches.
Statistical outlier detection techniques are based on the assumption that inlier data in-
stances occur in high probability regions of a stochastic model, while outliers occur in the
low probability regions of the stochastic model. Examples include the Grubbs test (Grubbs, 1969; Anscombe & Guttman, 1960) and kernel density estimation (Desforges, Jacob, & Cooper, 1998). Classification-based outlier detection techniques operate in a two-phase fashion.
training phase learns a classifier using the available labeled training data. The testing phase
classifies a test instance as an outlier or inlier using the classifier. Any base classification
method can be used provided it is able to output some indication of its confidence in the
predictions (Aggarwal, 2013). Proximity-based approaches assume that the proximity of an
outlier object to its nearest neighbours significantly deviates from the proximity of the object
to most of the other objects in the data set; examples include the LoOP (Local Outlier Probability) outlier detection method (Kriegel, Kroger, Schubert, & Zimek, 2009) and distance-based outlier detection (Aggarwal & Yu, 2001; Breunig, Kriegel, Ng, & Sander, 2000; Zhang, Hutter, & Jin, n.d.). Clustering-based approaches detect outliers by examining the relationship
between objects and clusters. Intuitively, an outlier is an object that belongs to a small and
remote cluster, or does not belong to any cluster. Some examples of this approach include
SNN clustering (Ertoz, Steinbach, & Kumar, 2003) and K-means clustering (Smith, Bivens, Embrechts, Palagiri, & Szymanski, 2002), among others.

1.1 OUTLIER ENSEMBLES


The main idea behind the ensemble methodology is to weigh several individual data analysis
techniques, and combine them in order to obtain a technique that results in significant
improvement from the base methods (Polikar, 2009). The history of ensemble methods
dates back to as early as 1977 with Tukey's twicing (Tukey, 1977). Work on ensembles for outlier detection exists but is often scattered in the literature and, in comparison to other problems such as classification, is not as well formalized. Outlier ensemble methods can be categorized into sequential ensembles, independent ensembles, model-centered ensembles and data-centered ensembles.
In sequential ensembles, one or more outlier detection methods are applied sequentially
to either all or portions of the data. Thus, depending upon the approach, either the data set
or the method may be changed in sequential executions (Aggarwal, 2013). A classic example
of this is applied for cluster-based outlier analysis (for constructing more robust clusters in
later stages) (Barbara, Li, Couto, Lin, & Jajodia, 2003). In independent ensembles, different
instantiations of the method or different portions of the data are used for outlier analysis.
Alternatively, the same method may be applied, but with a different initialization, parameter set, or random seed in the case of randomized algorithms. The results can be combined in order to obtain a more robust outlier score. For example, the methods in Lazarevic & Kumar (2005) and Liu, Ting, & Zhou (2008) sample subspaces from the
underlying data in order to determine outliers from each of these executions independently.
In data-centered ensembles, different parts, samples or functions of the data are explored in
order to perform the analysis. The core idea is that each part of the data provides a specific
kind of insight, and by using an ensemble over different portions of the data, it is possible
to obtain different insights. One of the earliest data-centered ensembles was discussed in
Lazarevic & Kumar (2005). Model-centered ensembles attempt to combine the outlier scores from different models built on the same data set. The major challenge of this approach is
that the scores from different models are often not directly comparable to one another. For
example, the outlier score from a k-nearest neighbour approach is very different from the
outlier score provided by an angle-based detection model. This causes issues in combining
the scores from these different outlier models. Therefore, it is critical to be able to convert
the different outlier scores into normalized values which are directly comparable, and also
preferably interpretable, such as a probability (Zimek, Campello, & Sander, 2013). The
broad concept of decision trees can also be extended to outlier analysis by examining those

3
paths with unusually short length, since the outlier regions tend to get isolated rather quickly.
An ensemble of such trees is referred to as an isolation forest (Liu et al., 2008) and has been
used effectively for making robust predictions about outliers.

2 METHODS
2.1 Framework for Outlier Ensemble Detection
The goal is to develop a framework that enables the detection of outlying data instances, evaluates the performance of this model in collaboration with a domain expert, and builds a predictor for the classes. The design of this framework is as follows (a minimal code sketch of the scoring steps is given after the list):

1. Apply baseline outlier detection methods to the data; each returns a set of suspicious instances with their corresponding outlier scores.

2. Build the ensemble by transforming, unifying and combining the outlier scores from the different methods.

3. Present the results from the ensemble to one or more domain experts; the domain expert inspects the detected outlier instances and decides whether they are interesting outliers that lead to new insights in domain understanding, erroneous instances that should be removed from the data, false alarms (regular instances), and/or instances with minor corrected errors to be reintroduced into the dataset.

4. Label the data using the results from step (3) and build a classifier; the performance of the ensemble is assessed by its ability to improve the performance of the classifier compared to classifiers built using the individual outlier methods that constitute the ensemble.

5. Use the model built for future predictions.
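As a rough illustration of the scoring steps (1) and (2), the following Python sketch shows how per-method outlier scores could be unified and averaged into a single ensemble score. The score_fn, regularize_fn and unify arguments are hypothetical stand-ins for the base detectors of Sections 2.3 to 2.5 and the unification of Section 2.6; this is a sketch of the framework, not the exact implementation used in the paper.

```python
import numpy as np

def ensemble_scores(X, detectors, unify):
    """detectors: list of (score_fn, regularize_fn) pairs, one per base method
    (e.g. LOF, ABOD, SOD); unify: the Gaussian scaling of Section 2.6.
    Returns one combined outlier score per data instance."""
    per_method = [unify(regularize_fn(score_fn(X)))   # steps (1) and (2): score, regularize, scale
                  for score_fn, regularize_fn in detectors]
    return np.mean(per_method, axis=0)                # combine by averaging
```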

2.2 Random Forests


Random forests (Breiman, 2001) is a non-parametric method for classification/regression. A
Random forest consists of a collection or ensemble of simple decision trees, each capable of
producing a response when presented with a set of predictor values. A random forest grows each tree on an independent bootstrap sample from the training data. When building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors, and typically we choose m ≈ √p. Next, the best split on the selected m variables is found and the trees are grown to maximum
depth. At each bootstrap iteration, the predicted class of an observation is calculated by a
majority vote of the data not in the bootstrap sample (what Breiman calls out-of-bag or
OOB observations) using the tree grown with the bootstrap sample.
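For illustration, a minimal sketch of fitting a random forest with out-of-bag evaluation using scikit-learn (assumed available) is shown below; the training data here are synthetic placeholders rather than the football data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the labelled player data (placeholders for illustration only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 7))                           # 7 performance variables
y_train = (X_train[:, 0] + X_train[:, 3] > 1.5).astype(int)   # pretend outlier/inlier labels

rf = RandomForestClassifier(
    n_estimators=500,        # number of bootstrap trees
    max_features="sqrt",     # m ~ sqrt(p) candidate predictors per split
    oob_score=True,          # evaluate by majority vote on out-of-bag observations
    random_state=0,
)
rf.fit(X_train, y_train)
print("OOB accuracy:", rf.oob_score_)
```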

2.3 Local Outlier Factor (LOF)


Breunig et al. (2000) proposed this density-based approach to finding outliers. For each point
p in the given dataset, they evaluated its local outlier factor (LOF); a point whose LOF
is large is declared an outlier. The method is as follows: for a given data point X, let D_k(X) be the distance from X to its k-th nearest neighbour (the distance measure being the Euclidean distance), and let L_k(X) be the set of points within the k-nearest-neighbour distance of X. Then, the reachability distance R_k(X, Y) of object X with respect to Y is defined as

$$ R_k(X, Y) = \max\{\operatorname{dist}(X, Y),\ D_k(Y)\} \tag{2.3.1} $$

The average reachability distance ARk (X) of data point X with respect to its neighbour-
hood Lk (X) is defined as:

$$ AR_k(X) = \operatorname{MEAN}_{Y \in L_k(X)}\, R_k(X, Y) \tag{2.3.2} $$

The Local Outlier Factor is then simply:

$$ LOF_k(X) = \operatorname{MEAN}_{Y \in L_k(X)}\, \frac{AR_k(X)}{AR_k(Y)} \tag{2.3.3} $$
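A direct, if naive, NumPy implementation of Equations (2.3.1) to (2.3.3) might look as follows. This is only a sketch for small datasets; optimized implementations (e.g. scikit-learn's LocalOutlierFactor) would normally be preferred in practice.

```python
import numpy as np

def lof_scores(X, k=10):
    """Direct implementation of Eqs. (2.3.1)-(2.3.3): reachability distance,
    average reachability distance, and the local outlier factor."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)                                  # exclude each point from its own neighbourhood
    knn = np.argsort(D, axis=1)[:, :k]                           # indices of the k nearest neighbours L_k(.)
    k_dist = D[np.arange(n), knn[:, -1]]                         # D_k(.): distance to the k-th neighbour

    # AR_k(X): mean reachability distance to the neighbourhood L_k(X), Eq. (2.3.2)
    AR = np.array([np.mean(np.maximum(D[i, knn[i]], k_dist[knn[i]])) for i in range(n)])

    # LOF_k(X): mean ratio AR_k(X) / AR_k(Y) over neighbours Y, Eq. (2.3.3)
    return np.array([np.mean(AR[i] / AR[knn[i]]) for i in range(n)])
```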

2.4 Angle-based Outlier Degree (ABOD)


Kriegel, Schubert, & Zimek, 2008, proposed this method. The idea in angle-based methods
is that data points at the boundary of the data are likely to enclose the entire data within a
smaller angle, whereas points in the interior are likely to have data around them at different angles. Consider three data points X, Y and Z. Then, the angle between the vectors Y − X and Z − X will not vary much for different values of Y and Z when X is an outlier.
Furthermore, the angle is inversely weighted by the distance between the points (Aggarwal,
2013). The corresponding angle (weighted cosine) is defined as follows:

$$ WCos(Y - X,\ Z - X) = \frac{\langle (Y - X),\ (Z - X) \rangle}{\lVert Y - X \rVert_2^2\, \lVert Z - X \rVert_2^2} \tag{2.4.1} $$
where $\lVert \cdot \rVert_2$ represents the L2-norm and $\langle \cdot, \cdot \rangle$ represents the scalar product. Then the angle-based outlier degree (ABOD) of the data point X is defined as follows:

$$ ABOD(X) = \operatorname{Var}_{Y, Z}\, WCos(Y - X,\ Z - X) \tag{2.4.2} $$

Data points which are outliers will have a smaller spectrum of angles and will therefore
have lower values of the angle-based outlier degree.
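The sketch below illustrates Equations (2.4.1) and (2.4.2). The pair subsampling controlled by sample_frac is one simple way to approximate the speed-up mentioned in Section 4.1 and is an assumption of this sketch, not necessarily the exact procedure used in the original experiments.

```python
import numpy as np
from itertools import combinations

def abod_scores(X, sample_frac=1.0, rng=None):
    """Eqs. (2.4.1)-(2.4.2): variance of distance-weighted cosines over pairs (Y, Z)."""
    rng = rng or np.random.default_rng(0)
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        others = np.delete(np.arange(n), i)
        pairs = list(combinations(others, 2))            # all pairs (Y, Z) not involving X
        if sample_frac < 1.0:                            # optionally subsample pairs for speed
            idx = rng.choice(len(pairs), max(1, int(sample_frac * len(pairs))), replace=False)
            pairs = [pairs[j] for j in idx]
        wcos = []
        for y, z in pairs:
            a, b = X[y] - X[i], X[z] - X[i]
            denom = (a @ a) * (b @ b)                    # squared L2 norms, Eq. (2.4.1)
            if denom > 0:
                wcos.append((a @ b) / denom)
        scores[i] = np.var(wcos)                         # ABOD(X) = Var[WCos], Eq. (2.4.2)
    return scores
```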

2.5 Subspace Outlier Degree (SOD)


This is a method for finding outliers in lower-dimensional projections of the data, proposed by Kriegel, Schubert, & Zimek (2009). In this approach, a local analysis is provided specific to
each data point. For each data point X, a set of reference points S(X) is determined, which represents the proximity of the current data point being examined. Once this reference set
S(X) has been determined, the relevant subspace for S(X) is determined as the set Q(X) of
dimensions in which the variance is small. The Euclidean distance of X is computed to the
mean of the reference set S(X) in the subspace defined by Q(X). This is denoted by G(X).
The subspace outlier degree SOD(X) of a data point is defined by normalizing this distance
G(X) by the number of dimensions in Q(X) (Aggarwal, 2013).

$$ SOD(X) = \frac{G(X)}{\lvert Q(X) \rvert} \tag{2.5.1} $$
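A simplified sketch of Equation (2.5.1) is given below. It approximates the reference set S(X) by the plain k nearest neighbours, whereas the original method of Kriegel, Schubert, & Zimek (2009) builds S(X) from shared-nearest-neighbour similarity; the fallback to the full space when Q(X) would be empty is likewise an assumption of this sketch.

```python
import numpy as np

def sod_scores(X, k=20, alpha=0.8):
    """Simplified sketch of Eq. (2.5.1): distance to the reference-set mean in the
    low-variance subspace Q(X), normalized by the number of dimensions in Q(X)."""
    n, p = X.shape
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    scores = np.empty(n)
    for i in range(n):
        S = X[np.argsort(D[i])[:k]]                      # reference set S(X), approximated by kNN
        var = S.var(axis=0)
        Q = var < alpha * var.mean()                     # keep dimensions with small variance
        if not Q.any():                                  # fall back to the full space if Q is empty
            Q[:] = True
        G = np.linalg.norm((X[i] - S.mean(axis=0))[Q])   # distance to mean of S(X) within Q(X)
        scores[i] = G / Q.sum()                          # SOD(X) = G(X) / |Q(X)|
    return scores
```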

2.6 SCORE UNIFICATION


The score unification method used in this work was originally proposed by Kriegel, Schubert, and Zimek (2011), with slight modifications to accommodate ABOD scores that are equal to zero. The unification method consists of two steps: regularization and scaling.
Regularization transforms the original outlier scores to the interval [0, ∞); the smaller the transformed score, the more likely the instance is to be an inlier.
For LOF the transformation is done as follows: let the expected inlier value be E (for LOF, E = 1; Breunig et al., 2000) and the outlier score for an object o be S(o); then the regularized score is given by

$$ Reg_S(o) = \max\{0,\ S(o) - E\} \tag{2.6.1} $$

The modified regularized ABOD score is formulated as follows: logarithmic inversion is used, since ABOD scores have low contrast and the logarithmic inversion enhances the contrast between inliers and outliers. Let the outlier score for an object o be S(o) and the maximum possible (or observed) score be S_max; then the regularized score is given by:

$$ Reg_S(o) = -\log\left(\frac{S(o) + \epsilon}{S_{max} + \epsilon}\right) \tag{2.6.2} $$
where ε satisfies 0 < ε ≪ 1; in this work ε was set to 1e-10.
For SOD the scores are already in the range [0, ∞), so no transformation is needed.

Scaling is then performed to transform the scores to the range [0, 1]. Given the mean μ_S and the standard deviation σ_S of the set of regularized scores S(o), this method uses the normal CDF and the Gaussian error function erf(·) to transform the outlier score into a probability value:

$$ Norm_S(o) = \max\left\{0,\ \operatorname{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right\} \tag{2.6.3} $$

where erf(x) = 2Φ(x√2) − 1 and Φ denotes the standard normal CDF. The scores from the different methods are then combined using the averaging function (mean).
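The unification steps of Equations (2.6.1) to (2.6.3) and the final averaging can be sketched as follows; the raw score vectors lof_raw, abod_raw and sod_raw below are toy placeholders standing in for the outputs of the three base detectors.

```python
import numpy as np
from scipy.special import erf

def regularize_lof(s, expected=1.0):
    """Eq. (2.6.1): baseline regularization with expected inlier value E (E = 1 for LOF)."""
    return np.maximum(0.0, s - expected)

def regularize_abod(s, eps=1e-10):
    """Eq. (2.6.2): logarithmic inversion for ABOD; eps guards against zero scores."""
    return -np.log((s + eps) / (s.max() + eps))

def gaussian_scale(s):
    """Eq. (2.6.3): Gaussian scaling of regularized scores into [0, 1]."""
    return np.maximum(0.0, erf((s - s.mean()) / (s.std() * np.sqrt(2))))

# Toy raw scores standing in for the outputs of LOF, ABOD and SOD.
lof_raw = np.array([1.0, 1.1, 3.5, 1.2])
abod_raw = np.array([0.9, 0.8, 0.001, 0.7])
sod_raw = np.array([0.1, 0.2, 1.4, 0.15])

# SOD scores are already in [0, inf), so only Gaussian scaling is applied before averaging.
combined = np.mean([gaussian_scale(regularize_lof(lof_raw)),
                    gaussian_scale(regularize_abod(abod_raw)),
                    gaussian_scale(sod_raw)], axis=0)
print(combined)
```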

2.7 EVALUATION MEASURES


We used standard classification performance metrics, namely sensitivity, specificity, accuracy, and the area under the Receiver Operating Characteristic (ROC) curve (AUC), to evaluate performance.
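For completeness, a small sketch of how these metrics can be computed with scikit-learn is shown below, using placeholder predictions rather than the actual classifier output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score

# Placeholder labels and predictions for illustration; in the paper these come from
# the random forest fitted on the expert-labelled data.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.2, 0.8, 0.6, 0.4, 0.9, 0.2])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                  # true positive rate
specificity = tn / (tn + fp)                  # true negative rate
accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)
print(sensitivity, specificity, accuracy, auc)
```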

3 DATA
The Nigerian football players' statistics data from 1997 to July 2015 used for this research were retrieved from the online database of the Association of Football Statisticians (AFS) (AFS,
2015). The instances are records of 314 players consisting of the following variables: name of
player, number of appearances, number of substitutions, number of goals scored, number of
penalties scored, position of the player on the field (goalkeeper, defender, midfielder, forward)
and the number of red and yellow cards received.

                                     Goalkeepers   Forwards   Midfielders   Defenders
Count                                         17        100           100          97
Max. number of appearances/player             98         60            67          95
Max. number of goals scored/player             0         21            12           5
No. of players who scored goals                0         62            34          18
No. of players who scored penalties            0         10             8           0

Table 1: Summaries by position on the field

Table 1 shows separate summaries for all positions on the field. There are 17 goalkeepers, 97 defenders, 100 midfielders and 100 forwards. The player with the highest number of appearances is a goalkeeper. It was further observed that forwards scored the most goals, with individual players scoring up to 21 goals, and that neither goalkeepers nor defenders scored any penalties.

Figure 1: Distribution of yellow cards by position on the field.
Figure 2: Distribution of red cards by position on the field.

Furthermore, it was observed that defenders received the most yellow and red cards, which is to be expected because a defender is an outfield player whose primary role is to prevent the opposing team from scoring. Goalkeepers received the lowest number of yellow cards and received no red cards (see Figures 1 and 2).

4 Results
4.1 OUTLIER DETECTION ANALYSIS VIA THE ENSEMBLE
The methods LOF, ABOD and SOD were first applied to the data, and the ENSEMBLE was then built by combining the outlier scores from the individual methods using the approaches described in Section 2. The parameter settings for LOF (k = 30 to 50) and SOD (α = 0.8) were set as suggested by the respective authors, and for ABOD the fraction of the data used when calculating the angle variance was set to 0.5 due to computational time constraints. In order to declare points as outliers, a threshold equal to 0.1 is used.
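A minimal sketch of this thresholding step, with placeholder scores and player names, is:

```python
import numpy as np

# 'combined' stands for the unified ensemble score per player (see the Section 2.6 sketch);
# the 0.1 threshold follows the setting above. Names here are placeholders.
combined = np.array([0.02, 0.85, 0.07, 0.40, 0.01])
names = np.array(["player_A", "player_B", "player_C", "player_D", "player_E"])
outlier_mask = combined > 0.1
print(names[outlier_mask])        # players flagged for expert review
```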

4.2 EXPERT INTERPRETATION OF RESULTS OBTAINED
BY THE ENSEMBLE
A sports analyst/journalist was shown the outliers detected by the ensemble to confirm the meaningfulness of the outliers found. In the qualitative evaluation he used all the available retrospective data for these players to identify why they were detected as special cases. The analyst agreed with about 98% of the results from the ENSEMBLE; in particular, he identified two players as false positives. Hence, these two were not labelled as outliers in the subsequent analysis.

4.3 CLASS LABELLING AND CLASSIFICATION


Using the outliers identified, a new data frame is created that includes a new variable named Label, indicating whether a data instance is an outlier or an inlier. A random forest is applied to perform the classification on the updated data frame. Optimization is performed to achieve the best performance measures on the out-of-bag samples. The results of the analysis are shown in Table 2.

Method       AUC (%)   Sensitivity (%)   Specificity (%)   Accuracy (%)
LOF             97.4              70.8              98.6           91.5
ABOD           100               100                97.3           97.9
SOD             97.3              85                96             93.6
ISO             99.9             100                98.6           98.9
ENSEMBLE        99.9              90.9             100             97.9

Table 2: Performance results on the football data

5 Summary, Empirical Findings and Further Work


5.1 Summary
The main aim of this work was to develop a framework for unsupervised outlier ensemble
detection and a heuristic approach for prediction and quantitative performance evaluation
of the ensemble. The unsupervised ensemble-based outlier detection technique was applied to real-life data, thus affirming the practical applicability of the developed framework.

5.2 Empirical Findings
The experimental results presented in this paper show that outlier ensemble analysis can identify meaningful outliers in data, which can contribute greatly to the process of selecting players in professional sports. The value of outlier ensembles can clearly be seen from the results. Results from the analysis of the Nigerian football players' statistics data showed that ISO performed best, with very high values on all performance measures. The ENSEMBLE did not perform badly either: it made correct predictions 97.9% of the time.

5.3 Future Work


There are two directions for future work. The first one is on how to describe or explain
why the identified outliers are exceptional and which of the variables make them so. This
is particularly important for high-dimensional datasets, because an outlier may be outlying
only on some, but not on all, dimensions. Secondly, the development of an approach to user-guided outlier detection, where the user can choose which methods to include in the
outlier ensemble construction stage from a collection of both supervised and unsupervised
methods.

References
AFS. (2015). Nigeria national football team statistics and records: all-time record. Associ-
ation of Football Statisticians, http://www.11v11.com/about-the-association-of
-football-statisticians-afs-/.
Aggarwal, C. (2013). Outlier analysis. Springer.
Aggarwal, C., & Yu, P. (2001). Outlier detection for high dimensional data. In Proceedings
of the ACM SIGMOD Conference on Management of Data (p. 37-46).
Anscombe, F., & Guttman, I. (1960). Rejection of outliers. Technometrics, 2 (2), 123-147.
Barbara, D., Li, Y., Couto, J., Lin, J.-L., & Jajodia, S. (2003). Bootstrapping a data mining
intrusion detection system. Symposium on Applied Computing.
Breiman, L. (2001). Random forests. Machine Learning, 45 (1), 5-32.
Breunig, M., Kriegel, H.-P., Ng, R., & Sander, J. (2000). LOF: Identifying density-based
local outliers. In Proceedings of the ACM SIGMOD Conference on Management of
Data (p. 93-104). Dallas, TX.
Desforges, M., Jacob, P., & Cooper, J. (1998). Applications of probability density estimation
to the detection of abnormal conditions in engineering. In Proceedings of Institute of
Mechanical Engineers (p. 687-703).

Edgeworth, F. Y. (1887). On discordant observations. Philosophical Magazine, 25 (5),
364-375.
Ertoz, L., Steinbach, M., & Kumar, V. (2003). Finding topics in collections of documents:
A shared nearest neighbour approach. In Clustering and Information Retrieval (p. 83-
104).
Grubbs, F. (1969). Procedures for detecting outlying observations in samples. Technomet-
rics, 11 (1), 1-21.
Kriegel, H.-P., Kroger, P., Schubert, E., & Zimek, A. (2009). LoOP: Local outlier proba-
bilities. In Proceedings of the 18th ACM Conference on Information and Knowledge
Management (CIKM) (p. 1649-1652). Hong Kong, China.
Kriegel, H.-P., Schubert, E., & Zimek, A. (2008). ABOD: Angle-based outlier detection in
high-dimensional data. In Proceedings of the 14th ACM International Conference on
Knowledge Discovery and Data Mining (SIGKDD) (p. 444-452). Las Vegas, NV.
Kriegel, H.-P., Schubert, E., & Zimek, A. (2009). Outlier Detection in Axis-Parallel Sub-
spaces of High Dimensional Data. In Proceedings of the 13th PAKDD Conference.
Bangkok, Thailand.
Kriegel, H.-P., Schubert, E., & Zimek, A. (2011). Interpreting and unifying outlier scores.
In Proceedings of the 11th SIAM International Conference on Data Mining (SDM)
(p. 13-24). Mesa, AZ.
Lazarevic, A., & Kumar, V. (2005). Feature bagging for outlier detection. In ACM KDD
Conference.
Liu, F., Ting, K., & Zhou, Z.-H. (2008). Isolation forest. In ICDM Conference.
Polikar, R. (2009). Ensemble learning. Scholarpedia, 4 , 2776.
Smith, R., Bivens, A., Embrechts, M., Palagiri, C., & Szymanski, B. (2002). Clustering
approaches for anomaly based intrusion detection. In Proceedings of Intelligent Engi-
neering Systems through Artificial Neural Networks (p. 579-584). ASME.
Tukey, J. (1977). Exploratory data analysis. Addison-Wesley.
Zhang, K., Hutter, M., & Jin, H. (n.d.). A new local distance based outlier detection approach
for scattered real world data. In Proceedings of the 13th PAKDD conference.
Zimek, A., Campello, R. J. G. B., & Sander, J. (2013). Ensembles for unsupervised outlier
detection: Challenges and research questions. SIGKDD Explorations, 15(1).
