Abstract
Outlier analysis is an exciting aspect of science - finding something totally new, unique and
unexpected can lead to a significant scientific discovery or make one's career. This research
work addresses the problem of detecting unusual instances (outliers) in data that are
either erroneous or present special/unique cases in the dataset, which can be interesting
for gaining new insights into the observed domain.
In professional sports, especially in Nigeria, the decision to recruit players/athletes into
the national team is made purely by instinct. Instinct is important, but it may not be
enough to make good decisions consistently. We demonstrate that outlier detection analysis
can supplement instinct with evidence rooted in data - by recognizing players that stand out
or have exceptional skills. Hence, an unsupervised ensemble-based outlier detection method
is constructed by unifying outputs from three (3) outlier detection methods, local outlier
factor (LOF), angle-based outlier degree (ABOD) and subspace outlier degree (SOD), via
regularization and Gaussian scaling. We also present a heuristic framework for prediction
and quantitative performance evaluation of the ensemble. The ensemble is applied to
Nigerian football players' performance statistics data; the detected outlier instances were
qualitatively evaluated by a sports analyst, confirming the usefulness of the proposed framework
in identifying even unexpected instances as well as unusual special cases.
1 INTRODUCTION
In many data analysis tasks, a large number of variables are recorded or sampled.
One of the first steps towards obtaining a coherent analysis is the detection of outlying
observations. Although outliers are often considered to be errors or noise, they may carry
important information. Outliers can be due to several causes: a measurement may be
incorrectly observed or recorded, or the observed datum may come from a different population
than the normal one and is thus correctly measured but represents a rare or
special event. Outlier detection methods aim to automatically identify those valuable or
disturbing observations in collections of data.
The oldest methods for outlier detection are rooted in probabilistic and statistical models,
and date back to the nineteenth century (Edgeworth, 1887). The most basic form of outlier
detection is extreme univariate analysis. In such cases, it is desirable to determine data
values at the tails of univariate distributions, along with a corresponding level of statistical
significance. In general, outlier detection techniques can be classified into three main cate-
gories, namely supervised, unsupervised and semi-supervised techniques based on whether
or not the response variable is labelled. These categories can be further subdivided into sta-
tistical approaches, proximity-based approaches, clustering-based approaches, classification
approaches, or ensemble-based approaches.
Statistical outlier detection techniques are based on the assumption that inlier data in-
stances occur in high probability regions of a stochastic model, while outliers occur in the
low probability regions of the stochastic model. Examples include the Grubbs test (Grubbs,
1969; Anscombe & Guttman, 1960) and kernel density estimation (Desforges, Jacob, &
Cooper, 1998). Classification-based outlier detection techniques operate in a two-phase fashion.
training phase learns a classifier using the available labeled training data. The testing phase
classifies a test instance as an outlier or inlier using the classifier. Any base classification
method can be used provided it is able to output some indication of its confidence in the
predictions (Aggarwal, 2013). Proximity-based approaches assume that the proximity of an
outlier object to its nearest neighbours significantly deviates from the proximity of the object
to most of the other objects in the data set; examples include the Local Outlier Probability
(LoOP) method (Kriegel, Kroger, Schubert, & Zimek, 2009) and distance-based
outlier detection (Aggarwal & Yu, 2001; Breunig, Kriegel, Ng, & Sander, 2000; Zhang,
Hutter, & Jin, n.d.). Clustering-based approaches detect outliers by examining the relationship
between objects and clusters. Intuitively, an outlier is an object that belongs to a small and
remote cluster, or does not belong to any cluster. Some examples of this approach include
SNN clustering (Ertoz, Steinbach, & Kumar, 2003) and K-means clustering (Smith, Bivens,
Embrechts, Palagiri, & Szymanski, 2002).
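As a concrete illustration of the statistical, density-based family, the following sketch (a hypothetical example, not taken from this study) scores points by their estimated density and flags the lowest-density point as the outlier candidate:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# One-dimensional sample: 200 inliers plus one injected extreme value.
data = np.concatenate([rng.normal(0.0, 1.0, 200), [8.0]])

kde = gaussian_kde(data)      # kernel density estimate of the sample
density = kde(data)           # low estimated density => outlier candidate
suspect = data[np.argmin(density)]
print(suspect)
```

The injected point lies far in the tail, so its estimated density is the lowest in the sample.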
Isolation trees flag data points that terminate paths of unusually short length, since the
outlier regions tend to get isolated rather quickly. An ensemble of such trees is referred to
as an isolation forest (Liu et al., 2008) and has been used effectively for making robust
predictions about outliers.
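For illustration, scikit-learn's `IsolationForest` (applied here to synthetic data, not the study's dataset) flags such quickly-isolated points:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Two-dimensional cluster of inliers plus one far-away point.
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), [[6.0, 6.0]]])

iso = IsolationForest(n_estimators=100, random_state=42).fit(X)
labels = iso.predict(X)       # -1 marks isolated (outlier) points, +1 inliers
print(labels[-1])
```

The far-away point takes unusually short paths to isolate and is therefore labelled -1.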
2 METHODS
2.1 Framework for Outlier Ensemble Detection
The goal is to develop a framework that enables the detection of outlying data instances,
evaluates the performance of this model in collaboration with a domain expert and
builds a predictor for the classes. The design of this framework is as follows:
1. Apply the baseline outlier detection methods to the data; each returns a set of suspicious
instances with their corresponding outlier scores.
2. Build the ensemble by transforming, unifying and combining the outlier scores from the
different methods.
3. Present the results from the ensemble to a domain expert (or experts); the expert inspects
the detected outlier instances and decides whether they are interesting outliers which lead
to new insights in domain understanding, erroneous instances which should be removed
from the data, false alarms (regular instances), or instances with minor errors to be
corrected and reintroduced into the dataset.
4. Label the data using the results from step (3) and build a classifier to assess the performance
of the ensemble by its ability to improve the performance of the classifier compared to
classifiers built using the individual outlier methods that constitute the ensemble.
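The steps above can be sketched as follows. Since ABOD and SOD are not available in scikit-learn, LOF and an isolation forest serve here as stand-in baseline detectors, and min-max scaling stands in for the unification step; all names are illustrative, not the study's code:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (150, 3)), [[5.0, 5.0, 5.0]]])

# Step 1: baseline detectors, each returning raw outlier scores
# (negated so that larger always means "more outlying").
lof_scores = -LocalOutlierFactor(n_neighbors=30).fit(X).negative_outlier_factor_
iso_scores = -IsolationForest(random_state=1).fit(X).score_samples(X)

def to_unit(s):
    # Step 2: bring the score ranges onto a common [0, 1] scale before
    # combining (a simple stand-in for regularization/Gaussian scaling).
    return (s - s.min()) / (s.max() - s.min())

ensemble = (to_unit(lof_scores) + to_unit(iso_scores)) / 2.0

# Step 3: the top-ranked instances are handed to the domain expert.
suspects = np.argsort(ensemble)[::-1][:5]
print(150 in suspects)
```

The injected instance (index 150) ranks at the top of the combined score, i.e. it would be presented to the expert for inspection.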
depth. At each bootstrap iteration, the predicted class of an observation is calculated by a
majority vote of the data not in the bootstrap sample (what Breiman calls out-of-bag or
OOB observations) using the tree grown with the bootstrap sample.
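A minimal illustration of Breiman's out-of-bag estimate with scikit-learn (synthetic data, illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
# Accuracy estimated from the OOB majority votes; each observation is
# predicted only by trees whose bootstrap sample did not contain it.
print(rf.oob_score_)
```

No held-out test set is needed: the OOB score is an internal estimate of generalization accuracy.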
The average reachability distance $AR_k(X)$ of a data point $X$ with respect to its neighbourhood
$L_k(X)$ is the mean of the reachability distances from $X$ to the members of $L_k(X)$.
The local outlier factor of $X$ is then defined as:
\[
\mathrm{LOF}_k(X) \;=\; \underset{Y \in L_k(X)}{\mathrm{MEAN}}\; \frac{AR_k(X)}{AR_k(Y)} \tag{2.3.3}
\]
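The definition above translates into a direct NumPy sketch (a naive O(n²) implementation for illustration, not the study's code):

```python
import numpy as np

def lof(X, k):
    # Pairwise distances; column 0 of the argsort is the point itself.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]          # L_k(X): k nearest neighbours
    kdist = D[np.arange(len(X)), nn[:, -1]]         # k-distance of every point
    # Reachability distance of X w.r.t. neighbour Y: max(k-dist(Y), d(X, Y)).
    reach = np.maximum(kdist[nn], D[np.arange(len(X))[:, None], nn])
    ar = reach.mean(axis=1)                         # average reachability AR_k
    # LOF: mean over the neighbours of AR_k(X) / AR_k(Y).
    return (ar[:, None] / ar[nn]).mean(axis=1)

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), [[10.0, 10.0]]])
scores = lof(X, 5)
print(scores[-1] > 2.0)   # the far-away point has a LOF well above 1
```

Inliers receive scores near 1, while the injected point's average reachability distance far exceeds that of its neighbours.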
\[
\mathrm{WCos}(Y-X,\, Z-X) \;=\; \frac{\langle\, Y-X,\; Z-X \,\rangle}{\lVert Y-X \rVert_2^2 \;\lVert Z-X \rVert_2^2} \tag{2.4.1}
\]
where $\lVert \cdot \rVert_2$ represents the $L_2$-norm and $\langle \cdot,\cdot \rangle$ represents the scalar product. Then the angle-based
outlier degree (ABOD) of the data point $X$ is defined as follows:
\[
\mathrm{ABOD}(X) \;=\; \mathrm{Var}_{Y,Z}\, \mathrm{WCos}(Y-X,\, Z-X) \tag{2.4.2}
\]
Data points which are outliers will have a smaller spectrum of angles and will therefore
have lower values of the angle-based outlier degree.
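These definitions can be implemented naively in NumPy as follows (an O(n³) illustration on a small sample, not the study's code):

```python
import numpy as np

def abod(X):
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        diffs = np.delete(X, i, axis=0) - X[i]     # all vectors Y - X, Z - X
        wcos = []
        for j in range(len(diffs)):
            for l in range(j + 1, len(diffs)):
                y, z = diffs[j], diffs[l]
                # Weighted cosine: scalar product over the product
                # of the squared L2-norms.
                wcos.append((y @ z) / ((y @ y) * (z @ z)))
        scores[i] = np.var(wcos)                   # variance of the weighted cosines
    return scores

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (25, 2)), [[10.0, 10.0]]])
print(np.argmin(abod(X)))   # the injected point has the smallest angle variance
```

For the injected point, all difference vectors point towards the cluster and the distance weights are small, so its spectrum of angles is narrow and its score minimal.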
\[
\mathrm{SOD}(X) \;=\; \frac{G(X)}{\lvert Q(X) \rvert} \tag{2.5.1}
\]
\[
\mathrm{Reg}_S(o) \;=\; -\log\left( \frac{S(o) + \epsilon}{S_{\max} + \epsilon} \right) \tag{2.6.2}
\]
where $0 < \epsilon \ll 1$, which in this work was set to 1e-10.
For SOD the scores are already in the range $[0, \infty)$, so no transformation is needed.
Scaling is then performed to transform the scores to the range $[0, 1]$. Given the mean $\mu_S$
and the standard deviation $\sigma_S$ of the set of regularized values of the outlier score $S(o)$,
this method uses the normal CDF and the Gaussian error function $\mathrm{erf}(\cdot)$ to transform
the outlier score into a probability value:
\[
\mathrm{Norm}_S(o) \;=\; \max\left\{ 0,\; \mathrm{erf}\left( \frac{S(o) - \mu_S}{\sigma_S \sqrt{2}} \right) \right\} \tag{2.6.3}
\]
where $\mathrm{erf}(x) = 2\Phi(x\sqrt{2}) - 1$ and $\Phi$ is the standard normal CDF. Then, the scores
from the different methods are combined using the averaging function (mean).
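The regularization and Gaussian scaling steps translate directly into NumPy/SciPy; the function names below are illustrative:

```python
import numpy as np
from scipy.special import erf

def regularize(scores, eps=1e-10):
    # Inverse logarithmic regularization: raw scores close to zero
    # (strong outliers under ABOD) get large regularized values.
    s = np.asarray(scores, dtype=float)
    return -np.log((s + eps) / (s.max() + eps))

def gaussian_scale(scores):
    # Gaussian scaling of the regularized scores into [0, 1]
    # via the error function.
    s = np.asarray(scores, dtype=float)
    return np.maximum(0.0, erf((s - s.mean()) / (s.std() * np.sqrt(2.0))))

raw = np.array([0.002, 0.9, 1.1, 1.0, 0.95])       # e.g. ABOD-style scores
unified = gaussian_scale(regularize(raw))
print(unified.min() >= 0.0 and unified.max() <= 1.0)
```

Several method-specific score sets processed this way can then be combined with `np.mean` across methods.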
3 DATA
The Nigerian football players' statistics data from 1997 to July 2015 used for this research was
retrieved from the online database of the Association of Football Statisticians (AFS) (AFS,
2015). The instances are records of 314 players consisting of the following variables: name of
player, number of appearances, number of substitutions, number of goals scored, number of
penalties scored, position of the player on the field (goalkeeper, defender, midfielder, forward)
and the number of red and yellow cards received.
Table 1 shows separate summaries for all positions on the field. It can be seen here
that there are 17 goalkeepers, 97 defenders, 100 midfielders and 100 forwards. The player
with the highest number of appearances is a goalkeeper. It was further observed that forwards
scored the most goals, with players scoring up to 21 goals, and that neither goalkeepers nor
defenders scored a penalty.
Furthermore, it was observed that defenders received more yellow/red cards, which is to
be expected because a defender is an outfield player whose primary role is to prevent the
opposing team from scoring. Goalkeepers received the lowest number of yellow cards and
received no red cards (see Figure 1 and Figure 2).
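Summaries of this kind can be reproduced with a pandas group-by; the column names below are assumptions mirroring the variables listed, and the rows are made-up examples (the real dataset has 314 players):

```python
import pandas as pd

# Hypothetical rows with assumed column names.
df = pd.DataFrame({
    "position":     ["goalkeeper", "defender", "midfielder", "forward", "forward"],
    "appearances":  [45, 30, 28, 33, 12],
    "goals":        [0, 1, 4, 21, 7],
    "yellow_cards": [1, 6, 3, 2, 1],
})

# Per-position counts and aggregates, analogous to Table 1.
summary = df.groupby("position").agg(
    players=("position", "size"),
    max_goals=("goals", "max"),
    total_yellow=("yellow_cards", "sum"),
)
print(summary.loc["forward", "max_goals"])
```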
4 RESULTS
4.1 OUTLIER DETECTION ANALYSIS VIA THE ENSEMBLE
The methods LOF, ABOD and SOD were first applied to the data, and the ENSEMBLE
was then built by combining the outlier scores from the individual methods using the
approaches described in section 2. The parameter settings for LOF (k = 30:50) and
SOD (alpha = 0.8) are as suggested by the respective authors; for ABOD, the percentage
of the data used when calculating the angle variance was set to 0.5 due to computational
time constraints. In order to declare points as outliers, a threshold setting equal to 0.1 is used.
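Under the assumption that the combined scores lie in [0, 1] (section 2.6), the declaration step reduces to a simple cutoff; this is a sketch, not the study's code:

```python
import numpy as np

def declare_outliers(ensemble_scores, threshold=0.1):
    # Indices of instances whose combined [0, 1] score exceeds
    # the cutoff; 0.1 is the threshold used in this study.
    s = np.asarray(ensemble_scores, dtype=float)
    return np.flatnonzero(s > threshold)

print(declare_outliers([0.0, 0.04, 0.62, 0.08, 0.35]))
```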
4.2 EXPERT INTERPRETATION OF RESULTS OBTAINED
BY THE ENSEMBLE
The sports analyst/journalist was shown the outliers detected by the ensemble, to confirm
the meaningfulness of the outliers found. In the qualitative evaluation he used all the available
retrospective data for these players to identify why they were detected as special cases.
The analyst agreed with about 98% of the results from the ENSEMBLE; in particular, he
identified two players as false positives, and they were therefore not labelled as outliers in
the further analysis.
5.2 Empirical Findings
The experimental results presented in this paper show that outlier ensemble analysis can
identify meaningful outliers in data, which can be of immense value in the process
of selecting players in professional sports. The benefit of outlier ensembles can clearly
be seen from the results. Results from the analysis of the Nigerian football players' statistics
data showed that ISO performed best, with very high performance-measure values. The
ENSEMBLE did not perform badly either; it made correct predictions 97% of the time.
References
AFS. (2015). Nigeria national football team statistics and records: all-time record. Associ-
ation of Football Statisticians, http://www.11v11.com/about-the-association-of
-football-statisticians-afs-/.
Aggarwal, C. (2013). Outlier analysis. Springer.
Aggarwal, C., & Yu, P. (2001). Outlier detection for high dimensional data. In Proceedings
of the ACM SIGMOD Conference on Management of Data (p. 37-46).
Anscombe, F., & Guttman, I. (1960). Rejection of outliers. Technometrics, 2 (2), 123-147.
Barbara, D., Li, Y., Couto, J., Lin, J.-L., & Jajodia, S. (2003). Bootstrapping a data mining
intrusion detection system. Symposium on Applied Computing.
Breiman, L. (2001). Random forests. Machine Learning, 45 (1), 5-32.
Breunig, M., Kriegel, H.-P., Ng, R., & Sander, J. (2000). LOF: Identifying density-based
local outliers. In Proceedings of the ACM SIGMOD Conference on Management of
Data (p. 93-104). Dallas, TX.
Desforges, M., Jacob, P., & Cooper, J. (1998). Applications of probability density estimation
to the detection of abnormal conditions in engineering. In Proceedings of Institute of
Mechanical Engineers (p. 687-703).
Edgeworth, F. Y. (1887). On discordant observations. Philosophical Magazine, 25 (5),
364-375.
Ertoz, L., Steinbach, M., & Kumar, V. (2003). Finding topics in collections of documents:
A shared nearest neighbour approach. In Clustering and Information Retrieval (p. 83-
104).
Grubbs, F. (1969). Procedures for detecting outlying observations in samples. Technomet-
rics, 11 (1), 1-21.
Kriegel, H.-P., Kroger, P., Schubert, E., & Zimek, A. (2009). LoOP: Local outlier proba-
bilities. In Proceedings of the 18th ACM Conference on Information and Knowledge
Management (CIKM) (p. 1649-1652). Hong Kong, China.
Kriegel, H.-P., Schubert, E., & Zimek, A. (2008). ABOD: Angle-based outlier detection in
high-dimensional data. In Proceedings of the 14th ACM International Conference on
Knowledge Discovery and Data Mining (SIGKDD) (p. 444-452). Las Vegas, NV.
Kriegel, H.-P., Schubert, E., & Zimek, A. (2009). Outlier Detection in Axis-Parallel Sub-
spaces of High Dimensional Data. In Proceedings of the 13th PAKDD Conference.
Bangkok, Thailand.
Kriegel, H.-P., Schubert, E., & Zimek, A. (2011). Interpreting and unifying outlier scores.
In Proceedings of the 11th SIAM International Conference on Data Mining (SDM)
(p. 13-24). Mesa, AZ.
Lazarevic, A., & Kumar, V. (2005). Feature bagging for outlier detection. In ACM KDD
Conference.
Liu, F., Ting, K., & Zhou, Z.-H. (2008). Isolation forest. In ICDM Conference.
Polikar, R. (2009). Ensemble learning. Scholarpedia, 4 , 2776.
Smith, R., Bivens, A., Embrechts, M., Palagiri, C., & Szymanski, B. (2002). Clustering
approaches for anomaly based intrusion detection. In Proceedings of Intelligent Engi-
neering Systems through Artificial Neural Networks (p. 579-584). ASME.
Tukey, J. (1977). Exploratory data analysis. Addison-Wesley.
Zhang, K., Hutter, M., & Jin, H. (n.d.). A new local distance based outlier detection approach
for scattered real world data. In Proceedings of the 13th PAKDD conference.
Zimek, A., Campello, R. J. G. B., & Sander, J. (2013). Ensembles for unsupervised outlier
detection: Challenges and research questions. SIGKDD Explorations, 15 (1).