Problem Background: I am working on a project that involves log files similar to those found in the IT monitoring space (to my best
understanding of IT space). These log files are time-series data, organized into hundreds/thousands of rows of various parameters. Each
parameter is numeric (float) and there is a non-trivial/non-error value for each time point. My task is to monitor said log files for anomaly
detection (spikes, falls, unusual patterns with some parameters being out of sync, strange 1st/2nd/etc. derivative behavior, etc.).
On a similar assignment, I have tried Splunk with Prelert, but I am exploring open-source options at the moment.
Constraints: I am limiting myself to Python because I know it well, and would like to delay the switch to R and the associated learning curve.
Unless there seems to be overwhelming support for R (or other languages/software), I would like to stick to Python for this task.
Also, I am working in a Windows environment for the moment. I would like to continue to sandbox in Windows on small-sized log files but can
move to Linux environment if needed.
Resources:
1. Python or R for implementing machine learning algorithms for fraud detection. Some info here is helpful, but unfortunately, I am struggling
to find the right package because:
2. Twitter's "AnomalyDetection" is in R, and I want to stick to Python. Furthermore, the Python port pyculiarity seems to cause issues in
implementing in Windows environment for me.
3. Skyline, my next attempt, seems to have been pretty much discontinued (from github issues). I haven't dived deep into this, given how
little support there seems to be online.
4. scikit-learn I am still exploring, but this seems to be much more manual. The down-in-the-weeds approach is OK by me, but my
background in learning tools is weak, so would like something like a black box for the technical aspects like algorithms, similar to
Splunk+Prelert.
Problem Definition and Questions: I am looking for open-source software that can help me with automating the process of anomaly
detection from time-series log files in Python via packages or libraries.
5. Do such things exist to assist with my immediate task, or are they imaginary in my mind?
6. Can anyone assist with concrete steps to help me to my goal, including background fundamentals or concepts?
7. Is this the best StackExchange community to ask in, or are Stats, Math, or even Security or Stack Overflow better options?
EDIT [2015-07-23] Note that the latest update to pyculiarity seems to have fixed it for the Windows environment! I have yet to confirm, but it should
be another useful tool for the community.
EDIT [2016-01-19] A minor update. I have not had time to work on this and research, but I am taking a step back to understand the fundamentals
of this problem before continuing to research specific details. For example, two concrete steps that I am taking are:
1. Starting with the Wikipedia articles for anomaly detection [https://en.wikipedia.org/wiki/Anomaly_detection], understanding fully, and then
either moving up or down in concept hierarchy of other linked Wikipedia articles, such as
[https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm], and then to [https://en.wikipedia.org/wiki/Machine_learning].
2. Exploring techniques in the great surveys done by Chandola et al 2009 "Anomaly Detection: A Survey"
[http://www-users.cs.umn.edu/~banerjee/papers/09/anomaly.pdf] and Hodge et al 2004 "A Survey of Outlier Detection Methodologies"
[http://eprints.whiterose.ac.uk/767/1/hodgevj4.pdf].
Once the concepts are better understood (I hope to play around with toy examples as I go to develop the practical side as well), I hope to
understand which open source Python tools are better suited for my problems.
h2o library not importing in this module. – user14945 Dec 23 '15 at 6:08
Your problem is ill defined. What constitutes an anomaly can have a lot of different meanings. Is it deviation of the
mean? Is it certain patterns of behaviour? Different methods apply in each case. You'll need to look into "outlier
detection" if the anomaly is deviation from the mean. If you are looking for specific patterns you'd be much better
served with a supervised learning algorithm such as neural networks. – Willem van Doesburg Aug 21 '16 at 20:40
5 Answers
Basic Way
Derivative! If the deviation of your signal from its past and future is high, you most probably have
an event. This can be extracted by finding large zero crossings in the derivative of the signal.
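A minimal sketch of the derivative idea, assuming NumPy and a one-dimensional signal sampled at regular intervals. It simplifies the description above by flagging large jumps in the first difference rather than locating zero crossings. The function name derivative_events and the threshold factor k are illustrative choices, not part of the original answer.

```python
import numpy as np


def derivative_events(signal, k=3.0):
    """Return indices of samples reached by an unusually large jump.

    k is a hypothetical sensitivity knob: a jump is flagged when its
    magnitude exceeds k standard deviations of all jumps in the signal.
    """
    signal = np.asarray(signal, dtype=float)
    diff = np.diff(signal)  # discrete first derivative
    threshold = k * np.std(diff)
    # diff[i] = signal[i + 1] - signal[i], so add 1 to index the sample after the jump
    return np.flatnonzero(np.abs(diff) > threshold) + 1


# Toy input: a flat signal with a single spike at index 50
signal = np.zeros(100)
signal[50] = 10.0
spike_idx = derivative_events(signal)
print(spike_idx)  # the rise into the spike and the fall out of it are both flagged
```

A single-sample spike produces two large jumps (up, then down), so it shows up as two consecutive indices. On noisy data the value of k would need tuning.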
Statistical Way
The mean of anything is its usual, baseline behavior. If something deviates from the mean, it means that
it's an event. Please note that the mean of a time series is not trivial: it is not constant but
changes as the time series changes, so you need to look at the "moving average"
instead of the overall average.
The Moving Average code can be found here. In signal processing terminology you are
applying a "low-pass" filter by applying the moving average.
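import numpy as np

# movingaverage() is the helper from the linked code; it must return
# one value per sample of TimeSEries for the indexing below to work
MOV = movingaverage(TimeSEries, 5).tolist()
STD = np.std(MOV)
events = []
ind = []
for ii in range(len(TimeSEries)):
    # an event is a sample more than one standard deviation above the moving average
    if TimeSEries[ii] > MOV[ii] + STD:
        events.append(TimeSEries[ii])
        ind.append(ii)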
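Since the linked helper is not reproduced here, the following is a self-contained sketch of one way movingaverage could be written, assuming NumPy. The names events_above_band and n_std are hypothetical. Using mode="same" keeps the output the same length as the input so the two can be compared index by index, at the cost of zero-padded (and therefore biased low) averages at the edges.

```python
import numpy as np


def movingaverage(values, window):
    """Centered moving average with the same length as the input."""
    weights = np.ones(window) / window
    # mode="same" pads with zeros, so the first and last few averages are biased low
    return np.convolve(values, weights, mode="same")


def events_above_band(series, window=5, n_std=1.0):
    """Return (indices, values) of samples above moving average + n_std * std."""
    series = np.asarray(series, dtype=float)
    mov = movingaverage(series, window)
    band = n_std * np.std(mov)
    idx = np.flatnonzero(series > mov + band)
    return idx, series[idx]


# Toy input: a constant series with one spike at index 20
series = np.ones(50)
series[20] = 10.0
idx, values = events_above_band(series)
print(idx, values)
```

Like the snippet above, this only catches upward deviations. Comparing np.abs(series - mov) against the band instead would catch drops as well.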
Probabilistic Way
These are more sophisticated, especially for people new to machine learning. The Kalman filter is a
great way to find anomalies. Simpler probabilistic approaches using "Maximum-Likelihood
Estimation" also work well, but my suggestion is to stay with the moving average idea. It works
very well in practice.
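To make the Kalman filter suggestion concrete, here is a rough sketch under stated assumptions: a scalar random-walk (local level) model, NumPy only, and hand-picked noise variances. The function name kalman_anomalies and the parameters process_var, meas_var and n_sigma are illustrative, not from the original answer. A sample is flagged when its innovation (measurement minus prediction) is larger than n_sigma standard deviations of what the filter expects.

```python
import numpy as np


def kalman_anomalies(z, process_var=1e-3, meas_var=1.0, n_sigma=3.0):
    """Flag samples with an improbably large innovation under a
    scalar random-walk Kalman filter. All variances are assumed values."""
    z = np.asarray(z, dtype=float)
    x, p = z[0], 1.0  # state estimate and its variance
    flagged = []
    for t in range(1, len(z)):
        p = p + process_var        # predict: level carries over, uncertainty grows
        innovation = z[t] - x      # measurement minus predicted level
        s = p + meas_var           # variance of the innovation
        if abs(innovation) > n_sigma * np.sqrt(s):
            flagged.append(t)
            continue               # skip the update so the outlier does not drag the estimate
        gain = p / s               # Kalman gain
        x = x + gain * innovation  # update the level estimate
        p = (1.0 - gain) * p       # update its variance
    return flagged


# Toy input: a slow sine wave with one large spike at index 120
z = 0.5 * np.sin(np.arange(200) / 10.0)
z[120] += 8.0
flagged = kalman_anomalies(z)
print(flagged)
```

In practice the two variances would be estimated from the data (for example by maximum likelihood) rather than fixed by hand, which is part of why this route is more involved than the moving average.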
https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series
Thanks for your time, but please see my first bullet of "Resources"; I have reviewed this option, and looking for
something that meets my "Constraints". – ximiki Jul 23 '15 at 13:41
To reiterate, and perhaps be more blunt, using Twitter's AnomalyDetection package is NOT an option here: Please
read the "Constraints" section more carefully. I do not mean to denounce any sincere attempts to help on this, but the
question is strictly for Python-based packages. Therefore, future voters, PLEASE do not upvote this answer because
it is not a usable option. I would recommend clearing the current 2 votes for this via downvoting, but perhaps this is
unethical within the Stackexchange community and I do not want to catch any flak. – ximiki Jul 29 '15 at 16:30
Again, I apologize to harp on this, but I am simply trying to make this question very clear and usable for others
encountering a similar problem, and don't want them to go on a wild goose chase. – ximiki Jul 29 '15 at 16:31
h2o has an anomaly detection module, and traditionally the code is available in R. However,
beyond version 3 it has a similar module available in Python as well, and since h2o is open
source it might fit your bill.
import sys
sys.path.insert(1, "../../../")
import h2o

train = h2o.import_frame(h2o.locate("bigdata/laptop/mnist/train.csv.gz"))
test = h2o.import_frame(h2o.locate("bigdata/laptop/mnist/test.csv.gz"))
predictors = range(0, 784)
resp = 784

# 1) (not shown in this excerpt) ae_model is the autoencoder model trained on train

# 2) DETECT OUTLIERS
# anomaly app computes the per-row reconstruction error for the test data set
# (passing it through the autoencoder model and computing mean square error (MSE) for each row)
test_rec_error = ae_model.anomaly(test)

# 3) VISUALIZE OUTLIERS
# Let's look at the test set points with low/median/high reconstruction errors.
# We will now visualize the original test set points and their reconstructions obtained
# by propagating them through the narrow neural net.
# Convert the test data into its autoencoded representation (pass through narrow neural net)
test_recon = ae_model.predict(test)

if __name__ == '__main__':
    # anomaly is not defined in this excerpt; presumably the test function wrapping the steps above
    h2o.run_test(sys.argv, anomaly)
edited Jul 23 '15 at 16:09 answered Jul 22 '15 at 21:42
0xF
302 1 7
Thanks! I haven't considered this package yet - I will add it to the list of candidates. To clarify, when you say "beyond
version 3 it has similar module available in python as well", do you know if h2o's anomaly detection module (beyond
ver 3) is available in Python, or some other module? – ximiki Jul 23 '15 at 13:52
1 @ximik Well, I revisited the Python documentation of their latest version 3.0.0.26
(h2o-release.s3.amazonaws.com/h2o/rel-shannon/26/docs-website/…) and it seems like h2o.anomaly is not yet available,
unlike in its R API. I've raised the question in their Google
group (groups.google.com/forum/#!topic/h2ostream/uma3UdpanEI) and you can follow that. – 0xF Jul 23 '15 at 15:48
1 Well, the h2o support group has answered the question, and anomaly is available in Python as well. An example is
available here: github.com/h2oai/h2o-3/blob/master/h2o-py/tests/testdir_algos/… – 0xF Jul 23 '15 at 16:07
Perfect! thank you for investigating. i'll update this post with results. – ximiki Jul 23 '15 at 16:58
I assume the feature you use to detect abnormality is one row of data in a logfile. If so, scikit-learn
is your friend, and you can use it as a black box. Check the tutorials on one-class SVM and
novelty detection.
However, if your feature is an entire logfile, you first need to summarize it into a
feature vector of fixed dimension, and then apply novelty detection.
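As a hedged illustration of both cases, the sketch below uses scikit-learn's OneClassSVM for novelty detection on synthetic rows. The data, the nu value, and the helper summarize_log are assumptions made up for the example. summarize_log shows one simple way to turn a whole logfile (rows by parameters) into a fixed-length feature vector.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Case 1: each row of the logfile is one feature vector (4 parameters here).
normal_rows = rng.normal(0.0, 1.0, size=(500, 4))     # stand-in for known-good log rows
new_rows = np.vstack([
    rng.normal(0.0, 1.0, size=(5, 4)),                # rows that resemble the training data
    np.full((1, 4), 8.0),                             # one row far outside the normal range
])

model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(normal_rows)                                # learn what "normal" looks like
labels = model.predict(new_rows)                      # +1 = inlier, -1 = outlier
print(labels)


# Case 2: the feature is an entire logfile, so summarize it to a fixed length first.
def summarize_log(log):
    """Hypothetical summary: per-parameter mean, std, min and max."""
    log = np.asarray(log, dtype=float)
    return np.concatenate([log.mean(axis=0), log.std(axis=0),
                           log.min(axis=0), log.max(axis=0)])


summary = summarize_log(rng.normal(size=(1000, 4)))   # 4 statistics x 4 parameters
print(summary.shape)
```

The summaries of many known-good logfiles could then be stacked into a matrix and passed to the same fit and predict calls. The choice of summary statistics is a design decision that depends on which kinds of anomalies matter.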
I am currently at the same stage as you: I am doing some research to find the best option for
anomaly detection.
What I have found, I think, best matches your need and is better than what you have already
seen (i.e., Twitter's AnomalyDetection and Skyline).
The better option I have found is Numenta's NAB (Numenta Anomaly Benchmark). It also has very
good community support, and a plus point for you is that it is open source and developed in Python. You
can add your own algorithm to it.
If you find a better option, please tell me. I am also on the same path.
Best of luck!
Thanks for the valuable info! I will definitely check this out. – ximiki Feb 4 '16 at 14:05
2 I just wanted to return and comment on how applicable NAB seems to my problem. The only drawback I can see is
that this is only for univariate (one column) time-series anomaly detection, but what about multivariate (many
columns)? Thank you for this suggestion, I am going to push it to the shortlist for solution candidates. – ximiki Jul 8
'16 at 15:36