
Open source Anomaly Detection in Python

Problem Background: I am working on a project that involves log files similar to those found in the IT monitoring space (to my best
understanding of IT space). These log files are time-series data, organized into hundreds/thousands of rows of various parameters. Each
parameter is numeric (float) and there is a non-trivial/non-error value for each time point. My task is to monitor said log files for anomaly
detection (spikes, falls, unusual patterns with some parameters being out of sync, strange 1st/2nd/etc. derivative behavior, etc.).

On a similar assignment, I have tried Splunk with Prelert, but I am exploring open-source options at the moment.

Constraints: I am limiting myself to Python because I know it well, and would like to delay the switch to R and the associated learning curve.
Unless there seems to be overwhelming support for R (or other languages/software), I would like to stick to Python for this task.

Also, I am working in a Windows environment for the moment. I would like to continue to sandbox in Windows on small-sized log files but can
move to Linux environment if needed.

Resources: I have checked out the following with dead-ends as results:

1. Python or R for implementing machine learning algorithms for fraud detection. Some info here is helpful, but unfortunately, I am struggling
to find the right package because:
2. Twitter's "AnomalyDetection" is in R, and I want to stick to Python. Furthermore, the Python port pyculiarity seems to cause issues
when I try to set it up in a Windows environment.
3. Skyline, my next attempt, seems to have been pretty much discontinued (from github issues). I haven't dived deep into this, given how
little support there seems to be online.
4. scikit-learn I am still exploring, but this seems to be much more manual. The down-in-the-weeds approach is OK by me, but my
background in learning tools is weak, so I would like something like a black box for the technical aspects such as algorithms, similar to
Splunk+Prelert.

Problem Definition and Questions: I am looking for open-source software that can help me with automating the process of anomaly
detection from time-series log files in Python via packages or libraries.

1. Do such things exist to assist with my immediate task, or are they imaginary in my mind?
2. Can anyone assist with concrete steps to help me to my goal, including background fundamentals or concepts?
3. Is this the best StackExchange community to ask in, or would Stats, Math, or even Security or Stack Overflow be better options?

EDIT [2015-07-23] Note that the latest update to pyculiarity seems to fix the issues in the Windows environment! I have yet to confirm this,
but it should be another useful tool for the community.

EDIT [2016-01-19] A minor update. I have not had time to work on this and research, but I am taking a step back to understand the fundamentals
of this problem before continuing to research specific details. For example, two concrete steps that I am taking are:

1. Starting with the Wikipedia articles for anomaly detection [https://en.wikipedia.org/wiki/Anomaly_detection], understanding fully, and then
either moving up or down in concept hierarchy of other linked Wikipedia articles, such as
[https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm], and then to [https://en.wikipedia.org/wiki/Machine_learning].
2. Exploring techniques in the great surveys done by Chandola et al 2009 "Anomaly Detection: A Survey"
[http://www-users.cs.umn.edu/~banerjee/papers/09/anomaly.pdf] and Hodge et al 2004 "A Survey of Outlier Detection Methodologies"
[http://eprints.whiterose.ac.uk/767/1/hodgevj4.pdf].

Once the concepts are better understood (I hope to play around with toy examples as I go to develop the practical side as well), I hope to
understand which open source Python tools are better suited for my problems.

machine-learning data-mining python anomaly-detection

edited Jan 19 '16 at 20:19 asked Jul 22 '15 at 14:26


ximiki
I recommend these videos if you are just beginning Scikit: github.com/justmarkham/scikit-learn-videos – Harvey Jul 23
'15 at 21:30

h2o library not importing in this module. – user14945 Dec 23 '15 at 6:08

Your problem is ill defined. What constitutes an anomaly can have a lot of different meanings. Is it deviation of the
mean? Is it certain patterns of behaviour? Different methods apply in each case. You'll need to look into "outlier
detection" if the anomaly is deviation from the mean. If you are looking for specific patterns you'd be much better
served with a supervised learning algorithm such as neural networks. – Willem van Doesburg Aug 21 '16 at 20:40

5 Answers

Anomaly Detection or Event Detection can be done in different ways:

Basic Way

Derivative! If the deviation of your signal from its past and future is high, you most probably have
an event. This can be extracted by finding large zero crossings in the derivative of the signal.
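
As a rough illustration of the derivative idea, the sketch below thresholds the size of the first difference of the signal. This is a simplification of the zero-crossing description above, not the answerer's own code. The function name `derivative_events`, the `n_sigmas` parameter, and the use of the median absolute deviation as a robust scale are all assumptions made for this example.

```python
import numpy as np

def derivative_events(signal, n_sigmas=3.0):
    """Return indices where the first difference is unusually large (illustrative sketch)."""
    signal = np.asarray(signal, dtype=float)
    diff = np.diff(signal)  # discrete first derivative
    # Robust scale (assumption): the median absolute deviation, so that the
    # spikes themselves do not inflate the threshold.
    med = np.median(diff)
    mad = np.median(np.abs(diff - med))
    scale = 1.4826 * mad if mad > 0 else np.std(diff)
    if scale == 0:
        return np.array([], dtype=int)
    # diff[i] = signal[i + 1] - signal[i], so report the later sample.
    return np.flatnonzero(np.abs(diff - med) > n_sigmas * scale) + 1

# Hypothetical data: a slow sine wave with one spike added at sample 50.
t = np.arange(100)
signal = np.sin(t / 10.0)
signal[50] += 5.0
print(derivative_events(signal))
```

A single spike produces two large differences, one on the way up and one on the way back down, so both the spike and the sample after it are reported.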

Statistical Way

The mean of anything is its usual, basic behavior. If something deviates from the mean, it
is an event. Please note that the mean of a time series is not trivial: it is not a constant but
changes as the time series changes, so you need to look at the "moving average"
instead of the overall average.

The Moving Average code can be found here. In signal processing terminology you are
applying a "Low-Pass" filter by applying the moving average.

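You can follow the code below:

import numpy as np
def movingaverage(values, window):
    # Assumed stand-in for the linked helper; mode="same" keeps the input length.
    return np.convolve(values, np.ones(window) / window, mode="same")
MOV = movingaverage(TimeSEries, 5)
STD = np.std(MOV)
events = []
ind = []
for ii in range(len(TimeSEries)):
    # Flag points more than one standard deviation above the moving average.
    if TimeSEries[ii] > MOV[ii] + STD:
        events.append(TimeSEries[ii])
        ind.append(ii)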
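
The snippet above only catches values above the moving average. Since the question also asks about falls, here is a hedged two-sided variant that thresholds the residual (series minus moving average) in both directions. The function name `moving_average_events`, the default window of 5, and the 3-sigma threshold are assumptions chosen for illustration.

```python
import numpy as np

def moving_average_events(series, window=5, n_sigmas=3.0):
    """Indices where the series is far from its moving average, in either direction."""
    series = np.asarray(series, dtype=float)
    kernel = np.ones(window) / window
    # mode="same" keeps the output aligned with the input; the first and last
    # window // 2 values are distorted by the implicit zero padding.
    smooth = np.convolve(series, kernel, mode="same")
    residual = series - smooth
    half = window // 2
    valid = residual[half:len(series) - half]  # ignore the padded edges
    threshold = n_sigmas * np.std(valid)
    return np.flatnonzero(np.abs(valid) > threshold) + half

# Hypothetical data: a smooth oscillation with a sudden fall at sample 80.
t = np.arange(200)
series = 10.0 + np.sin(t / 5.0)
series[80] -= 6.0
print(moving_average_events(series))
```

Because the outlier itself is part of the averaging window, its immediate neighbours can also be pushed over the threshold. A longer window or a rolling median would reduce that leakage.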

Probabilistic Way

These methods are more sophisticated, especially for people new to Machine Learning. A Kalman Filter is a
great way to find anomalies. Simpler probabilistic approaches using "Maximum-Likelihood
Estimation" also work well, but my suggestion is to stay with the moving average idea. It works
very well in practice.
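
For readers curious what the Kalman filter idea might look like, here is a minimal sketch under a strong assumption: the signal follows a local-level model (a slowly drifting level plus measurement noise). A sample is flagged when its one-step-ahead prediction error is large relative to the variance the filter expects. The function name `kalman_events` and the variance values are illustrative guesses, not tuned settings.

```python
import numpy as np

def kalman_events(series, process_var=1e-3, measurement_var=1.0, n_sigmas=3.0):
    """Flag samples whose one-step-ahead prediction error is improbably large.

    Assumes a scalar local-level (random walk plus noise) model; both
    variances are illustrative and would need tuning on real data.
    """
    series = np.asarray(series, dtype=float)
    x, p = series[0], 1.0  # state estimate and its variance
    events = []
    for k in range(1, len(series)):
        p_pred = p + process_var          # predict: the level is expected to persist
        innovation = series[k] - x        # measurement minus prediction
        s = p_pred + measurement_var      # variance of the innovation
        if abs(innovation) > n_sigmas * np.sqrt(s):
            events.append(k)
            p = p_pred                    # skip the update so the outlier does not drag the state
            continue
        gain = p_pred / s                 # Kalman gain
        x = x + gain * innovation         # correct the state with the measurement
        p = (1.0 - gain) * p_pred
    return events

# Hypothetical data: a flat signal with one spike at sample 30.
series = np.full(60, 10.0)
series[30] = 50.0
print(kalman_events(series))
```

Skipping the update on flagged samples is a design choice: it keeps isolated spikes from contaminating the estimate, but a genuine, permanent level shift would then keep being flagged until the state is reset.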

I hope I could help :) Good Luck!

edited Jul 22 '15 at 19:01 answered Jul 22 '15 at 18:55


Kasra Manshaei
Thank you for your efforts on the deep discussion. Though programming this doesn't seem too bad (quite interesting, I
may say, to deep dive into the algorithms), I am curious of packages that already are available. Do you know of
anything that exists that is simple to install? Note this is not the same as simple to implement, which I understand
cannot be guaranteed. If I can get my environment functional, I believe I can finesse it based on examples for my task.
– ximiki Jul 23 '15 at 13:47

Maybe this helps, since you mentioned steady states:

https://github.com/twitter/AnomalyDetection

https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series

answered Jul 22 '15 at 19:52


Alexandru Daia

Thanks for your time, but please see my first bullet of "Resources"; I have reviewed this option, and looking for
something that meets my "Constraints". – ximiki Jul 23 '15 at 13:41

To reiterate, and perhaps be more blunt, using Twitter's AnomalyDetection package is NOT an option here: Please
read the "Constraints" section more carefully. I do not mean to denounce any sincere attempts to help on this, but the
question is strictly for Python-based packages. Therefore, future voters, PLEASE do not upvote this answer because
it is not a usable option. I would recommend clearing the current 2 votes for this via downvoting, but perhaps this is
unethical within the Stack Exchange community and I do not want to catch any flak. – ximiki Jul 29 '15 at 16:30

Again, I apologize to harp on this, but I am simply trying to make this question very clear and usable for others
encountering a similar problem, and don't want them to go on a wild goose chase. – ximiki Jul 29 '15 at 16:31

h2o has an anomaly detection module, and traditionally the code has been available in R. However,
beyond version 3 it has a similar module available in Python as well, and since h2o is open
source it might fit your bill.

You can see a working example here:

import sys
sys.path.insert(1, "../../../")
import h2o

def anomaly(ip, port):
    h2o.init(ip, port)

    print("Deep Learning Anomaly Detection MNIST")

    train = h2o.import_frame(h2o.locate("bigdata/laptop/mnist/train.csv.gz"))
    test = h2o.import_frame(h2o.locate("bigdata/laptop/mnist/test.csv.gz"))

    predictors = range(0, 784)
    resp = 784

    # unsupervised -> drop the response column (digit: 0-9)
    train = train[predictors]
    test = test[predictors]

    # 1) LEARN WHAT'S NORMAL
    # train unsupervised Deep Learning autoencoder model on train_hex
    ae_model = h2o.deeplearning(x=train[predictors], training_frame=train,
                                activation="Tanh", autoencoder=True,
                                hidden=[50], l1=1e-5, ignore_const_cols=False,
                                epochs=1)

    # 2) DETECT OUTLIERS
    # anomaly app computes the per-row reconstruction error for the test data set
    # (passing it through the autoencoder model and computing mean square error
    # (MSE) for each row)
    test_rec_error = ae_model.anomaly(test)

    # 3) VISUALIZE OUTLIERS
    # Let's look at the test set points with low/median/high reconstruction errors.
    # We will now visualize the original test set points and their reconstructions
    # obtained by propagating them through the narrow neural net.

    # Convert the test data into its autoencoded representation (pass through
    # narrow neural net)
    test_recon = ae_model.predict(test)

    # In python, the visualization could be done with tools like numpy/matplotlib
    # or numpy/PIL

if __name__ == '__main__':
    h2o.run_test(sys.argv, anomaly)
edited Jul 23 '15 at 16:09 answered Jul 22 '15 at 21:42
0xF

Thanks! I haven't considered this package yet - I will add it to the list of candidates. To clarify, when you say "beyond
version 3 it has similar module available in python as well", do you know if h2o's anomaly detection module (beyond
ver 3) is available in Python, or some other module? – ximiki Jul 23 '15 at 13:52

@ximik Well, I revisited the Python documentation of their latest version 3.0.0.26
(h2o-release.s3.amazonaws.com/h2o/rel-shannon/26/docs-website/…) and it seems like h2o.anomaly is not yet available,
unlike in its R API. I've raised the question in their Google
group (groups.google.com/forum/#!topic/h2ostream/uma3UdpanEI) and you can follow that. – 0xF Jul 23 '15 at 15:48

Well, the h2o support group has answered the question, and anomaly is available in Python as well. An example is
available here: github.com/h2oai/h2o-3/blob/master/h2o-py/tests/testdir_algos/… – 0xF Jul 23 '15 at 16:07

Perfect! thank you for investigating. i'll update this post with results. – ximiki Jul 23 '15 at 16:58

I assume the feature you use to detect abnormality is one row of data in a log file. If so, scikit-learn
is your friend and you can use it as a black box. Check the tutorials on one-class SVM and
novelty detection.

However, if your feature is an entire log file, you first need to summarize it into a
feature vector of fixed dimension, and then apply novelty detection.
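
A minimal sketch of the row-wise case with scikit-learn's `OneClassSVM` might look like the following. The training data is synthetic and hypothetical (500 "normal" rows with 3 numeric parameters), and the `nu` and `gamma` settings are illustrative rather than recommended values.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Hypothetical training data: 500 "normal" log rows with 3 numeric parameters.
normal_rows = rng.normal(loc=0.0, scale=1.0, size=(500, 3))

# nu is roughly the fraction of training rows allowed to fall outside the boundary.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(normal_rows)

new_rows = np.array([
    [0.1, -0.2, 0.0],  # close to the training data
    [8.0, 8.0, 8.0],   # far from anything seen in training
])
labels = model.predict(new_rows)  # +1 = looks normal, -1 = novelty
print(labels)
```

In practice the parameters would usually be standardized first, since an RBF kernel is sensitive to columns measured on very different scales.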

answered Jul 24 '15 at 5:02


Rex

I am currently at the same stage as you. I am looking for the best option for anomaly detection and doing
some research.

What I have found, I think, best matches your needs and compares well with what you have
already seen, i.e., Twitter's AnomalyDetection and Skyline.

The better option I have found is Numenta's NAB (Numenta Anomaly Benchmark). It also has very
good community support, and a plus point for you is that it is open source and developed in Python. You
can add your own algorithm to it.

In terms of algorithms, I found LOF or CBLOF to be good options.
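
For reference, LOF (Local Outlier Factor) is available in scikit-learn as `LocalOutlierFactor`. The sketch below runs it on synthetic, hypothetical data. It is independent of NAB and does not cover CBLOF, and `n_neighbors=20` is simply the library default written out explicitly.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
# Hypothetical data: 200 ordinary rows plus one row far from the rest.
rows = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(rows)          # -1 = outlier, +1 = inlier
scores = -lof.negative_outlier_factor_  # larger = more anomalous
print(labels[-1], scores.argmax())
```

LOF compares the local density around each row with the density around its neighbours, so it treats rows as unordered points. Time-series structure would have to be encoded in the features, for example as lagged values or differences.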

So, check NAB out; it may help you: https://github.com/numenta/nab

If you find a better option, please tell me. I am on the same path too.

Best of luck!!

answered Feb 3 '16 at 6:14


Divyang Shah

Thanks for the valuable info! I will definitely check this out. – ximiki Feb 4 '16 at 14:05

I just wanted to return and comment on how applicable NAB seems to my problem. The only drawback I can see is
that this is only for univariate (one column) time-series anomaly detection, but what about multivariate (many
columns)? Thank you for this suggestion, I am going to push it to the shortlist for solution candidates. – ximiki Jul 8 '16 at 15:36
