A Meta-Analysis of the Anomaly Detection Problem

This article provides a thorough meta-analysis of the anomaly detection problem. To accomplish this, we first identify approaches to benchmarking anomaly detection algorithms across the literature and produce a large corpus of anomaly detection benchmarks that vary in their construction across several dimensions we deem important to real-world applications: (a) point difficulty, (b) relative frequency of anomalies, (c) clusteredness of anomalies, and (d) relevance of features. We apply a representative set of anomaly detection algorithms to this corpus, yielding a very large collection of experimental results. We analyze these results to understand many phenomena observed in previous work. First, we observe the effects of experimental design on experimental results. Second, we evaluate results with two metrics, area under the ROC curve (ROC AUC) and Average Precision. We employ statistical hypothesis testing to demonstrate the value (or lack thereof) of our benchmarks. We then offer several approaches to summarizing our experimental results, drawing several conclusions about the impact of our methodology as well as the strengths and weaknesses of some algorithms. Last, we compare results against a trivial solution as an alternate means of normalizing the reported performance of algorithms. The intended contributions of this article are many: in addition to providing a large publicly available corpus of anomaly detection benchmarks, we provide an ontology for describing anomaly detection contexts, a methodology for controlling various aspects of benchmark creation, guidelines for future experimental design, and a discussion of the many potential pitfalls of trying to measure success in this field.
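As an illustration of the two metrics named above, the following minimal Python sketch computes ROC AUC and Average Precision for a small set of anomaly scores and rescales each against a chance-level reference. It is not the article's implementation: the function names, the toy data, and the choice of reference (0.5 for ROC AUC, and the anomaly frequency as an approximation of chance-level Average Precision) are assumptions made for this example, and the abstract does not specify which trivial solution the authors use.

```python
def roc_auc(labels, scores):
    """ROC AUC: probability that a random anomaly (label 1) is scored
    higher than a random normal point (label 0); ties count as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@k taken at the rank k of each true anomaly,
    with points ranked by descending score (ties keep input order)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


def lift_over_baseline(value, baseline):
    """Hypothetical normalization: 0 means chance level, 1 means perfect."""
    return (value - baseline) / (1.0 - baseline)


# Toy benchmark (made up for illustration): 2 anomalies among 8 points.
labels = [0, 0, 1, 0, 1, 0, 0, 0]
scores = [0.10, 0.40, 0.90, 0.20, 0.35, 0.30, 0.05, 0.15]

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)

# Assumed chance-level references: 0.5 for ROC AUC; the anomaly frequency
# is only an approximation of the expected AP of a random ranking.
anomaly_rate = sum(labels) / len(labels)
auc_lift = lift_over_baseline(auc, 0.5)
ap_lift = lift_over_baseline(ap, anomaly_rate)

print(f"ROC AUC = {auc:.3f}, lift over chance = {auc_lift:.3f}")
print(f"AP      = {ap:.3f}, lift over chance = {ap_lift:.3f}")
```

The chance-level value of Average Precision depends on the relative frequency of anomalies, whereas that of ROC AUC does not, which is one reason a raw score can be hard to compare across benchmarks with different anomaly rates.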
