Systematic construction of anomaly detection benchmarks from real data

Research in anomaly detection suffers from a lack of realistic, publicly available problem sets. This paper discusses the properties such problem sets should possess. It then introduces a methodology for transforming existing classification data sets into ground-truthed benchmark data sets for anomaly detection. The methodology produces data sets that vary along three important dimensions: (a) point difficulty, (b) relative frequency of anomalies, and (c) clusteredness. We use the generated data sets to benchmark several popular anomaly detection algorithms under a range of conditions.
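As a rough illustration of the general idea of deriving a ground-truthed anomaly detection benchmark from a classification data set, the sketch below treats some classes as "normal" and subsamples the remaining points as anomalies at a chosen relative frequency. This is not the paper's procedure: the function name `make_benchmark`, its parameters, and the sampling rule are all hypothetical, and the sketch controls only dimension (b). Point difficulty and clusteredness are not modeled here.

```python
import random


def make_benchmark(points, labels, normal_classes, anomaly_rate, seed=0):
    """Hypothetical helper: build (point, is_anomaly) pairs from labeled data.

    Points whose label is in `normal_classes` are kept as normal (0).
    Points from the other classes are subsampled as anomalies (1) so that
    they make up roughly `anomaly_rate` of the resulting benchmark.
    """
    if not 0 < anomaly_rate < 1:
        raise ValueError("anomaly_rate must be strictly between 0 and 1")

    rng = random.Random(seed)
    normals = [p for p, y in zip(points, labels) if y in normal_classes]
    candidates = [p for p, y in zip(points, labels) if y not in normal_classes]

    # Solve n_anom / (n_normal + n_anom) = anomaly_rate for n_anom.
    n_anom = round(anomaly_rate * len(normals) / (1 - anomaly_rate))
    # Cannot draw more anomalies than there are candidate points.
    n_anom = min(n_anom, len(candidates))

    anomalies = rng.sample(candidates, n_anom)
    benchmark = [(p, 0) for p in normals] + [(p, 1) for p in anomalies]
    rng.shuffle(benchmark)
    return benchmark


if __name__ == "__main__":
    # Toy data (assumed for illustration): 95 points of class "a", 20 of class "b".
    pts = [(float(i), 0.0) for i in range(95)] + [(float(i), 9.0) for i in range(20)]
    lbl = ["a"] * 95 + ["b"] * 20
    bench = make_benchmark(pts, lbl, normal_classes={"a"}, anomaly_rate=0.05)
    print(len(bench), sum(flag for _, flag in bench))
```

Because the ground-truth flags come directly from the original class labels, any detector's scores on such a benchmark can be evaluated against them without manual annotation.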
