A study on anomaly detection ensembles

Abstract. An anomaly, or outlier, is an object exhibiting differences that suggest it belongs to an as-yet undefined class or category. Early detection of anomalies is often of great importance because they may correspond to events such as fraud, spam, or device malfunctions. Automating the creation of a ranking or list of deviations saves time and reduces the cognitive load on the individuals or groups responsible for responding to such events. Over the years, many anomaly and outlier metrics have been developed. In this paper we propose a clustering-based score ensembling method for outlier detection. Using benchmark datasets, we quantitatively evaluate the robustness and accuracy of different ensemble strategies. We find that ensembling strategies offer only limited value for increasing overall performance, but provide robustness by negating the influence of severely underperforming models.
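
As an illustration of score ensembling in general, the following minimal sketch combines the outputs of several detectors by rank normalization followed by a mean or median. It is not the clustering-based method proposed in the paper, whose details are not given in the abstract. The function names (`rank_normalize`, `ensemble_scores`) and the score values are hypothetical and chosen only to show how a robust combination rule can limit the influence of one severely underperforming detector.

```python
import numpy as np


def rank_normalize(scores):
    """Map raw outlier scores to [0, 1] by rank (higher = more anomalous)."""
    scores = np.asarray(scores, dtype=float)
    # Double argsort yields the rank of each element (0 = lowest score).
    # Ties are broken arbitrarily; the toy data below contains none.
    ranks = scores.argsort().argsort()
    return ranks / (len(scores) - 1)


def ensemble_scores(score_matrix, how="median"):
    """Combine per-detector scores (rows = detectors, columns = objects)."""
    normalized = np.vstack([rank_normalize(row) for row in score_matrix])
    if how == "mean":
        return normalized.mean(axis=0)
    if how == "median":
        return np.median(normalized, axis=0)
    raise ValueError(f"unknown combination rule: {how}")


# Hypothetical scores for six objects; object 5 is assumed to be the anomaly.
scores = np.array([
    [0.1, 0.2, 0.15, 0.3, 0.25, 0.9],   # detector A
    [1.0, 2.0, 1.5, 2.5, 3.0, 9.0],     # detector B, on a different scale
    [0.9, 0.1, 0.8, 0.2, 0.7, 0.05],    # detector C, assumed to perform badly
])

combined = ensemble_scores(scores, how="median")
top = int(np.argmax(combined))  # index of the highest-ranked object
```

Rank normalization is used here only because the detectors report scores on different scales; other unification schemes would serve the same purpose. The median is one simple way to keep a single poor detector from dominating the combined ranking.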
