A Comparative Evaluation of Anomaly Explanation Algorithms

Detection of anomalies (i.e., outliers) in multi-dimensional data is a well-studied subject in machine learning. Unfortunately, unsupervised detectors provide no explanation of why a data point was considered abnormal or which of its features (i.e., subspaces) best exhibit its outlyingness. Such outlier explanations are crucial for diagnosing the root cause of data anomalies and for enabling corrective actions that prevent or remedy their effects in downstream data processing. In this work, we present a comprehensive framework for comparing unsupervised outlier explanation algorithms that are domain- and detector-agnostic. Using real and synthetic datasets, we assess the effectiveness and efficiency of two point explanation algorithms (Beam [17] and RefOut [14]), which rank the subspaces that best explain the outlyingness of individual data points, and two explanation summarization algorithms (LookOut [24] and HiCS [25]), which rank the subspaces that best separate as many outlier points as possible from the inliers. To the best of our knowledge, this is the first detailed evaluation of existing explanation algorithms, aiming to uncover insights missing from the literature, such as: (a) Is it effective to combine any explanation algorithm with any off-the-shelf outlier detector? (b) How is the behavior of an outlier detection and explanation pipeline affected by the number or the correlation of features in a dataset? and (c) What is the quality of summaries in the presence of outliers explained by subspaces of different dimensionality?
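
To make the detector/explainer pipeline studied here concrete, the sketch below shows one simple way such a pipeline can be wired together: an off-the-shelf detector (IsolationForest from scikit-learn [47], also exposed by PyOD [21]) flags a point, and candidate feature subspaces are then ranked by how abnormal that point appears within each of them. This is only an illustrative, simplified stand-in for the point explanation algorithms actually evaluated (Beam [17], RefOut [14]); the toy data, the exhaustive 2-D subspace enumeration, and the variable names are assumptions made for the example.

```python
# Minimal sketch (NOT the evaluated Beam/RefOut implementations; data and
# names are illustrative assumptions): a point flagged in the full feature
# space is re-scored in every 2-D subspace, and subspaces are ranked by how
# abnormal the point looks in them.
from itertools import combinations

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))       # toy inliers in 6 dimensions
X[0, [2, 4]] = [6.0, -6.0]          # make one point anomalous only in features 2 and 4

# 1) Detect: flag the most anomalous point with an off-the-shelf detector.
detector = IsolationForest(random_state=0).fit(X)
outlier_idx = int(np.argmin(detector.score_samples(X)))  # lower score = more abnormal

# 2) Explain: rank 2-D feature subspaces by how abnormal the flagged point is in each.
subspace_scores = {}
for subspace in combinations(range(X.shape[1]), 2):
    cols = list(subspace)
    sub_detector = IsolationForest(random_state=0).fit(X[:, cols])
    point = X[outlier_idx, cols].reshape(1, -1)
    subspace_scores[subspace] = sub_detector.score_samples(point)[0]

ranking = sorted(subspace_scores, key=subspace_scores.get)  # most explanatory first
print(f"Point {outlier_idx} is best explained by feature subspaces: {ranking[:3]}")
```

The explanation algorithms compared in this work avoid the exhaustive subspace enumeration used above: Beam [17] prunes the subspace lattice with a beam search, while RefOut [14] adaptively refines pools of candidate subspaces, which is what makes them applicable to higher-dimensional data.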

[1]  Arthur Zimek,et al.  There and back again: Outlier detection between statistical reasoning and data mining algorithms , 2018, WIREs Data Mining Knowl. Discov..

[2]  Franco Turini,et al.  A Survey of Methods for Explaining Black Box Models , 2018, ACM Comput. Surv..

[3]  Klemens Böhm,et al.  Dimension-based subspace search for outlier detection , 2018, International Journal of Data Science and Analytics.

[4]  Zhi-Hua Zhou,et al.  Isolation Forest , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[5]  Leman Akoglu,et al.  Explaining anomalies in groups with characterizing subspace rules , 2017, Data Mining and Knowledge Discovery.

[6]  Marko Robnik-Sikonja,et al.  Explaining Classifications For Individual Instances , 2008, IEEE Transactions on Knowledge and Data Engineering.

[7]  Abdul Nurunnabi,et al.  Outlier Detection in Logistic Regression: A Quest for Reliable Knowledge from Predictive Modeling and Classification , 2012, 2012 IEEE 12th International Conference on Data Mining Workshops.

[8]  Arthur Zimek,et al.  On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study , 2016, Data Mining and Knowledge Discovery.

[9]  Yue Zhao,et al.  PyOD: A Python Toolbox for Scalable Outlier Detection , 2019, J. Mach. Learn. Res..

[10]  Erik Strumbelj,et al.  Explaining prediction models and individual predictions with feature contributions , 2014, Knowledge and Information Systems.

[11]  Samuel Madden,et al.  MacroBase: Prioritizing Attention in Fast Data , 2016, SIGMOD Conference.

[12]  Maurizio Filippone,et al.  A comparative evaluation of outlier detection algorithms: Experiments and analyses , 2018, Pattern Recognit..

[13]  Hui Xiong,et al.  Enhancing data analysis with noise removal , 2006, IEEE Transactions on Knowledge and Data Engineering.

[14]  Klemens Böhm,et al.  Flexible and adaptive subspace search for outlier analysis , 2013, CIKM.

[15]  Haopeng Zhang,et al.  EXstream: Explaining Anomalies in Event Stream Monitoring , 2017, EDBT.

[16]  Percy Liang,et al.  Understanding Black-box Predictions via Influence Functions , 2017, ICML.

[17]  James Bailey,et al.  Discovering outlying aspects in large datasets , 2016, Data Mining and Knowledge Discovery.

[18]  Subutai Ahmad,et al.  Evaluating Real-Time Anomaly Detection Algorithms -- The Numenta Anomaly Benchmark , 2015, 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA).

[19]  Tomás Pevný,et al.  Loda: Lightweight on-line detector of anomalies , 2016, Machine Learning.

[20]  J. Friedman  Greedy function approximation: A gradient boosting machine , 2001 .

[21]  Erik Strumbelj,et al.  An Efficient Explanation of Individual Classifications using Game Theory , 2010, J. Mach. Learn. Res..

[22]  Herman Aguinis,et al.  Best-Practice Recommendations for Defining, Identifying, and Handling Outliers , 2013 .

[24]  Christos Faloutsos,et al.  Beyond Outlier Detection: LookOut for Pictorial Explanation , 2018, ECML/PKDD.

[25]  Klemens Böhm,et al.  HiCS: High Contrast Subspaces for Density-Based Outlier Ranking , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[26]  Kai Ming Ting,et al.  Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors , 2016, Machine Learning.

[27]  Ayman Taha,et al.  Anomaly Detection Methods for Categorical Data , 2019 .

[28]  Mu Zhu,et al.  A Relationship between the Average Precision and the Area Under the ROC Curve , 2015, ICTIR.

[29]  Arthur Zimek,et al.  ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg" , 2019, ArXiv.

[30]  Neoklis Polyzotis,et al.  Data Lifecycle Challenges in Production Machine Learning , 2018 .

[31]  Laurence A. Wolsey,et al.  Best Algorithms for Approximating the Maximum of a Submodular Set Function , 1978, Math. Oper. Res..

[32]  Cyrus Shahabi,et al.  Distance-based Outlier Detection in Data Streams , 2016, Proc. VLDB Endow..

[33]  Vassilis Christophides,et al.  A greedy feature selection algorithm for Big Data of high dimensionality , 2018, Machine Learning.

[34]  Hans-Peter Kriegel,et al.  Angle-based outlier detection in high-dimensional data , 2008, KDD.

[35]  Andrea Vedaldi,et al.  Interpretable Explanations of Black Boxes by Meaningful Perturbation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[36]  Wojciech Samek,et al.  Methods for interpreting and understanding deep neural networks , 2017, Digit. Signal Process..

[37]  Samuel Madden,et al.  Scorpion: Explaining Away Outliers in Aggregate Queries , 2013, Proc. VLDB Endow..

[38]  Charu C. Aggarwal,et al.  Subspace Outlier Detection in Linear Time with Randomized Hashing , 2016, 2016 IEEE 16th International Conference on Data Mining (ICDM).

[39]  Navindra Yadav,et al.  ExplainIt! -- A Declarative Root-cause Analysis Engine for Time Series Data , 2019, SIGMOD Conference.

[40]  Divesh Srivastava,et al.  Empirical glitch explanations , 2014, KDD.

[41]  Alessandro Rinaldo,et al.  Statistical Analysis of Nearest Neighbor Methods for Anomaly Detection , 2019, NeurIPS.

[42]  Dan Suciu,et al.  Explaining Query Answers with Explanation-Ready Databases , 2015, Proc. VLDB Endow..

[43]  Scott Lundberg,et al.  A Unified Approach to Interpreting Model Predictions , 2017, NIPS.

[44]  Carlos Guestrin,et al.  "Why Should I Trust You?": Explaining the Predictions of Any Classifier , 2016, ArXiv.

[45]  B. L. Welch  The significance of the difference between two means when the population variances are unequal , 1938, Biometrika.

[46]  Charu C. Aggarwal,et al.  An Introduction to Outlier Analysis , 2013 .

[47]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[48]  Seiichi Uchida,et al.  A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data , 2016, PloS one.

[49]  Motoaki Kawanabe,et al.  How to Explain Individual Classification Decisions , 2009, J. Mach. Learn. Res..

[50]  Hans-Peter Kriegel,et al.  Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data , 2009, PAKDD.