Explaining anomalies with Sapling Random Forests

The main objective of anomaly (outlier) detection algorithms is to find samples that deviate from the majority. Although a vast number of such algorithms already exist, almost none of them explain why a particular sample was labelled as an anomaly. To address this issue, we propose an algorithm called Explainer, which returns an explanation of how the sample differs from the rest in disjunctive normal form (DNF), a form that is easy for humans to understand. Since Explainer treats anomaly detection algorithms as black boxes, it can be applied in many domains to simplify the investigation of anomalies. The core of Explainer is a set of specifically trained trees, which we call sapling random forests. Since their training is fast and memory efficient, the whole algorithm is lightweight and applicable to large databases, data streams, and real-time problems. The correctness of Explainer is demonstrated on a wide range of synthetic and real-world datasets.
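
To make the idea of a DNF explanation concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each "sapling" is a shallow tree grown only along the branch containing the anomaly, trained on the anomaly against a random subset of normal samples, and that each such branch yields one conjunction of the DNF. The split criterion (keep the split that removes the most normal samples from the anomaly's side) and the names `grow_sapling`, `explain`, `format_dnf`, `n_trees`, `sample_size`, and `max_depth` are illustrative assumptions that do not come from the abstract.

```python
import random


def grow_sapling(anomaly, normals, max_depth=3):
    """Grow only the branch that contains `anomaly` (hypothetical helper).

    Returns one conjunction as a list of (feature_index, op, threshold).
    Assumed criterion: pick the split that leaves the fewest normal
    samples on the anomaly's side.
    """
    rule = []
    remaining = list(normals)
    for _ in range(max_depth):
        if not remaining:
            break  # the anomaly is already isolated
        best = None
        for j in range(len(anomaly)):
            values = sorted({s[j] for s in remaining} | {anomaly[j]})
            for lo, hi in zip(values, values[1:]):
                t = (lo + hi) / 2
                on_left = anomaly[j] <= t
                kept = [s for s in remaining if (s[j] <= t) == on_left]
                if best is None or len(kept) < len(best[1]):
                    best = ((j, "<=" if on_left else ">", t), kept)
        if best is None or len(best[1]) == len(remaining):
            break  # no split separates anything further
        rule.append(best[0])
        remaining = best[1]
    return rule


def explain(anomaly, normals, n_trees=10, sample_size=20, seed=0):
    """Collect one conjunction per sapling; their disjunction is the DNF."""
    rng = random.Random(seed)
    rules = set()
    for _ in range(n_trees):
        subset = rng.sample(normals, min(sample_size, len(normals)))
        rule = grow_sapling(anomaly, subset)
        if rule:
            rules.add(tuple(rule))
    return sorted(rules)


def format_dnf(rules):
    """Render the rules as a human-readable DNF string."""
    return " OR ".join(
        "(" + " AND ".join(f"x{j} {op} {t:g}" for j, op, t in rule) + ")"
        for rule in rules
    )


# Usage: normal samples lie in [-1, 1]^2, the anomaly is far out along x0.
data_rng = random.Random(1)
normal_samples = [(data_rng.uniform(-1, 1), data_rng.uniform(-1, 1))
                  for _ in range(100)]
print(format_dnf(explain((10.0, 0.0), normal_samples)))
```

Because each sapling sees only a small subset of the data and follows a single branch, its cost is small, which is consistent with the abstract's claim that training is fast and memory efficient. Note that the sketch takes the anomaly label as given, mirroring the black-box treatment of the detector.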
