Interpretable Anomaly Detection with Mondrian Pólya Forests on Data Streams

Anomaly detection at scale is a challenging problem of great practical importance. When data is large and high-dimensional, it can be difficult to detect which observations do not fit the expected behaviour. Recent work has coalesced around variations of (random) $k$d-trees to summarise data for anomaly detection. However, these methods rely on ad-hoc score functions that are not easy to interpret, making it difficult to assess the severity of detected anomalies or to select a reasonable threshold in the absence of labelled anomalies. To address these issues, we contextualise these methods in a probabilistic framework, which we call the Mondrian Pólya Forest, for estimating the underlying probability density function generating the data; this enables greater interpretability than prior work. In addition, we develop a memory-efficient variant able to operate in modern streaming environments. Our experiments show that these methods achieve state-of-the-art performance while providing statistically interpretable anomaly scores.
