
Interpretable Anomaly Detection with Mondrian Pólya Forests on Data Streams

Anomaly detection at scale is an extremely challenging problem of great practical importance. When data is large and high-dimensional, it can be difficult to detect which observations do not fit the expected behaviour. Recent work has converged on variations of (random) $k$d-trees to summarise data for anomaly detection. However, these methods rely on ad hoc score functions that are not easy to interpret, making it difficult to assess the severity of the detected anomalies or to select a reasonable threshold in the absence of labelled anomalies. To solve these issues, we contextualise these methods in a probabilistic framework, which we call the Mondrian Pólya Forest, for estimating the underlying probability density function generating the data and enabling greater interpretability than prior work. In addition, we develop a memory-efficient variant able to operate in modern streaming environments. Our experiments show that these methods achieve state-of-the-art performance while providing statistically interpretable anomaly scores.

Tom Diethe | Eric Meissner | Charlie Dickens | Pablo G. Moreno
