Randomized outlier detection with trees

Isolation forest (IF) is a popular outlier detection algorithm that isolates outlier observations from regular observations by building multiple random isolation trees. The average number of comparisons required to isolate a given observation can then be used as a measure of its outlierness. Multiple extensions of this approach have been proposed in the literature, including the extended isolation forest (EIF) and SCiForest. However, a theoretical explanation of why IF, EIF, and SCiForest perform so well in practice is lacking. In this paper, we present a theoretical framework that describes these approaches from a distributional viewpoint. Using this viewpoint, we show that isolation-based approaches first accurately approximate the data distribution and then approximate the coefficients of mixture components using the average path length. Using this framework, we derive the generalized isolation forest (GIF), which also trains random isolation trees but combines them in a way that moves beyond the average path length. That is, GIF splits the data into multiple sub-spaces by sampling random splits, as the original IF variants do, and directly estimates the mixture coefficients of a mixture distribution to score the outlierness of entire regions of data. In an extensive evaluation, we compare GIF with 18 state-of-the-art outlier detection methods on 14 different datasets. We show that GIF outperforms three competing tree-based methods and performs competitively with nearest-neighbor approaches while having a lower runtime. Finally, we highlight a use-case study that uses GIF to detect transaction fraud in financial data.
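To make the path-length idea concrete, the following is a minimal sketch of the original isolation forest scoring only; it is not the GIF method described above. It assumes axis-parallel uniform random splits and the commonly used normalization score = 2^(-E[h(x)] / c(n)), where c(n) is the average path length of an unsuccessful binary search tree lookup. All function names (`c_factor`, `build_tree`, `fit_forest`, `anomaly_score`) and default parameters are illustrative choices, not taken from the paper.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split points with a random dimension and threshold."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split on a constant feature
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(x, node, depth=0):
    """Number of comparisons to reach a leaf, plus a correction for leaf size."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)


def fit_forest(data, n_trees=100, sample_size=256, seed=0):
    """Train n_trees isolation trees on random subsamples (needs >= 2 points)."""
    rng = random.Random(seed)
    n = min(sample_size, len(data))
    max_depth = max(1, math.ceil(math.log2(n)))
    forest = [build_tree(rng.sample(data, n), 0, max_depth, rng)
              for _ in range(n_trees)]
    return forest, n


def anomaly_score(x, forest, n):
    """Score in (0, 1]; values closer to 1 indicate shorter average paths."""
    mean_depth = sum(path_length(x, tree) for tree in forest) / len(forest)
    return 2.0 ** (-mean_depth / c_factor(n))


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic data: a Gaussian cluster plus one far-away point.
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    data.append((8.0, 8.0))
    forest, n = fit_forest(data)
    print("far point:", anomaly_score((8.0, 8.0), forest, n))
    print("center   :", anomaly_score((0.0, 0.0), forest, n))
```

On this synthetic data the far-away point should receive a noticeably higher score than a point at the cluster center, because random splits separate it from the rest after only a few comparisons.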
