Revisiting randomized choices in isolation forests

Isolation forest, or "iForest", is an intuitive and widely used algorithm for anomaly detection that follows a simple yet effective idea: in a given data distribution, if a threshold (split point) is selected uniformly at random within the range of some variable and data points are divided according to whether they are greater or smaller than this threshold, outlier points are more likely to end up alone or in the smaller partition. The original procedure suggested choosing both the variable to split and the split point within that variable uniformly at random at each step, but this paper shows that "clustered" diverse outliers, oftentimes a more interesting class of outliers than others, can be more easily identified by applying a non-uniformly-random choice of variables and/or thresholds. Different split-guiding criteria are compared, and some are found to result in significantly better outlier discrimination for certain classes of outliers.
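To make the splitting idea concrete, the following is a minimal sketch of the baseline uniform-random isolation split described above; it is not the paper's implementation, and the toy data and the helper `isolation_depth` are illustrative assumptions. The point it demonstrates is that a far-away observation tends to become isolated after far fewer random splits than a point inside the dense cluster.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: a dense cluster near 0 plus one far-away outlier.
X = np.concatenate([rng.normal(loc=0.0, scale=1.0, size=100), [10.0]])

def isolation_depth(x, sample, max_depth=30):
    """Apply uniform-random splits (as in the original iForest) and return
    the depth at which the point x ends up alone in its partition."""
    depth = 0
    while len(sample) > 1 and depth < max_depth:
        lo, hi = sample.min(), sample.max()
        if lo == hi:
            break
        split = rng.uniform(lo, hi)  # threshold chosen uniformly at random
        # Keep only the side of the split that contains x.
        sample = sample[sample <= split] if x <= split else sample[sample > split]
        depth += 1
    return depth

# Averaged over many random "trees", the outlier isolates much sooner.
inlier, outlier = X[0], X[-1]
print("avg depth inlier :", np.mean([isolation_depth(inlier,  X) for _ in range(200)]))
print("avg depth outlier:", np.mean([isolation_depth(outlier, X) for _ in range(200)]))
```

The paper's contribution concerns replacing the uniform draws in a sketch like this with guided (non-uniform) choices of variable and threshold; the snippet only illustrates the baseline behavior those criteria are compared against.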
