Generalized Isolation Forest: Some Theory and More Applications (Extended Abstract)
Isolation Forest is a popular outlier detection algorithm that isolates outlier observations from regular observations by building multiple random decision trees. Several extensions enhance the original Isolation Forest algorithm, including the Extended Isolation Forest, which allows for non-rectangular splits, and SCiForest, which improves the fitting of individual trees. All of these approaches rate the outlierness of an observation by its average path length. However, there is little theoretical explanation of why these isolation-based algorithms perform so well in practice. In this paper, we present a theoretical framework that describes the effectiveness of isolation-based approaches from a distributional viewpoint. We show that these algorithms fit a mixture of distributions, where the average path length of an observation can be viewed as a (somewhat crude) approximation of the mixture coefficient. Using this framework, we derive the Generalized Isolation Forest (GIF), which also trains random trees but combines them in a way that moves beyond the average path length. In an extensive evaluation of over 350,000 experiments, we show that GIF outperforms the other methods on a variety of datasets while having comparable runtime.
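To make the baseline concrete, the following is a minimal sketch of average-path-length scoring in the style of the original Isolation Forest [4]. It is not the GIF method proposed in this paper, which combines the trees differently. All function names, the toy data, and the defaults (100 trees, subsample size 64) are assumptions chosen for illustration; the normalizer c(n) and the score 2^(-E[h(x)] / c(n)) follow the original Isolation Forest formulation.

```python
import math
import random

EULER_GAMMA = 0.5772156649  # used to approximate the harmonic number H(n-1)


def c_factor(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Split on a random feature at a random threshold until isolated."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}  # leaf
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}  # cannot split identical values
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(x, node):
    """Depth at which x lands, plus a correction for unsplit leaf points."""
    depth = 0
    while "size" not in node:
        node = node["left"] if x[node["dim"]] < node["split"] else node["right"]
        depth += 1
    return depth + c_factor(node["size"])


def fit_forest(data, n_trees=100, sample_size=64, seed=0):
    """Train n_trees random trees, each on a random subsample of the data."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    if sample_size < 2:
        raise ValueError("need at least two observations")
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    return trees, sample_size


def anomaly_score(x, forest):
    """Score in (0, 1); shorter average paths give scores closer to 1."""
    trees, sample_size = forest
    avg = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-avg / c_factor(sample_size))


# Toy data (hypothetical): a Gaussian cluster plus one distant point.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
data.append((8.0, 8.0))

forest = fit_forest(data)
inlier_score = anomaly_score((0.0, 0.0), forest)
outlier_score = anomaly_score((8.0, 8.0), forest)
```

In this sketch the distant point is separated from the cluster after only a few random splits, so its average path is shorter and its score is higher than that of a point at the center of the cluster. The per-tree path lengths are reduced to a single mean here, which is the aggregation step the abstract describes as a crude approximation of the mixture coefficient.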
[1] Zhi-Hua Zhou, et al. On Detecting Clustered Anomalies Using SCiForest, 2010, ECML/PKDD.
[2] Arthur Zimek, et al. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study, 2016, Data Mining and Knowledge Discovery.
[3] Robert J. Brunner, et al. Extended Isolation Forest, 2018, IEEE Transactions on Knowledge and Data Engineering.
[4] Zhi-Hua Zhou, et al. Isolation Forest, 2008, 2008 Eighth IEEE International Conference on Data Mining.