Paper Information - Rapid Distance-Based Outlier Detection via Sampling

Distance-based approaches to outlier detection are popular in data mining because they do not require modeling the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis of why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search.
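The abstract does not spell out the sampling-based scheme, so the following is only a minimal sketch of one plausible reading: draw a small random subsample once, then score every point by its Euclidean distance to the nearest sampled point. The function name `sampling_outlier_scores`, the `sample_size` and `seed` parameters, the choice of metric, and the decision to exclude a point from its own comparison set are all assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def sampling_outlier_scores(points, sample_size, seed=0):
    """Score each point by its distance to the nearest point in a random sample.

    Assumptions (not from the source): the subsample is drawn once, without
    replacement; the metric is Euclidean; a sampled point is not compared
    with itself. Higher scores suggest stronger outliers.
    """
    if not 2 <= sample_size <= len(points):
        raise ValueError("sample_size must be between 2 and len(points)")
    rng = random.Random(seed)
    sample_idx = rng.sample(range(len(points)), sample_size)
    scores = []
    for i, p in enumerate(points):
        # Skip the point itself so sampled points do not trivially score 0.
        dists = [math.dist(p, points[j]) for j in sample_idx if j != i]
        scores.append(min(dists))
    return scores


# Toy data: a Gaussian cluster around the origin plus one planted outlier.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
data.append((20.0, 20.0))  # planted outlier, stored at index 200

scores = sampling_outlier_scores(data, sample_size=20)
top = max(range(len(data)), key=scores.__getitem__)  # index of the largest score
```

In this toy setting the planted point lies far from every cluster member, so it should receive the largest score. Each point is compared against only `sample_size` others, so the cost grows linearly with the number of points, whereas a naive k-nearest neighbor search compares all pairs.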

Karsten M. Borgwardt | Mahito Sugiyama
