Very Fast Outlier Detection in Large Multidimensional Data Sets

Outliers are objects that do not comply with the general behavior of the data. Applications such as exploration in science databases need fast interactive tools for outlier detection in data sets that have unknown distributions, are large in size, and are in high dimensional space. Existing algorithms for outlier detection are too slow for such applications. We present an algorithm based on an innovative use of k-d trees that doesn’t assume any probability model and is linear in the number of objects and in the number of dimensions. We also provide experimental results that show that this is indeed a practical solution to the above problem.

[1]  Prabhakar Raghavan,et al.  A Linear Method for Deviation Detection in Large Databases , 1996, KDD.

[2]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[3]  Hanan Samet,et al.  The Design and Analysis of Spatial Data Structures , 1989 .

[4]  Andrew W. Moore,et al.  Very Fast EM-Based Mixture Model Clustering Using Multiresolution Kd-Trees , 1998, NIPS.

[5]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[6]  Douglas M. Hawkins Identification of Outliers , 1980, Monographs on Applied Probability and Statistics.

[7]  Philip S. Yu,et al.  Outlier detection for high dimensional data , 2001, SIGMOD '01.

[8]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[9]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD '00.

[10]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[11]  Vic Barnett,et al.  Outliers in Statistical Data , 1980 .

[12]  Jon Louis Bentley,et al.  Multidimensional binary search trees used for associative searching , 1975, CACM.

[13]  Andrew W. Moore,et al.  Repairing Faulty Mixture Models using Density Estimation , 2001, ICML.

[14]  Sridhar Ramaswamy,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD '00.

[15]  Hans-Peter Kriegel,et al.  A Database Interface for Clustering in Large Spatial Databases , 1995, KDD.

[16]  Alexander S. Szalay,et al.  Applied spatial data structures for large data sets , 2003 .