Mining distance-based outliers from large databases in any metric space

Let R be a set of objects. An object o ∈ R is an outlier if there exist fewer than k objects in R whose distances to o are at most r. The values of k, r, and the distance metric are provided by the user at run time. The objective is to return all outliers with the smallest I/O cost. This paper considers a generic version of the problem, where no information is available for outlier computation except for the objects' mutual distances. We prove an upper bound on the memory consumption that permits the discovery of all outliers by scanning the dataset three times. The upper bound turns out to be extremely low in practice, e.g., less than 1% of R. Since the actual memory capacity of a realistic DBMS is typically larger, we develop a novel algorithm that integrates our theoretical findings with carefully designed heuristics, which leverage the additional memory to improve I/O efficiency. Our technique reports all outliers by scanning the dataset at most twice (in some cases, even once), and outperforms the existing solutions by up to an order of magnitude.
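To make the outlier definition concrete, here is a minimal in-memory sketch in Python. It is a naive nested-loop check of the definition only, not the I/O-efficient algorithm the paper proposes. The function name `naive_outliers` and its parameters are hypothetical, and the sketch assumes that o itself counts among the objects within distance r of o (a literal reading of the definition, since o ∈ R and its distance to itself is 0).

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def naive_outliers(
    objects: Sequence[T],
    k: int,
    r: float,
    dist: Callable[[T, T], float],
) -> List[T]:
    """Return every object with fewer than k objects within distance r.

    Assumption: the object itself is counted as one of its own neighbors,
    because it belongs to the set and dist(o, o) == 0 <= r.
    Only pairwise distances are used, so any metric can be plugged in.
    """
    outliers: List[T] = []
    for o in objects:
        neighbors = 0
        for p in objects:
            if dist(o, p) <= r:
                neighbors += 1
                if neighbors >= k:
                    # o already has k neighbors, so it cannot be an outlier.
                    break
        if neighbors < k:
            outliers.append(o)
    return outliers


# Usage: one-dimensional points with absolute difference as the metric.
points = [0.0, 0.1, 0.2, 10.0]
result = naive_outliers(points, k=2, r=0.5, dist=lambda a, b: abs(a - b))
```

With k = 2 and r = 0.5, the point 10.0 has only itself within distance r, so it is the only outlier in this toy dataset. The nested loop performs a quadratic number of distance computations and assumes the whole dataset fits in memory, which is exactly the cost that a scan-bounded approach such as the one described in the abstract aims to avoid.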
