LOCI: fast outlier detection using the local correlation integral

Outlier detection is an integral part of data mining and has attracted much attention recently [M. Breunig et al., (2000)], [W. Jin et al., (2001)], [E. Knorr et al., (2000)]. We propose a new method for evaluating outlierness, which we call the local correlation integral (LOCI). As with the best previous methods, LOCI is highly effective for detecting outliers and groups of outliers (a.k.a. micro-clusters). In addition, it offers the following advantages and novelties: (a) It provides an automatic, data-dictated cutoff to determine whether a point is an outlier-in contrast, previous methods force users to pick cut-offs, without any hints as to what cut-off value is best for a given dataset. (b) It can provide a LOCI plot for each point; this plot summarizes a wealth of information about the data in the vicinity of the point, determining clusters, micro-clusters, their diameters and their inter-cluster distances. None of the existing outlier-detection methods can match this feature, because they output only a single number for each point: its outlierness score, (c) Our LOCI method can be computed as quickly as the best previous methods, (d) Moreover, LOCI leads to a practically linear approximate method, aLOCI (for approximate LOCI), which provides fast highly-accurate outlier detection. To the best of our knowledge, this is the first work to use approximate computations to speed up outlier detection. Experiments on synthetic and real world data sets show that LOCI and aLOCI can automatically detect outliers and micro-clusters, without user-required cut-offs, and that they quickly spot both expected and unexpected outliers.

[1]  R. Stephenson A and V , 1962, The British journal of ophthalmology.

[2]  T. Figiel,et al.  The dimension of almost spherical sections of convex bodies , 1976 .

[3]  Vic Barnett,et al.  Outliers in Statistical Data , 1980 .

[4]  Douglas M. Hawkins Identification of Outliers , 1980, Monographs on Applied Probability and Statistics.

[5]  Peter J. Rousseeuw,et al.  Robust Regression and Outlier Detection , 2005, Wiley Series in Probability and Statistics.

[6]  Marshall W. Bern,et al.  Approximate Closest-Point Queries in High Dimensions , 1993, Inf. Process. Lett..

[7]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching fixed dimensions , 1998, JACM.

[8]  Christos Faloutsos,et al.  Estimating the Selectivity of Spatial Queries Using the 'Correlation' Fractal Dimension , 1995, VLDB.

[9]  Prabhakar Raghavan,et al.  A Linear Method for Deviation Detection in Large Databases , 1996, KDD.

[10]  Raymond T. Ng,et al.  A Unified Notion of Outliers: Properties and Computation , 1997, KDD.

[11]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[12]  Theodore Johnson,et al.  Fast Computation of 2-Dimensional Depth Contours , 1998, KDD.

[13]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[14]  Raymond T. Ng,et al.  Finding Intensional Knowledge of Distance-Based Outliers , 1999, VLDB.

[15]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[16]  S. Muthukrishnan,et al.  Mining Deviants in a Time Series Database , 1999, VLDB.

[17]  Raymond T. Ng,et al.  Distance-based outliers: algorithms and applications , 2000, The VLDB Journal.

[18]  Rajeev Rastogi,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD 2000.

[19]  Ping Chen,et al.  Using the fractal dimension to cluster datasets , 2000, KDD '00.

[20]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD '00.

[21]  Anthony K. H. Tung,et al.  Mining top-n local outliers in large databases , 2001, KDD '01.

[22]  Philip S. Yu,et al.  Outlier detection for high dimensional data , 2001, SIGMOD '01.

[23]  Christos Faloutsos,et al.  Tri-plots: scalable tools for multidimensional data mining , 2001, KDD '01.

[24]  Ricardo A. Baeza-Yates,et al.  Searching in metric spaces , 2001, CSUR.