LOF: identifying density-based local outliers

For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or the outliers, can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful, but can otherwise not be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms we show that our approach of finding local outliers can be practical.

[1]  W. R. Buckland,et al.  Outliers in Statistical Data , 1979 .

[2]  Michael Ian Shamos,et al.  Computational geometry: an introduction , 1985 .

[3]  M. Braga,et al.  Exploratory Data Analysis , 2018, Encyclopedia of Social Network Analysis and Mining. 2nd Ed..

[4]  A. Madansky Identification of Outliers , 1988 .

[5]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[6]  Prabhakar Raghavan,et al.  A Linear Method for Deviation Detection in Large Databases , 1996, KDD.

[7]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[8]  P. Rousseeuw,et al.  Computing depth contours of bivariate point clouds , 1996 .

[9]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[10]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[11]  Padhraic Smyth,et al.  Knowledge Discovery and Data Mining: Towards a Unifying Framework , 1996, KDD.

[12]  Jiong Yang,et al.  STING: A Statistical Information Grid Approach to Spatial Data Mining , 1997, VLDB.

[13]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[14]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[15]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[16]  Daniel A. Keim,et al.  An Efficient Approach to Clustering in Large Multimedia Databases with Noise , 1998, KDD.

[17]  Theodore Johnson,et al.  Fast Computation of 2-Dimensional Depth Contours , 1998, KDD.

[18]  Aidong Zhang,et al.  WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases , 1998, VLDB.

[19]  William DuMouchel,et al.  A Fast Computer Intrusion Detection Algorithm Based on Hypothesis Testing of Command Transition Probabilities , 1998, KDD.

[20]  Raymond T. Ng,et al.  Finding Intensional Knowledge of Distance-Based Outliers , 1999, VLDB.

[21]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[22]  Sridhar Ramaswamy,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD '00.