Fast Outlier Detection in High Dimensional Spaces

In this paper we propose a new definition of distance-based outlier that considers for each point the sum of the distances from its k nearest neighbors, called weight. Outliers are those points having the largest values of weight. In order to compute these weights, we find the k nearest neighbors of each point in a fast and efficient way by linearizing the search space through the Hilbert space filling curve. The algorithm consists of two phases, the first provides an approximated solution, within a small factor, after executing at most d + 1 scans of the data set with a low time complexity cost, where d is the number of dimensions of the data set. During each scan the number of points candidate to belong to the solution set is sensibly reduced. The second phase returns the exact solution by doing a single scan which examines further a little fraction of the data set. Experimental results show that the algorithm always finds the exact solution during the first phase after d ? d + 1 steps and it scales linearly both in the dimensionality and the size of the data set.

[1]  Graham J. Williams,et al.  On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms , 2000, KDD '00.

[2]  Zbigniew R. Struzik,et al.  Outlier detection and localisation with wavelet based multifractal formalism , 2000 .

[3]  Prabhakar Raghavan,et al.  A Linear Method for Deviation Detection in Large Databases , 1996, KDD.

[4]  Jiawei Han,et al.  Data Mining: Concepts and Techniques , 2000 .

[5]  Rajeev Rastogi,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD 2000.

[6]  Vic Barnett,et al.  Outliers in Statistical Data , 1980 .

[7]  Carla E. Brodley,et al.  Identifying and Eliminating Mislabeled Training Instances , 1996, AAAI/IAAI, Vol. 1.

[8]  Christos Faloutsos,et al.  Fractals for secondary key retrieval , 1989, PODS.

[9]  H. Sagan Space-filling curves , 1994 .

[10]  Mario A. López,et al.  Finding k-Closest-Pairs Efficiently for High Dimensional Data , 2000, CCCG.

[11]  Philip S. Yu,et al.  Outlier detection for high dimensional data , 2001, SIGMOD '01.

[12]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[13]  Nimrod Megiddo,et al.  Fast indexing method for multidimensional nearest-neighbor search , 1998, Electronic Imaging.

[14]  Kenji Yamanishi,et al.  Discovering outlier filtering rules from unlabeled data , 2001, KDD 2001.

[15]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD 2000.

[16]  Kenji Yamanishi,et al.  Discovering outlier filtering rules from unlabeled data: combining a supervised learner with an unsupervised learner , 2001, KDD '01.

[17]  H. V. Jagadish,et al.  Linear clustering of objects with multiple attributes , 1990, SIGMOD '90.

[18]  Aidong Zhang,et al.  FindOut: Finding Outliers in Very Large Datasets , 2002, Knowledge and Information Systems.