CURIO: A Fast Outlier Clustering Algorithm for Large Datasets

Outlier (or anomaly) detection is an important problem in many domains, including fraud detection, risk analysis, network intrusion and medical diagnosis, and the discovery of significant outliers is becoming an integral part of data mining. This paper presents CURIO, a novel algorithm that uses quantisation and implied distance metrics to detect outliers in time linear in the dataset size, requiring only two sequential scans of disk-resident datasets. CURIO includes a novel direct quantisation technique and explicitly discovers outlier clusters. Moreover, a major attribute of CURIO is its simplicity and economy with respect to algorithm, memory footprint and data structures.
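The abstract does not spell out the algorithm's steps, so the following is only an illustrative sketch of the general idea of quantisation-based outlier detection in two sequential scans. It is not CURIO itself. The function names (`quantise`, `two_pass_outliers`), the parameters `cell_width` and `threshold`, and the rule of summing counts over adjacent cells are all assumptions introduced here for illustration.

```python
from collections import Counter
from itertools import product


def quantise(point, cell_width):
    """Map a numeric point to the integer index of its grid cell (assumed scheme)."""
    return tuple(int(x // cell_width) for x in point)


def two_pass_outliers(read_points, cell_width, threshold):
    """Return points lying in sparsely populated regions of a uniform grid.

    read_points is a hypothetical callable that returns a fresh iterator over
    the dataset each time it is called, standing in for one sequential scan.
    """
    # Scan 1: count how many points fall into each occupied cell.
    counts = Counter(quantise(p, cell_width) for p in read_points())

    def neighbourhood_count(cell):
        # Sum the counts of the cell and its adjacent cells, so that distance
        # is implied by cell adjacency rather than computed between points.
        # This visits 3**d cells and is only practical for small d.
        offsets = product((-1, 0, 1), repeat=len(cell))
        return sum(
            counts.get(tuple(c + o for c, o in zip(cell, off)), 0)
            for off in offsets
        )

    # Cells whose neighbourhood holds at most `threshold` points are treated
    # as outlier cells (an assumed rule, chosen for simplicity).
    sparse = {cell for cell in counts if neighbourhood_count(cell) <= threshold}

    # Scan 2: report every point that falls in an outlier cell.
    return [p for p in read_points() if quantise(p, cell_width) in sparse]


data = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (0.4, 0.4), (9.5, 9.5)]
outliers = two_pass_outliers(lambda: iter(data), cell_width=1.0, threshold=1)
```

In this sketch each scan touches every point once, so for a fixed dimensionality the work grows linearly with the number of points, and memory use depends on the number of occupied cells rather than on the dataset size.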
