Outlier Detection with Streaming Dyadic Decomposition

In this work we introduce a new algorithm for detecting outliers in streaming data in R^n. The basic idea is to compute a dyadic decomposition of the streaming data into cubes in R^n. A dyadic decomposition is obtained by recursively bisecting the cube in which the data lies; when it is computed in the streaming setting, we call it a streaming dyadic decomposition. If we view the streaming dyadic decomposition as a tree with a fixed, sufficiently large maximum depth, then outliers are naturally defined by cubes that contain a small number of points, counted either in the cube alone or in the cube together with its neighboring cubes. We discuss some properties of outlier detection with streaming dyadic decomposition and present experimental results on real and artificial data sets.
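The idea described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: the data lies in the unit cube [0, 1)^n, neighboring cubes are those whose index differs by at most one along each axis, and a point is flagged when the count at the deepest level is at or below a user-chosen threshold. The class name `StreamingDyadicCounts` and all of its methods and parameters are hypothetical.

```python
from collections import defaultdict
from itertools import product


class StreamingDyadicCounts:
    """Per-level point counts for dyadic cubes of [0, 1)^n (hypothetical helper)."""

    def __init__(self, dim, max_depth):
        self.dim = dim
        self.max_depth = max_depth
        # counts[level][cell] = number of points seen so far in that cube.
        self.counts = [defaultdict(int) for _ in range(max_depth + 1)]

    def cell(self, point, level):
        # Bisecting every axis `level` times gives 2**level intervals per axis.
        side = 1 << level
        return tuple(min(int(x * side), side - 1) for x in point)

    def insert(self, point):
        # One pass per point: update the cube containing it at every level.
        for level in range(self.max_depth + 1):
            self.counts[level][self.cell(point, level)] += 1

    def neighborhood_count(self, point, level):
        # Count points in the cube and its adjacent cubes (assumed definition).
        side = 1 << level
        center = self.cell(point, level)
        total = 0
        for offset in product((-1, 0, 1), repeat=self.dim):
            nb = tuple(c + o for c, o in zip(center, offset))
            if all(0 <= c < side for c in nb):
                total += self.counts[level].get(nb, 0)
        return total

    def is_outlier(self, point, threshold, use_neighbors=True):
        # Assumed rule: few points at the deepest level marks an outlier.
        level = self.max_depth
        if use_neighbors:
            n = self.neighborhood_count(point, level)
        else:
            n = self.counts[level].get(self.cell(point, level), 0)
        return n <= threshold
```

In this sketch each arriving point costs one counter update per level, so the work per point depends on the depth and dimension but not on the number of points seen so far. The neighborhood query visits 3^n cubes, which is only practical for small n.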
