Continuous Adaptive Outlier Detection on Distributed Data Streams

In many applications, stream data are too voluminous to be collected in a central fashion and often transmitted on a distributed network. In this paper, we focus on the outlier detection over distributed data streams in real time, firstly, we formalize the problem of outlier detection using the kernel density estimation technique. Then, we adopt the fading strategy to keep pace with the transient and evolving natures of stream data, and mico-cluster technique to conquer the data partition and "one-pass" scan. Furthermore, our extensive experiments with synthetic and real data show that the proposed algorithm is efficient and effective compared with existing outlier detection algorithms, and more suitable for data streams.

[1]  D. W. Scott,et al.  Multivariate Density Estimation, Theory, Practice and Visualization , 1992 .

[2]  Irène Gijbels,et al.  Automatic Forecasting via Exponential Smoothing Asymptotic Properties , 1997 .

[3]  Christopher Olston,et al.  Finding (recently) frequent items in distributed data streams , 2005, 21st International Conference on Data Engineering (ICDE'05).

[4]  Dimitrios Gunopulos,et al.  Online outlier detection in sensor data using non-parametric models , 2006, VLDB.

[5]  Yufei Tao,et al.  Mining distance-based outliers from large databases in any metric space , 2006, KDD '06.

[6]  Graham Cormode,et al.  Conquering the Divide: Continuous Clustering of Distributed Data Streams , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[7]  Sridhar Ramaswamy,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD '00.

[8]  Hans-Peter Kriegel,et al.  A Database Interface for Clustering in Large Spatial Databases , 1995, KDD.

[9]  Anthony K. H. Tung,et al.  Ranking Outliers Using Symmetric Neighborhood Relationship , 2006, PAKDD.

[10]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD '00.

[11]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[12]  Douglas M. Hawkins Identification of Outliers , 1980, Monographs on Applied Probability and Statistics.

[13]  Eleazar Eskin,et al.  Anomaly Detection over Noisy Data using Learned Probability Distributions , 2000, ICML.

[14]  Srinivasan Parthasarathy,et al.  Fast mining of distance-based outliers in high-dimensional datasets , 2008, Data Mining and Knowledge Discovery.

[15]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[16]  Chris Jermaine,et al.  Outlier detection by sampling with accuracy guarantees , 2006, KDD '06.

[17]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[18]  Vic Barnett,et al.  Outliers in Statistical Data , 1980 .

[19]  Vipin Kumar,et al.  Feature bagging for outlier detection , 2005, KDD '05.

[20]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[21]  Aidong Zhang,et al.  WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases , 1998, VLDB.

[22]  Stephen D. Bay,et al.  Mining distance-based outliers in near linear time with randomization and a simple pruning rule , 2003, KDD '03.

[23]  Raymond T. Ng,et al.  Finding Intensional Knowledge of Distance-Based Outliers , 1999, VLDB.

[24]  Christopher Olston,et al.  Distributed top-k monitoring , 2003, SIGMOD '03.

[25]  Christos Faloutsos,et al.  LOCI: fast outlier detection using the local correlation integral , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).