A framework for estimating complex probability density structures in data streams

Probability density function estimation is a fundamental component of several stream mining tasks, such as outlier detection and classification. The nonparametric adaptive kernel density estimate (AKDE) provides a robust and asymptotically consistent estimate for an arbitrary distribution. However, its extensive computational requirements make it difficult to apply in a stream environment. This paper addresses the problem of developing an efficient and asymptotically consistent AKDE over data streams while respecting the stringent constraints imposed by the stream environment. We propose the concept of local regions to effectively synopsize local density features, design a suite of algorithms to maintain the AKDE under a time-based sliding window, and analyze the asymptotic consistency and computational cost of the resulting estimates. In addition, we conduct extensive experiments on real-world and synthetic data sets to demonstrate the effectiveness and efficiency of our approach.
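
To make the computational concern concrete, a naive AKDE over a time-based sliding window might look like the sketch below. It is an illustration only and not the local-region method proposed in the paper. It assumes a Gaussian kernel, a two-stage adaptive bandwidth (a fixed-bandwidth pilot estimate followed by per-point bandwidths scaled by the pilot density raised to the power -alpha, with alpha = 0.5), and brute-force recomputation on every query. All names here (SlidingWindowAKDE, adaptive_kde, h, alpha, window) are hypothetical.

```python
import math
from collections import deque


def gaussian(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def fixed_kde(x, points, h):
    """Fixed-bandwidth kernel density estimate at x (used as the pilot)."""
    return sum(gaussian((x - p) / h) for p in points) / (len(points) * h)


def adaptive_kde(x, points, h, alpha=0.5):
    """Two-stage adaptive KDE at x (assumed variant, not the paper's)."""
    # Stage 1: pilot density at every sample point, O(n^2) kernel evaluations.
    pilot = [fixed_kde(p, points, h) for p in points]
    # Geometric mean of the pilot values normalizes the bandwidth factors.
    g = math.exp(sum(math.log(f) for f in pilot) / len(pilot))
    # Stage 2: each point gets a bandwidth that widens in sparse regions.
    total = 0.0
    for p, f in zip(points, pilot):
        h_i = h * (f / g) ** (-alpha)
        total += gaussian((x - p) / h_i) / h_i
    return total / len(points)


class SlidingWindowAKDE:
    """Keeps the raw points of the last `window` time units and recomputes."""

    def __init__(self, window, h, alpha=0.5):
        self.window = window
        self.h = h
        self.alpha = alpha
        self.buffer = deque()  # (timestamp, value) pairs in arrival order

    def insert(self, timestamp, value):
        self.buffer.append((timestamp, value))
        # Drop points whose timestamp has fallen out of the window.
        while self.buffer and self.buffer[0][0] <= timestamp - self.window:
            self.buffer.popleft()

    def density(self, x):
        points = [v for _, v in self.buffer]
        if not points:
            return 0.0
        return adaptive_kde(x, points, self.h, self.alpha)
```

In this sketch every query stores all points in the window and spends a number of kernel evaluations quadratic in the window size, which is the kind of cost that a compact synopsis of local density features is meant to avoid.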
