A Framework for Exploiting Local Information to Enhance Density Estimation of Data Streams

The Probability Density Function (PDF) is the fundamental data model for a variety of stream mining algorithms. Existing works apply the standard nonparametric Kernel Density Estimator (KDE) to approximate the PDF of data streams. As a result, stream-based KDEs cannot accurately capture complex local density features. In this article, we propose the use of Local Regions (LRs) to model local density information in univariate data streams. In-depth theoretical analyses are presented to justify the effectiveness of the LR-based KDE. Based on these analyses, we develop the General Local rEgion AlgorithM (GLEAM) to enhance the estimation quality of structurally complex univariate distributions for existing stream-based KDEs. A set of algorithmic optimizations is designed to improve the query throughput of GLEAM and to achieve linear-order computation. Additionally, a comprehensive suite of experiments is conducted to test the effectiveness and efficiency of GLEAM.
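To make the limitation described above concrete, the following is a minimal sketch of a standard fixed-bandwidth Gaussian KDE on a small univariate sample. It is not GLEAM and does not use Local Regions. The function names (`gaussian_kde`, `integrate`), the synthetic sample, and the two bandwidth values are assumptions chosen only for illustration. The sample mixes a tight cluster with a broad one, so a single bandwidth cannot suit both: a wide bandwidth flattens the narrow peak, while a narrow one resolves it.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function x -> fixed-bandwidth Gaussian KDE estimate.

    Illustrative helper (hypothetical name); one global bandwidth is
    shared by every kernel, as in the standard estimator.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def integrate(f, lo, hi, steps=2000):
    """Trapezoidal rule, used here only to check that the estimate is a density."""
    step = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * step)
    return total * step


# Assumed synthetic data: a tight cluster near 0 and a broad cluster near 5.
narrow = [i * 0.01 for i in range(-10, 11)]
wide = [5.0 + i * 0.2 for i in range(-10, 11)]
samples = narrow + wide

coarse = gaussian_kde(samples, bandwidth=1.0)   # suits the broad cluster
fine = gaussian_kde(samples, bandwidth=0.05)    # suits the tight cluster

if __name__ == "__main__":
    print("estimate at the narrow peak, h=1.0 :", coarse(0.0))
    print("estimate at the narrow peak, h=0.05:", fine(0.0))
    print("total mass, h=1.0:", integrate(coarse, -10.0, 15.0))
```

With the wide bandwidth, the estimate at the narrow peak is much lower than with the narrow bandwidth, which is the kind of local feature a single global setting tends to miss.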
