Fast computation of approximate biased histograms on sliding windows over data streams

Histograms provide effective synopses of large data sets, and are thus used in a wide variety of applications, including query optimization, approximate query answering, distribution fitting, parallel database partitioning, and data mining. Moreover, very fast approximate algorithms are needed to compute accurate histograms on fast-arriving data streams, whereby online queries can be supported within the given memory and computing resources. Many real-life applications require that the data distribution in certain regions must be modeled with greater accuracy, and Biased Histograms are designed to address this need. In this paper, we define biased histograms over data streams and sliding windows on data streams, and propose the Bar Splitting Biased Histogram (BSBH) algorithm to construct them efficiently and accurately. We prove that BSBH generates expected ∈-approximate biased histograms for data streams with stationary distributions, and our experiments show that BSBH also achieves good approximation in the presence of concept shifts, even major ones. Additionally, BSBH employs a new biased sampling technique which outperforms uniform sampling in terms of accuracy, while using about the same amount of time and memory. Therefore, BSBH outperforms previously proposed algorithms for computing biased histograms over the whole data stream, and it is the first algorithm that supports windows.

[1]  Divesh Srivastava,et al.  Space- and time-efficient deterministic algorithms for biased quantiles over data streams , 2006, PODS.

[2]  J. Ian Munro,et al.  Selection and sorting with limited storage , 1978, 19th Annual Symposium on Foundations of Computer Science (sfcs 1978).

[3]  Sanjeev Khanna,et al.  Space-efficient online computation of quantile summaries , 2001, SIGMOD '01.

[4]  Peter J. Haas,et al.  Improved histograms for selectivity estimation of range predicates , 1996, SIGMOD '96.

[5]  Piotr Indyk,et al.  Maintaining Stream Statistics over Sliding Windows , 2002, SIAM J. Comput..

[6]  Qi Zhang,et al.  An efficient algorithm for approximate biased quantile computation in data streams , 2007, CIKM '07.

[7]  Yossi Matias,et al.  Fast incremental maintenance of approximate histograms , 1997, TODS.

[8]  Jian Xu,et al.  Space-efficient Relative Error Order Sketch over Data Streams , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[9]  S. Muthukrishnan,et al.  How to Summarize the Universe: Dynamic Maintenance of Quantiles , 2002, VLDB.

[10]  Divyakant Agrawal,et al.  Medians and beyond: new aggregation techniques for sensor networks , 2004, SenSys '04.

[11]  Gregory Piatetsky-Shapiro,et al.  Accurate estimation of the number of tuples satisfying a condition , 1984, SIGMOD '84.

[12]  Divesh Srivastava,et al.  Effective computation of biased quantiles over data streams , 2005, 21st International Conference on Data Engineering (ICDE'05).

[13]  Michael B. Greenwald,et al.  Practical Algorithms for Self Scaling Histograms or Better than Average Data Collection , 1996, Perform. Evaluation.

[14]  Yannis E. Ioannidis,et al.  The History of Histograms (abridged) , 2003, VLDB.

[15]  David J. DeWitt,et al.  Equi-depth multidimensional histograms , 1988, SIGMOD '88.

[16]  Gurmeet Singh Manku,et al.  Approximate counts and quantiles over sliding windows , 2004, PODS.

[17]  Hongjun Lu,et al.  Continuously maintaining quantile summaries of the most recent N elements over a data stream , 2004, Proceedings. 20th International Conference on Data Engineering.

[18]  Qi Zhang,et al.  A Fast Algorithm for Approximate Quantiles in High Speed Data Streams , 2007, 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007).

[19]  Bruce G. Lindsay,et al.  Random sampling techniques for space efficient online computation of order statistics of large datasets , 1999, SIGMOD '99.

[20]  Bruce G. Lindsay,et al.  Approximate medians and other quantiles in one pass and with limited memory , 1998, SIGMOD '98.

[21]  Carlo Zaniolo,et al.  Fast and accurate computation of equi-depth histograms over data streams , 2011, EDBT/ICDT '11.