RHist: adaptive summarization over continuous data streams

Maintaining approximate aggregates and summaries over data streams is crucial for handling the OLAP query workload that arises in applications such as network monitoring and telecommunications. Furthermore, since the entire data set is not available at all times, the maintenance task must be done incrementally. We show that R(elaxed)Hist(ogram) is an appropriate summarization in the data stream scenario. To reduce query estimation errors, we propose adaptive approaches that not only capture the data distribution but also integrate independent query patterns. We introduce a workload decay model that efficiently captures global workload information and ensures that query patterns from the recent past are weighted more heavily than queries further in the past. We verify experimentally that our approach successfully adapts to continuously changing workloads as well as to changing data streams.
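The abstract does not specify the form of the workload decay model. As a rough illustration only, the sketch below assumes an exponential decay with a configurable half-life and keeps one decayed query weight per histogram bucket. The class name `DecayedWorkload`, the `half_life` parameter, and the bucket-index representation of a range query are all hypothetical and are not taken from the paper.

```python
import math


class DecayedWorkload:
    """Per-bucket query weights in which older queries count exponentially less.

    Illustrative assumption: exponential decay with a fixed half-life.
    The paper's actual decay model may differ.
    """

    def __init__(self, num_buckets, half_life):
        self.weights = [0.0] * num_buckets
        # Rate chosen so that a weight halves after `half_life` time units.
        self.rate = math.log(2) / half_life
        self.last_time = 0.0

    def _age_to(self, now):
        """Scale all weights down to reflect the time elapsed since the last update."""
        if now < self.last_time:
            raise ValueError("timestamps must be non-decreasing")
        factor = math.exp(-self.rate * (now - self.last_time))
        self.weights = [w * factor for w in self.weights]
        self.last_time = now

    def record_query(self, lo, hi, now):
        """Register a range query touching buckets lo..hi (inclusive) at time `now`."""
        self._age_to(now)
        for b in range(lo, hi + 1):
            self.weights[b] += 1.0


# Usage: a query at time 0 counts half as much as a query one half-life later.
wl = DecayedWorkload(num_buckets=4, half_life=10.0)
wl.record_query(0, 1, now=0.0)
wl.record_query(2, 3, now=10.0)
```

Under this assumption, buckets that recent queries touch accumulate larger weights, so a summary could spend more of its space budget on them; how RHist actually uses the workload information is not described in the abstract.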
