A Survey of Synopsis Construction in Data Streams

The large volume of data streams poses unique space and time constraints on the computation process. Many query processing, database, and mining algorithms require efficient execution, which can be difficult to achieve on a fast data stream. In many cases, it may be acceptable to generate approximate solutions to such problems. In recent years, a number of synopsis structures have been developed that can be used in conjunction with a variety of mining and query processing techniques in data stream processing. Key synopsis methods include sampling, wavelets, sketches, and histograms. In this chapter, we provide a survey of the key synopsis techniques and the mining techniques supported by such methods. We discuss the challenges and tradeoffs associated with the different kinds of techniques, and the important research directions for synopsis construction.
