Sketch Techniques for Approximate Query Processing

Sketch techniques have undergone extensive development within the past few years. They are especially appropriate for the data streaming scenario, in which large quantities of data flow by and the sketch summary must be updated continually, quickly, and in little space. Sketches, as presented here, are designed so that the update caused by each new piece of data is largely independent of the current state of the summary. This design choice makes them fast to process and easy to parallelize. “Frequency based sketches” summarize the observed frequency distribution of a dataset. From these sketches, accurate estimates of individual frequencies can be extracted. This leads to algorithms for finding the approximate heavy hitters (items which account for a large fraction of the frequency mass) and quantiles (the median and its generalizations). The same sketches are also used to estimate (equi)join sizes between relations, self-join sizes, and the answers to range queries. These estimates can be used as primitives within more complex mining operations, and to extract wavelet and histogram representations of streaming data. A different style of sketch construction leads to sketches for distinct-value queries. As mentioned above, using a sample to estimate the answer to a COUNT DISTINCT query does not give accurate results. In contrast, sketching methods, which can make a pass over the whole data, can provide guaranteed accuracy. Once built, these sketches estimate not only the cardinality of a given attribute or combination of attributes, but also the cardinality of various operations performed on them, such as set operations (union and difference) and selections based on arbitrary predicates.
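
To make the idea of a frequency based sketch concrete, the following is a minimal illustration using the Count-Min sketch, one well-known summary of this kind. It is a sketch under stated assumptions, not a reference implementation: the class name, the default width and depth, and the use of BLAKE2 as a stand-in for the pairwise-independent hash functions assumed in the analysis are all choices made here for illustration. It shows the two properties described above: each update adds to a few counters without reading the rest of the summary, and two summaries built separately can be merged by adding their counters.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch; parameters and hashing are assumptions."""

    def __init__(self, width=272, depth=5, seed=0):
        # width ~ e/eps and depth ~ ln(1/delta) in the usual analysis;
        # 272 x 5 roughly corresponds to eps = delta = 0.01.
        self.width = width
        self.depth = depth
        self.seed = seed
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        # Assumption: BLAKE2 stands in for a pairwise-independent hash family.
        key = f"{self.seed}:{row}:{item}".encode()
        digest = hashlib.blake2b(key, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, item, count=1):
        # One counter per row is incremented; no other state is consulted.
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item):
        # With non-negative updates, every row overestimates, so take the minimum.
        return min(
            self.table[row][self._bucket(row, item)] for row in range(self.depth)
        )

    def merge(self, other):
        # Sketches with identical parameters combine by counter-wise addition.
        if (self.width, self.depth, self.seed) != (
            other.width, other.depth, other.seed
        ):
            raise ValueError("sketches must share width, depth, and seed")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


# Usage: summarize two partitions of a stream independently, then merge.
left, right = CountMinSketch(), CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    left.update(word)
for word in ["a", "d"]:
    right.update(word)
left.merge(right)
print(left.estimate("a"))  # at least 4 (the true count); never an underestimate
```

Because the merged table is identical to the table that a single sketch would hold after seeing both partitions, the work of building the summary can be split across machines or threads and combined afterwards.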
