Statistical analysis of sketch estimators

Sketching techniques provide approximate answers to aggregate queries in both data-streaming and distributed computation. Both types of applications require small-space summaries with linearity properties. The prevalent method for analyzing sketches relies on moment analysis and on distribution-independent bounds derived from those moments. This method produces clean, easy-to-interpret theoretical bounds that are especially useful for deriving asymptotic results. However, the theoretical bounds obscure fine details of how the various sketches behave, and for the most part they do not indicate which type of sketch should be used in practice. Moreover, no significant empirical comparison of the various sketching techniques has been published, which makes the choice even harder. In this paper, we take a close look, from a statistical point of view, at the sketching techniques proposed in the literature, with the goal of identifying properties that indicate their actual behavior and of producing tighter confidence bounds. Interestingly, the statistical analysis reveals that two of the techniques, Fast-AGMS and Count-Min, provide results that are in some cases orders of magnitude better than the corresponding theoretical predictions. We conduct an extensive empirical study that compares the different sketching techniques in order to corroborate the statistical analysis and the conclusions we draw from it. The study indicates the expected performance of the various sketches, which is crucial if the techniques are to be used by practitioners. The overall conclusion of the study is that Fast-AGMS sketches are, across the full spectrum of problems, either the best or close to the best sketching technique. This makes Fast-AGMS sketches the preferred choice irrespective of the situation.
