Summarizing data using bottom-k sketches

A bottom-k sketch is a summary of a set of items with nonnegative weights that supports approximate query processing. A sketch is obtained by associating with each item in a ground set an independent random rank, drawn from a probability distribution that depends on the weight of the item, and including the k items with the smallest rank values. Bottom-k sketches are an alternative to k-mins sketches [9], which consist of the minimum-ranked item in each of k independent rank assignments, and to min-hash sketches [5], where hash functions replace random rank assignments. Sketches support approximate aggregations, including the weight and selectivity of a subpopulation. Coordinated sketches of multiple subsets over the same ground set support subset-relation queries such as Jaccard similarity or the weight of the union. All-distances sketches are applicable to datasets whose items lie in some metric space, such as data streams (time) or networks. They compactly encode the respective plain sketches of all neighborhoods of a location, and they support queries posed over time windows or neighborhoods as well as time-decaying and spatially decaying aggregates. An important advantage of bottom-k sketches, established in a line of recent work, is much tighter estimators for several basic aggregates. To realize this benefit, traditional k-mins applications must be adapted to use bottom-k sketches. We propose all-distances bottom-k sketches, and we develop and analyze data structures that incrementally construct bottom-k sketches and all-distances bottom-k sketches. Another advantage of bottom-k sketches is that when the data is represented explicitly, they can be obtained much more efficiently than k-mins sketches. We show that k-mins sketches can be derived from the respective bottom-k sketches, which enables the use of bottom-k sketches with off-the-shelf k-mins estimators. (In fact, we obtain tighter estimators, since each bottom-k sketch is a distribution over k-mins sketches.)
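
As an illustration of the construction described above, the following is a minimal sketch and not the paper's implementation. It assumes exponentially distributed ranks with rate equal to the item weight (one possible weight-dependent rank distribution) and derives ranks from a hash of the item key, so that sketches of different subsets are coordinated. The names `rank`, `bottom_k`, `jaccard_estimate`, and the `seed` parameter are hypothetical. The Jaccard estimator shown applies to the uniform-weight case only.

```python
import hashlib
import heapq
import math


def rank(key, weight, seed="s"):
    """Hash-derived rank for `key`: exponential with rate `weight` (an assumed rank family)."""
    if weight <= 0:
        return math.inf  # zero-weight items are ranked last
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    # Map the first 8 bytes to u in (0, 1]; -log(u) is then exponentially distributed.
    u = (int.from_bytes(digest[:8], "big") + 1) / 2**64
    return -math.log(u) / weight


def bottom_k(items, k, seed="s"):
    """Return the k (rank, key) pairs of smallest rank, in ascending rank order.

    `items` maps each key to its nonnegative weight.
    """
    return heapq.nsmallest(k, ((rank(key, w, seed), key) for key, w in items.items()))


def jaccard_estimate(sketch_a, sketch_b, k):
    """Estimate Jaccard similarity from two coordinated bottom-k sketches (uniform weights)."""
    keys_a = {key for _, key in sketch_a}
    keys_b = {key for _, key in sketch_b}
    # With coordinated ranks, the bottom-k sketch of the union is the
    # k smallest-ranked entries among the two sketches combined.
    union_sketch = heapq.nsmallest(k, set(sketch_a) | set(sketch_b))
    if not union_sketch:
        return 0.0
    hits = sum(1 for _, key in union_sketch if key in keys_a and key in keys_b)
    return hits / len(union_sketch)


if __name__ == "__main__":
    a = {f"x{i}": 1.0 for i in range(1000)}
    b = {f"x{i}": 1.0 for i in range(500, 1500)}
    k = 64
    print(jaccard_estimate(bottom_k(a, k), bottom_k(b, k), k))  # true Jaccard is 1/3
```

Because the rank of a key depends only on the key, its weight, and the seed, an item shared by two subsets receives the same rank in both sketches; this is what allows the sketch of the union to be computed from the two sketches alone.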

[1]  David Wetherall, et al. A protocol-independent technique for eliminating redundant network traffic, 2000, SIGCOMM.

[2]  Devavrat Shah, et al. Computing separable functions via gossip, 2005, PODC '06.

[3]  X. Liy. Dynamic Algorithms in Computational Geometry, 2007.

[4]  Carsten Lund, et al. Flow sampling under hard resource constraints, 2004, SIGMETRICS '04/Performance '04.

[5]  Haim Kaplan, et al. Randomized incremental constructions of three-dimensional convex hulls and planar Voronoi diagrams, and approximate range counting, 2006, SODA '06.

[6]  Edith Cohen, et al. Spatially-decaying aggregation over a network: model and algorithms, 2004, SIGMOD '04.

[7]  Edith Cohen, et al. Bottom-k sketches: better and more efficient estimation of aggregates, 2007, SIGMETRICS '07.

[8]  Edith Cohen, Yi-Min Wang, Gaurav Suri. When Piecewise Determinism Is Almost True, 1995.

[9]  Mario Szegedy, et al. The DLT priority sampling is essentially optimal, 2006, STOC '06.

[10]  Noga Alon, et al. Estimating arbitrary subset sums with few probes, 2005, PODS '05.

[11]  Edith Cohen, et al. Efficient estimation algorithms for neighborhood variance and other moments, 2004, SODA '04.

[12]  Leonidas J. Guibas, et al. Fractional cascading: I. A data structuring technique, 1986, Algorithmica.

[13]  Andrei Z. Broder, et al. On the resemblance and containment of documents, 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[14]  Edith Cohen, et al. Size-Estimation Framework with Applications to Transitive Closure and Reachability, 1997, J. Comput. Syst. Sci.

[15]  Edith Cohen, et al. Finding Interesting Associations without Support Pruning, 2001, IEEE Trans. Knowl. Data Eng.

[16]  Robert E. Tarjan, et al. Making data structures persistent, 1986, STOC '86.

[17]  Alan M. Frieze, et al. Min-Wise Independent Permutations, 2000, J. Comput. Syst. Sci.

[18]  Andrei Z. Broder, et al. Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content, 1999, Comput. Networks.

[19]  Andrei Z. Broder, et al. Identifying and Filtering Near-Duplicate Documents, 2000, CPM.

[20]  Edith Cohen, et al. Finding interesting associations without support pruning, 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[21]  Edith Cohen, et al. Spatially-decaying aggregation over a network, 2007, J. Comput. Syst. Sci.

[22]  Edith Cohen, et al. Maintaining time-decaying stream aggregates, 2006, J. Algorithms.

[23]  Mark de Berg, et al. Computational geometry: algorithms and applications, 1997.

[24]  Philippe Flajolet, et al. Probabilistic Counting Algorithms for Data Base Applications, 1985, J. Comput. Syst. Sci.