Towards Optimal Cardinality Estimation of Unions and Intersections with Sketches

Estimating the cardinality of unions and intersections of sets is a problem of interest in OLAP. Large data applications often require approximate methods based on small sketches of the data. We give new estimators for the cardinality of unions and intersections and show that they approximate an optimal estimation procedure. These estimators allow the improved accuracy of the streaming MinCount sketch to be exploited in distributed settings. Both theoretical and empirical results demonstrate substantial improvements over existing methods.
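
For orientation, the kind of sketch-based estimate that this work aims to improve on can be illustrated with a minimal k-minimum-values (bottom-k) sketch. This is a standard baseline, not the estimators proposed here. The function names, the SHA-256 based hash, and the choice of k are assumptions made for the example.

```python
import hashlib


def hash01(item):
    """Map an item to a pseudo-uniform float in (0, 1] (illustrative hash choice)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)


def kmv_sketch(items, k):
    """Return the k smallest distinct hash values, sorted ascending."""
    return sorted({hash01(x) for x in items})[:k]


def estimate_cardinality(sketch, k):
    """Basic bottom-k estimate: (k - 1) / (k-th smallest hash value)."""
    if len(sketch) < k:
        # Fewer than k distinct items were seen, so the count is exact.
        return float(len(sketch))
    return (k - 1) / sketch[k - 1]


def estimate_union_intersection(a, b, k):
    """Estimate |A u B| and |A n B| from two bottom-k sketches of size k."""
    merged = sorted(set(a) | set(b))[:k]  # bottom-k sketch of the union
    if not merged:
        return 0.0, 0.0
    union_est = estimate_cardinality(merged, k)
    # A hash among the k smallest of the union belongs to both sets
    # exactly when it appears in both individual sketches.
    in_both = set(a) & set(b)
    jaccard = sum(1 for h in merged if h in in_both) / len(merged)
    return union_est, jaccard * union_est


if __name__ == "__main__":
    k = 1024
    sketch_a = kmv_sketch(range(0, 20000), k)
    sketch_b = kmv_sketch(range(10000, 30000), k)
    print(estimate_union_intersection(sketch_a, sketch_b, k))
```

The intersection estimate here is the union estimate scaled by a sampled Jaccard similarity, which is simple but can be noisy when the intersection is small relative to the union.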
