JoinSketch: A Sketch Algorithm for Accurate and Unbiased Inner-Product Estimation

Inner-product estimation underlies many important tasks in big data scenarios, including measuring the similarity of streams in data stream processing, estimating join sizes in databases, and computing cosine similarity in various applications. Sketches, a class of probabilistic algorithms, are promising for inner-product estimation. However, existing sketch solutions suffer from low accuracy because they neglect the high skewness of real data. In this paper, we design a new sketch algorithm for accurate and unbiased inner-product estimation, namely JoinSketch. To improve accuracy, JoinSketch consists of multiple components and records items with different frequencies in different components. We theoretically prove that JoinSketch is unbiased and has lower variance than the well-known AGMS and Fast-AGMS sketches. Experimental results show that JoinSketch improves accuracy by 10 times on average while maintaining comparable speed. All code is open-sourced on GitHub.
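For readers unfamiliar with sketch-based inner-product estimation, the following is a minimal Python sketch of the Fast-AGMS baseline named above, not of JoinSketch itself (the abstract does not specify the layout of its components). The class name `FastAGMS`, its parameters `rows`, `width`, and `seed`, and the use of BLAKE2b for hashing are all illustrative assumptions. Each row hashes an item to one counter and adds a pseudo-random sign of +1 or -1 times the count; the inner product of two streams is then estimated as the median, over rows, of the dot product of the corresponding counter rows.

```python
import hashlib
from collections import Counter
from statistics import median


class FastAGMS:
    """Illustrative Fast-AGMS-style sketch (names and defaults are assumptions)."""

    def __init__(self, rows=5, width=256, seed=0):
        self.rows, self.width, self.seed = rows, width, seed
        self.counters = [[0] * width for _ in range(rows)]

    def _bucket_and_sign(self, item, row):
        # Derive a bucket index and a +1/-1 sign from one 64-bit hash per row.
        data = f"{self.seed}:{row}:{item!r}".encode()
        value = int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
        bucket = (value >> 1) % self.width
        sign = 1 if value & 1 else -1
        return bucket, sign

    def update(self, item, count=1):
        # Add the signed count to exactly one counter in every row.
        for row in range(self.rows):
            bucket, sign = self._bucket_and_sign(item, row)
            self.counters[row][bucket] += sign * count

    def inner_product(self, other):
        # Both sketches must share the same hash functions to be comparable.
        if (self.rows, self.width, self.seed) != (other.rows, other.width, other.seed):
            raise ValueError("sketches must use the same rows, width, and seed")
        per_row = [
            sum(a * b for a, b in zip(row_a, row_b))
            for row_a, row_b in zip(self.counters, other.counters)
        ]
        return median(per_row)


if __name__ == "__main__":
    # Two small skewed streams: a few heavy items and many light ones.
    stream_a = ["x"] * 500 + ["y"] * 50 + [f"a{i}" for i in range(200)]
    stream_b = ["x"] * 300 + ["y"] * 80 + [f"b{i}" for i in range(200)]

    sketch_a, sketch_b = FastAGMS(), FastAGMS()
    for item in stream_a:
        sketch_a.update(item)
    for item in stream_b:
        sketch_b.update(item)

    freq_a, freq_b = Counter(stream_a), Counter(stream_b)
    exact = sum(freq_a[k] * freq_b[k] for k in freq_a)
    print("exact inner product:", exact)
    print("sketch estimate:    ", sketch_a.inner_product(sketch_b))
```

In this baseline, heavy items such as `"x"` share counters with light items, so a collision involving a heavy item can distort a row's estimate considerably. This is the skewness problem that, according to the abstract, JoinSketch addresses by recording items of different frequencies in different components.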
