Randomized Primitives for Big Data Processing

A basic question about two pieces of data is: “How similar are they?” In this extended abstract we give an overview of new developments in randomized algorithms and data structures that relate to this question. In particular, we provide new state-of-the-art methods in three settings, all of which relate to the computation of intersection sizes:

1. We give a new space-efficient summary data structure for answering set intersection size queries. The new summaries are based on one-permutation min-wise hashing, and we provide a lower bound that nearly matches our new upper bound.

2. For sparse matrix multiplication, we give new tight bounds in the I/O model, settling the I/O complexity of a natural parameterization of the problem, namely the one where the complexity depends on the input sparsity N, the output sparsity Z, and the parameters of the I/O model. In the RAM model we give a new algorithm that exploits output sparsity and beats previously known results for most of the parameter space.

3. We give a new I/O-efficient algorithm to compute the similarity join between two sets: two elements are members of this join if they are close according to a specified metric. Our new algorithm is based on locality-sensitive hashing and strictly improves on previous work.
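To make the first setting concrete, the following is a minimal sketch of how a one-permutation min-wise hashing summary can be used to estimate an intersection size. It is not the summary structure developed in this work; it illustrates only the basic textbook scheme. The function names (`oph_sketch`, `estimate_jaccard`, `estimate_intersection`), the choice of BLAKE2b as the hash function, the bin count `k = 256`, and the assumption that the exact set sizes are stored alongside each summary are all illustrative assumptions.

```python
import hashlib


def _hash64(item, seed=0):
    """Map an item to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def oph_sketch(items, k=256, seed=0):
    """One-permutation sketch: a single hash function, k bins, one minimum per bin."""
    bins = [None] * k
    for item in items:
        h = _hash64(item, seed)
        b = h % k    # which bin the item falls into
        v = h // k   # the item's value inside that bin
        if bins[b] is None or v < bins[b]:
            bins[b] = v
    return bins


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching minima among bins that are non-empty in at least one sketch."""
    matches = 0
    both_empty = 0
    for x, y in zip(sketch_a, sketch_b):
        if x is None and y is None:
            both_empty += 1
        elif x == y:
            matches += 1
    used = len(sketch_a) - both_empty
    return matches / used if used else 0.0


def estimate_intersection(sketch_a, sketch_b, size_a, size_b):
    """Convert a Jaccard estimate J into |A ∩ B| = J / (1 + J) * (|A| + |B|).

    Assumes the exact set sizes are known, e.g. stored next to each sketch.
    """
    j = estimate_jaccard(sketch_a, sketch_b)
    return j / (1.0 + j) * (size_a + size_b)
```

For example, for `a = set(range(2000))` and `b = set(range(1000, 3000))`, calling `estimate_intersection(oph_sketch(a), oph_sketch(b), len(a), len(b))` returns an estimate of the true intersection size 1000; the error shrinks as the number of bins `k` grows, at the cost of a larger summary.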
