Space Lower Bounds for Itemset Frequency Sketches

Given a database, computing the fraction of rows that contain a query itemset, or determining whether this fraction is above some threshold, is a fundamental operation in data mining. A uniform sample of rows is a good sketch of the database in the sense that all sufficiently frequent itemsets and their approximate frequencies are recoverable from the sample, and the sketch size is independent of the number of rows in the original database. For many seemingly similar problems, there are sketching algorithms that do better than uniform sampling. In this paper we show that this is not the case for itemset frequency sketching. That is, we prove that there exist classes of databases for which uniform sampling is a space-optimal sketch for approximate itemset frequency analysis, up to constant or iterated-logarithmic factors.
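As an illustration of the uniform-sampling sketch described above, the following minimal Python sketch samples rows with replacement and estimates an itemset's frequency from the sample alone. The function names (`build_sketch`, `frequency`), the toy database, the sample size, and the fixed seed are all assumptions made for this example; they are not taken from the paper.

```python
import random


def build_sketch(rows, sample_size, seed=0):
    """Draw sample_size rows uniformly at random, with replacement.

    The sketch size depends only on sample_size, not on len(rows).
    (Hypothetical helper; seed is fixed for reproducibility.)
    """
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(sample_size)]


def frequency(rows, itemset):
    """Fraction of rows that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for row in rows if itemset <= row) / len(rows)


# Toy database (an assumption): each row is a set of items.
database = (
    [frozenset({"a", "b"})] * 600
    + [frozenset({"a"})] * 300
    + [frozenset({"c"})] * 100
)

sketch = build_sketch(database, sample_size=2000)

exact = frequency(database, {"a", "b"})    # computed on the full database
estimate = frequency(sketch, {"a", "b"})   # computed on the sample only
is_frequent = estimate >= 0.5              # threshold (indicator) query
```

The same sample answers both kinds of query mentioned in the abstract: the approximate frequency (`estimate`) and the threshold test (`is_frequent`). The paper's result is that, for some classes of databases, no sketch can be substantially smaller than such a sample.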
