AMS Without 4-Wise Independence on Product Domains

In their seminal work, Alon, Matias, and Szegedy introduced several sketching techniques and showed, among other results, that $4$-wise independence suffices to obtain good approximations of the second frequency moment. In this work, we show that their sketching technique extends to product domains $[n]^k$ by using the product of $4$-wise independent functions on $[n]$. Our work extends that of Indyk and McGregor, who showed the result for $k = 2$. Their primary motivation was the problem of identifying correlations in data streams. In their model, a stream of pairs $(i,j) \in [n]^2$ arrives, giving a joint distribution $(X,Y)$, and they give algorithms that approximate how close the joint distribution is to the product of the marginal distributions under various metrics, which naturally corresponds to how close $X$ and $Y$ are to being independent. Using our technique, we obtain a new single-pass result for approximating the $\ell_2$ distance between the joint distribution and the product of the marginal distributions for $k$-ary vectors, instead of just pairs. Our analysis gives a randomized algorithm that computes a $(1\pm \epsilon)$-approximation with probability $1-\delta$, using space logarithmic in $n$ and $m$ and proportional to $3^k$.
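
The core idea, an AMS-style sketch whose sign for an item $(x_1,\dots,x_k) \in [n]^k$ is the product of $k$ per-coordinate sign functions, can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the names `SignHash` and `ProductSketch`, the field size `P`, and the plain averaging of repetitions are all assumptions made for illustration. The sign functions are degree-3 polynomials over a prime field reduced to their low bit, which is only approximately $4$-wise independent (the bias is on the order of $1/P$). The example estimates the second frequency moment of the stream only; it does not compute the distance between the joint distribution and the product of the marginals.

```python
import random

# Hypothetical choice of field size: a Mersenne prime much larger than n.
P = (1 << 61) - 1


class SignHash:
    """Sign function on [n] from a random degree-3 polynomial over Z_P.

    Taking the low bit of the polynomial value gives signs that are only
    approximately 4-wise independent (bias on the order of 1/P).
    """

    def __init__(self, rng):
        self.coef = [rng.randrange(P) for _ in range(4)]

    def __call__(self, x):
        acc = 0
        for c in self.coef:  # Horner evaluation of the cubic
            acc = (acc * x + c) % P
        return 1 if acc & 1 else -1


class ProductSketch:
    """AMS-style sketch over [n]^k using products of per-coordinate signs."""

    def __init__(self, k, reps, seed=0):
        rng = random.Random(seed)
        # One independent sign function per coordinate, per repetition.
        self.hashes = [[SignHash(rng) for _ in range(k)] for _ in range(reps)]
        self.z = [0] * reps

    def update(self, item, delta=1):
        """Process one stream element `item` (a k-tuple) with weight `delta`."""
        for r, hs in enumerate(self.hashes):
            sign = 1
            for h, x in zip(hs, item):
                sign *= h(x)
            self.z[r] += delta * sign

    def estimate(self):
        """Average of squared counters: an estimate of the second moment."""
        return sum(v * v for v in self.z) / len(self.z)


if __name__ == "__main__":
    sketch = ProductSketch(k=3, reps=500, seed=1)
    stream = [(1, 2, 3), (1, 2, 3), (4, 5, 6), (7, 8, 9), (4, 5, 6)]
    for triple in stream:
        sketch.update(triple)
    # The exact second moment of this stream is 2^2 + 2^2 + 1^2 = 9.
    print("estimated F2:", sketch.estimate())
```

Each repetition stores a single counter plus $k$ short hash descriptions, so the memory per repetition is independent of the number of distinct items. The number of repetitions needed for a given accuracy is what the analysis bounds; the sketch above simply takes it as a parameter.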
