Measuring independence of datasets

Approximating pairwise, or k-wise, independence with sublinear memory is of considerable importance in the data stream model. In the streaming model the joint distribution is given by a stream of k-tuples, and the goal is to test correlations among the components measured over the entire stream. Indyk and McGregor (SODA '08) recently gave exciting new results for measuring pairwise independence in this model. Statistical distance is one of the most fundamental metrics for measuring the similarity of two distributions, and it has been the metric of choice in many papers that discuss distribution closeness. For pairwise independence, the Indyk and McGregor methods provide a log n-approximation of the statistical distance between the joint and product distributions in the streaming model. Indyk and McGregor leave, as their main open question, the problem of improving this log n-approximation for the statistical distance metric. In this paper we solve this open problem for pairwise independence under statistical distance and extend the result to any constant k. In particular, we present an algorithm that computes an (ε, δ)-approximation of the statistical distance between the joint and product distributions defined by a stream of k-tuples. Our algorithm requires O(((1/ε) log(nm/δ))^((30+k)^k)) memory and a single pass over the data stream.
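To make the measured quantity concrete, the following is a minimal sketch of the exact, offline computation that the streaming algorithm approximates. It is not the paper's algorithm: it stores the whole stream and enumerates the full product domain, so it uses far more than sublinear memory. The function name is hypothetical, and the sketch assumes the convention that statistical distance is half the L1 distance between the empirical joint distribution and the product of the empirical marginals; some papers omit the factor 1/2.

```python
from collections import Counter
from itertools import product
from math import prod


def statistical_distance_from_independence(stream):
    """Exact distance between the empirical joint distribution of a stream
    of k-tuples and the product of its empirical marginals.

    Illustrative baseline only: memory grows with the stream and the domain.
    Assumes statistical distance = 0.5 * L1 distance.
    """
    tuples = [tuple(t) for t in stream]
    m = len(tuples)
    if m == 0:
        raise ValueError("stream must be non-empty")
    k = len(tuples[0])

    # Empirical joint counts and per-coordinate marginal counts.
    joint = Counter(tuples)
    marginals = [Counter(t[i] for t in tuples) for i in range(k)]

    total = 0.0
    # Only points whose coordinates all occur in the stream can carry mass
    # under either distribution, so enumerating their product suffices.
    for point in product(*(marg.keys() for marg in marginals)):
        p_joint = joint.get(point, 0) / m
        p_prod = prod(marginals[i][point[i]] / m for i in range(k))
        total += abs(p_joint - p_prod)
    return 0.5 * total


if __name__ == "__main__":
    independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
    correlated = [(0, 0), (1, 1)]
    print(statistical_distance_from_independence(independent))
    print(statistical_distance_from_independence(correlated))
```

A stream whose components are independent yields a distance of 0, while a perfectly correlated pair of components yields a strictly positive value. The streaming result described above approximates this same quantity in a single pass without materializing the joint or marginal tables.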
