Declaring independence via the sketching of sketches

We consider the problem of identifying correlations in data streams. Surprisingly, our work seems to be the first to consider this natural problem. In the centralized model, we consider a stream of pairs (i,j) ∈ [n]² whose frequencies define a joint distribution (X,Y). In the distributed model, each coordinate of the pair may appear separately in the stream. We present a range of algorithms for approximating to what extent X and Y are independent, i.e., how close the joint distribution is to the product of the marginals. We consider various measures of closeness including ℓ1, ℓ2, and the mutual information between X and Y. Our algorithms are based on "sketching sketches", i.e., composing small-space linear synopses of the distributions. Perhaps ironically, the biggest technical challenges that arise relate to ensuring that different components of our estimates are sufficiently independent.
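
To make the idea of composing linear synopses concrete, here is a minimal sketch of how the ℓ2 case might look in the centralized model. The abstract does not spell out the construction, so everything below is an assumption for illustration: the function name `l2_independence_estimate`, the use of fully random ±1 sign vectors (a small-space algorithm would presumably draw them from limited-independence hash families), and simple averaging over repetitions. The outer product of two sign vectors x and y acts as a sketch of the joint distribution, while the product of the two one-dimensional sketches acts as a sketch of the product of the marginals; the squared difference is an unbiased estimate of the squared ℓ2 distance between the two.

```python
import random


def l2_independence_estimate(stream, n, reps=2000, seed=0):
    """Estimate the squared l2 distance between the empirical joint
    distribution of a stream of pairs (i, j), with 0 <= i, j < n, and the
    product of its marginals.

    Illustrative assumption: signs are fully random, and the stream is a
    list that is re-read for each repetition. A one-pass version would
    maintain all 3 * reps counters simultaneously.
    """
    rng = random.Random(seed)
    m = len(stream)
    total = 0.0
    for _ in range(reps):
        # Independent random sign vectors for the two coordinates.
        x = [rng.choice((-1, 1)) for _ in range(n)]
        y = [rng.choice((-1, 1)) for _ in range(n)]
        t1 = t2 = t3 = 0
        for i, j in stream:
            t1 += x[i] * y[j]  # sketch of the joint frequencies
            t2 += x[i]         # sketch of the first marginal
            t3 += y[j]         # sketch of the second marginal
        # Joint sketch minus the product of the marginal sketches.
        diff = t1 / m - (t2 / m) * (t3 / m)
        total += diff * diff
    return total / reps


if __name__ == "__main__":
    independent = [(i, j) for i in range(4) for j in range(4)]
    correlated = [(i, i) for i in range(4)]
    print(l2_independence_estimate(independent, 4))
    print(l2_independence_estimate(correlated, 4))
```

On a stream whose joint distribution factorizes exactly, the two sketches coincide and the estimate is zero. On the fully correlated stream of pairs (i,i) over n = 4 symbols, the true squared ℓ2 distance is (n−1)/n² = 3/16, and the averaged estimate should concentrate around that value as the number of repetitions grows.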
