What You Can Do with Coordinated Samples

Sample coordination, where similar instances have similar samples, was proposed by statisticians four decades ago as a way to maximize overlap in repeated surveys. Coordinated sampling has since been used for summarizing massive data sets. The usefulness of a sampling scheme hinges on the scope and accuracy with which queries posed over the original data can be answered from the sample. We aim here to gain a fundamental understanding of the limits and potential of coordination. Our main result is a precise characterization, in terms of simple properties of the estimated function, of the queries for which estimators with desirable properties exist. The properties we consider are unbiasedness, nonnegativity, finite variance, and bounded estimates. Since a single estimator generally cannot be optimal (that is, minimize variance simultaneously) for all data, we propose variance competitiveness, which means that the expectation of the squared estimate on any data is not too far from the minimum possible for that data. Perhaps surprisingly, we show how to construct a variance competitive estimator for any function for which an unbiased nonnegative estimator exists.
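To make the notion of coordination concrete, the following is a minimal sketch of one common way to coordinate samples across instances: every key gets a single reproducible random seed, and each instance reuses that seed when deciding whether to include the key. The specific scheme (threshold sampling with inclusion probability min(1, w/tau)), the helper names `shared_seed` and `coordinated_sample`, and the `salt` parameter are assumptions chosen for illustration; they are not taken from the abstract.

```python
import hashlib


def shared_seed(key: str, salt: str = "demo") -> float:
    """Map a key to a reproducible pseudo-random value in [0, 1).

    Hypothetical helper: the same key always gets the same seed,
    regardless of which instance is being sampled.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def coordinated_sample(instance: dict, tau: float, salt: str = "demo") -> dict:
    """Threshold-sample one instance (key -> nonnegative weight).

    A key is kept when weight >= seed * tau, so it is included with
    probability min(1, weight / tau) over the choice of seed. Reusing
    the per-key seed across instances is what coordinates the samples.
    """
    return {
        key: weight
        for key, weight in instance.items()
        if weight > 0 and weight >= shared_seed(key, salt) * tau
    }


# Two similar instances: only the weight of "k0" differs.
instance_a = {f"k{i}": 1.0 + (i % 5) for i in range(200)}
instance_b = dict(instance_a)
instance_b["k0"] = 100.0

sample_a = coordinated_sample(instance_a, tau=10.0)
sample_b = coordinated_sample(instance_b, tau=10.0)
```

In this sketch, any key whose weight is the same in both instances is either in both samples or in neither, which is the overlap that coordination is meant to maximize. Sampling each instance with independent randomness would give no such guarantee.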
