Estimating arbitrary subset sums with few probes

Suppose we have a large table <i>T</i> of items <i>i</i>, each with a weight <i>w<sub>i</sub></i>, e.g., people and their salaries. In a general preprocessing step for estimating arbitrary subset sums, we assign each item a random priority depending on its weight. Suppose we want to estimate the sum of an arbitrary subset <i>I</i> ⊆ <i>T</i>. For any <i>q</i> > 2, considering only the <i>q</i> highest priority items from <i>I</i>, we obtain an unbiased estimator of the sum whose relative standard deviation is <i>O</i>(1/√<i>q</i>). Thus, to get an expected approximation factor of 1 ± ε, it suffices to consider <i>O</i>(1/ε<sup>2</sup>) items from <i>I</i>. Our estimator needs no knowledge of the number of items in the subset <i>I</i>, but we can also estimate that number if we want to estimate averages. The above scheme performs the same role as the on-line aggregation of Hellerstein et al. (SIGMOD'97), but it has the advantage of good expected performance for any possible sequence of weights. In particular, the performance does not deteriorate in the common case of heavy-tailed weight distributions. This point is illustrated experimentally with both real and synthetic data. We also show that our approach can be used to improve Cohen's size estimation framework (FOCS'94).
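The abstract does not spell out how priorities are drawn or how the estimate is formed, so the following is only a minimal sketch under an assumed rule, the usual priority sampling one: item <i>i</i> gets priority <i>w<sub>i</sub></i>/<i>u<sub>i</sub></i> with <i>u<sub>i</sub></i> uniform in (0, 1]; the smallest of the <i>q</i> priorities considered serves as a threshold τ; and each of the other <i>q</i> − 1 items contributes max(<i>w<sub>i</sub></i>, τ). The function names `assign_priorities` and `estimate_subset_sum` are hypothetical, and the sketch scans the subset directly instead of using a preprocessed index that would return the top items with few probes.

```python
import heapq
import random


def assign_priorities(weights, rng):
    """Preprocessing: one random priority per item (assumed rule: w / u).

    u is drawn uniformly from (0, 1], so every priority is at least
    the item's weight and heavy items tend to get high priorities.
    """
    return [w / (1.0 - rng.random()) for w in weights]


def estimate_subset_sum(weights, priorities, subset, q):
    """Estimate the total weight of `subset` from its q highest-priority items.

    `subset` is an iterable of item indices. The q-th highest priority
    acts as the threshold tau; the q - 1 items above it each contribute
    max(weight, tau). If the subset has fewer than q items, all of them
    were inspected, so the exact sum is returned.
    """
    top = heapq.nlargest(q, subset, key=lambda i: priorities[i])
    if len(top) < q:
        return float(sum(weights[i] for i in top))
    tau = priorities[top[-1]]  # threshold: lowest of the q priorities seen
    return float(sum(max(weights[i], tau) for i in top[:-1]))
```

With a heavy-tailed input such as fifty items of weight 1 and one item of weight 1000, the heavy item almost always lands among the highest priorities and contributes its own weight, which is the intuition behind the claim that performance does not deteriorate for heavy-tailed weights.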

[1] Doron Rotem, et al. Random sampling from databases: a survey, 1995.

[2] M. P. Singh. Sampling with unequal probabilities, 1986.

[3] Carsten Lund, et al. Flow sampling under hard resource constraints, 2004, SIGMETRICS '04/Performance '04.

[4] Edith Cohen, et al. Size-Estimation Framework with Applications to Transitive Closure and Reachability, 1997, J. Comput. Syst. Sci.

[5] Carsten Lund, et al. Learn more, sample less: control of volume and variance in network measurement, 2005, IEEE Transactions on Information Theory.

[6] Mario Szegedy, et al. Near optimality of the priority sampling procedure, 2005, Electron. Colloquium Comput. Complex.

[7] Mark A. McComb. A Practical Guide to Heavy Tails, 2000, Technometrics.

[8] Jeffrey Scott Vitter, et al. Random sampling with a reservoir, 1985, TOMS.

[9] Lars Arge, et al. External Memory Data Structures, 2001, ESA.

[10] Jeffrey Scott Vitter, et al. On two-dimensional indexability and optimal range search indexing, 1999, PODS '99.

[11] Sridhar Ramaswamy, et al. The Aqua approximate query answering system, 1999, SIGMOD '99.

[12] Noga Alon, et al. The Probabilistic Method, 2015, Fundamentals of Ramsey Theory.

[13] Narayanaswamy Balakrishnan, et al. Relations, Bounds and Approximations for Order Statistics, 1989.

[14] Robert B. Ash, et al. Information Theory, 2020, The SAGE International Encyclopedia of Mass Media and Society.

[15] James N. Rosenau, et al. To learn more, 2004, IEEE Potentials.

[16] Kihong Park, et al. On the relationship between file sizes, transport protocols, and self-similar network traffic, 1996, Proceedings of 1996 International Conference on Network Protocols (ICNP-96).

[17] Helen J. Wang, et al. Online aggregation, 1997, SIGMOD '97.

[18] Rajeev Motwani, et al. On random sampling over joins, 1999, SIGMOD '99.

[19] Yossi Matias, et al. New sampling-based summary statistics for improving approximate query answers, 1998, SIGMOD '98.