Estimating Sum by Weighted Sampling

We study the classic problem of estimating the sum of n variables. The traditional uniform sampling approach requires a linear number of samples to provide any non-trivial guarantee on the estimated sum. In this paper we consider sampling methods other than uniform sampling, in particular sampling a variable with probability proportional to its value, referred to as linear weighted sampling. If only linear weighted sampling is allowed, we show an algorithm that estimates the sum with O(√n) samples, which is almost optimal in the sense that Ω(√n) samples are necessary for any reasonable sum estimator. If both uniform sampling and linear weighted sampling are allowed, we show a sum estimator that uses O(∛n) samples. More generally, we may allow general weighted sampling, where the probability of sampling a variable is proportional to any function of its value. We prove a lower bound of Ω(∛n) samples for any reasonable sum estimator using general weighted sampling, which implies that our algorithm combining uniform and linear weighted sampling is an almost optimal sum estimator.
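To make the two sampling models concrete, here is a minimal Python sketch. It is not the estimator from this paper. It only simulates uniform sampling and linear weighted sampling, and adds a simple harmonic-mean estimator that works under two extra assumptions not made in the abstract: every variable is strictly positive, and n is known. Under linear weighted sampling, variable i is drawn with probability x_i / S, so E[1/x] = n / S and hence S = n / E[1/x]. The function names are hypothetical, and the simulator itself needs the weights in order to draw samples; a real estimator would only see the sampled values.

```python
import random


def linear_weighted_sample(values, k, rng):
    """Draw k indices with replacement; index i has probability values[i] / sum(values).

    Simulation only: the sampler is given the values as weights.
    """
    return rng.choices(range(len(values)), weights=values, k=k)


def uniform_estimate(values, k, rng):
    """Baseline: n times the mean of k uniformly sampled values."""
    n = len(values)
    picks = [values[rng.randrange(n)] for _ in range(k)]
    return n * sum(picks) / k


def harmonic_estimate(values, k, rng):
    """Illustrative estimator (an assumption, not the paper's algorithm).

    Requires every value to be strictly positive and n to be known.
    Uses E[1/x] = n / S under linear weighted sampling, so S ~ n / mean(1/x).
    """
    n = len(values)
    indices = linear_weighted_sample(values, k, rng)
    mean_inverse = sum(1.0 / values[i] for i in indices) / k
    return n / mean_inverse


if __name__ == "__main__":
    rng = random.Random(0)
    # One large variable among many small ones: uniform sampling rarely sees it.
    values = [1.0] * 9999 + [1_000_000.0]
    print("true sum:         ", sum(values))
    print("uniform estimate: ", uniform_estimate(values, 100, rng))
    print("harmonic estimate:", harmonic_estimate(values, 100, rng))
```

On skewed inputs like the one in the demo, uniform sampling usually misses the single large variable, whereas linear weighted sampling draws it about as often as all the small ones combined. The harmonic-mean shortcut breaks down when some variables are zero, because linear weighted sampling never returns them and the number of nonzero variables is then unknown.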
