Weighted Reservoir Sampling from Distributed Streams

We consider message-efficient continuous random sampling from a distributed stream, where the probability of including an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights are equal, is well studied and admits tight upper and lower bounds on message complexity. For weighted sampling with replacement, there is a simple reduction to unweighted sampling with replacement. However, in many applications the stream may contain only a few heavy items, which can dominate a random sample when chosen with replacement. Weighted sampling without replacement (weighted SWOR) avoids this issue, since each heavy item can be sampled at most once. In this work, we present the first message-optimal algorithm for weighted SWOR from a distributed stream. Our algorithm also has optimal space and time complexity. As an application of our algorithm for weighted SWOR, we derive the first distributed streaming algorithms for tracking heavy hitters with residual error. Here the goal is to identify stream items that contribute significantly to the residual stream once the heaviest items are removed. Residual heavy hitters generalize the notion of $\ell_1$ heavy hitters and are important in streams that have a skewed distribution of weights. In addition to the upper bound, we provide a lower bound on the message complexity that is tight up to a $\log(1/\epsilon)$ factor. Finally, we use our weighted sampling algorithm to improve the message complexity of distributed $L_1$ tracking, also known as count tracking, a widely studied problem in distributed streaming. We also derive a tight message lower bound, which settles the message complexity of this fundamental problem.
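To make the notion of weighted SWOR concrete, the following is a minimal single-stream sketch in Python. It is not the distributed, message-optimal algorithm described above; it assumes the classical centralized key-based technique of Efraimidis and Spirakis, in which each item with weight $w$ draws a uniform $u \in (0, 1]$, receives the key $\ln(u)/w$, and the $s$ items with the largest keys form the sample. The function name `weighted_swor` and its parameters are hypothetical, chosen only for illustration.

```python
import heapq
import math
import random


def weighted_swor(stream, s, rng=None):
    """Illustrative centralized weighted sampling without replacement.

    `stream` yields (item, weight) pairs with weight > 0.
    Returns at most `s` items; each stream position appears at most once.
    This is a sketch of the key-based technique, not the distributed protocol.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (key, position, item); the root holds the smallest kept key
    for pos, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # assumption: non-positive weights are skipped
        u = 1.0 - rng.random()        # uniform in (0, 1], so log(u) is defined
        key = math.log(u) / weight    # heavier items tend to get keys closer to 0
        entry = (key, pos, item)
        if len(heap) < s:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)  # evict the current smallest key
    return [item for _, _, item in heap]


if __name__ == "__main__":
    # One heavy item among many light ones: it can appear at most once.
    data = [("heavy", 1000.0)] + [(f"x{i}", 1.0) for i in range(50)]
    print(weighted_swor(data, 5, random.Random(0)))
```

Because every stream position receives exactly one key, a heavy item occupies at most one slot of the sample, which is the property that distinguishes SWOR from sampling with replacement.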
