Continuous distributed counting for non-monotonic streams

We consider the continual count tracking problem in a distributed environment where the input is an aggregate stream that originates from k distinct sites and the updates are allowed to be non-monotonic, i.e. both increments and decrements are allowed. The goal is to continually track the count within a prescribed relative accuracy ε at the lowest possible communication cost. Specifically, we consider an adversarial setting where the input values are selected and assigned to sites by an adversary but the order is according to a random permutation or is a random i.i.d process. The input stream of values is allowed to be non-monotonic with an unknown drift -1≤μ=1 where the case μ = 1 corresponds to the special case of a monotonic stream of only non-negative updates. We show that a randomized algorithm guarantees to track the count accurately with high probability and has the expected communication cost Õ(min√k/(|#956;|ε), √k n/ε, n}), for an input stream of length n, and establish matching lower bounds. This improves upon previously best known algorithm whose expected communication cost is Θ(min√k/ε,n]) that applies only to an important but more restrictive class of monotonic input streams, and our results are substantially more positive than the communication complexity of Ω(n) under fully adversarial input. We also show how our framework can also accommodate other types of random input streams, including fractional Brownian motion that has been widely used to model temporal long-range dependencies observed in many natural phenomena. Last but not least, we show how our non-monotonic counter can be applied to track the second frequency moment and to a Bayesian linear regression problem.

[1]  Graham Cormode,et al.  Algorithms for distributed functional monitoring , 2008, SODA '08.

[2]  Joseph M. Hellerstein,et al.  GraphLab: A New Framework For Parallel Machine Learning , 2010, UAI.

[3]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[4]  Graham Cormode,et al.  Holistic aggregates in a networked world: distributed tracking of approximate quantiles , 2005, SIGMOD '05.

[5]  S. Muthukrishnan,et al.  Data streams: algorithms and applications , 2005, SODA '03.

[6]  Graham Cormode,et al.  Sketching Streams Through the Net: Distributed Approximate Query Tracking , 2005, VLDB.

[7]  Qin Zhang,et al.  Optimal sampling from distributed streams , 2010, PODS '10.

[8]  Takis Konstantopoulos,et al.  MARKOV CHAINS AND RANDOM WALKS , 2009 .

[9]  D. Applebaum Stable non-Gaussian random processes , 1995, The Mathematical Gazette.

[10]  Walter Willinger,et al.  On the self-similar nature of Ethernet traffic , 1993, SIGCOMM '93.

[11]  David P. Woodruff,et al.  Optimal Random Sampling from Distributed Streams Revisited , 2011, DISC.

[12]  Chrisil Arackaparambil,et al.  Functional Monitoring without Monotonicity , 2009, ICALP.

[13]  Murad S. Taqqu,et al.  On the Self-Similar Nature of Ethernet Traffic , 1993, SIGCOMM.

[14]  Jennifer Widom,et al.  Adaptive filters for continuous queries over distributed data streams , 2003, SIGMOD '03.

[15]  Qin Zhang,et al.  Optimal tracking of distributed heavy hitters and quantiles , 2009, PODS.

[16]  W. Hoeffding Probability Inequalities for sums of Bounded Random Variables , 1963 .

[17]  Peter J. Haas,et al.  Synopses for Massive Data , 2012 .

[18]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[19]  Sanjeev Khanna,et al.  Space-efficient online computation of quantile summaries , 2001, SIGMOD '01.

[20]  Qin Zhang,et al.  Randomized algorithms for tracking distributed count, frequencies, and ranks , 2012, PODS '12.

[21]  Sanjeev Khanna,et al.  Power-conserving computation of order-statistics over sensor networks , 2004, PODS.

[22]  R. Serfling Probability Inequalities for the Sum in Sampling without Replacement , 1974 .

[23]  Graham Cormode,et al.  What’s Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[24]  Noga Alon,et al.  The space complexity of approximating the frequency moments , 1996, STOC '96.