Variability in Data Streams

We consider the problem of tracking with small relative error an integer function f(n) defined by a distributed update stream f'(n) in the distributed monitoring model. In this model, there are k sites over which the updates f'(n) are distributed, and they must communicate with a central coordinator to maintain an estimate of f(n). Existing streaming algorithms with worst-case guarantees for this problem assume f(n) to be monotone; there are very large lower bounds on the space requirements for summarizing a distributed non-monotonic stream, often linear in the size n of the stream. However, the input streams attaining these lower bounds are highly variable, making relatively large jumps from one timestep to the next; in practice, the impact on f(n) of any single update f'(n) is usually small. What has heretofore been lacking is a framework for non-monotonic streams that admits algorithms whose worst-case performance is as good as existing algorithms for monotone streams and degrades gracefully for non-monotonic streams as those streams vary more quickly. In this paper we propose such a framework. We introduce a stream parameter, the "variability" v, deriving its definition in a way that shows it to be a natural parameter to consider for non-monotonic streams. It is also a useful parameter. From a theoretical perspective, we can adapt existing algorithms for monotone streams to work for non-monotonic streams, with only minor modifications, in such a way that they reduce to the monotone case when the stream happens to be monotone, and in such a way that we can refine the worst-case communication bounds from Θ(n) to Õ(v). From a practical perspective, we demonstrate that v can be small in practice by proving that v is O(log f(n)) for monotone streams and o(n) for streams that are "nearly" monotone or that are generated by random walks. We expect v to be o(n) for many other interesting input classes as well.
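The abstract does not state the definition of v. As a rough illustration only, the sketch below assumes one plausible form, v(n) = Σ_{t≤n} min(1, |f'(t)| / |f(t)|), with a term capped at 1 when f(t) = 0; both the formula and the helper name `variability` are assumptions made here, not taken from the text above. Under that assumption, it contrasts a monotone unit-increment stream with a stream that alternates between +1 and -1.

```python
import math

def variability(updates):
    """Hypothetical helper: sum of min(1, |f'(t)| / |f(t)|) over the stream.

    This formula is an ASSUMED definition used for illustration only.
    """
    total = 0.0
    f = 0  # running value f(t), starting from f(0) = 0
    for delta in updates:
        f += delta
        if delta == 0:
            continue  # an empty update contributes nothing
        # When f(t) == 0 the ratio is undefined; cap the term at 1 (assumption).
        total += 1.0 if f == 0 else min(1.0, abs(delta) / abs(f))
    return total

n = 10_000
monotone = [1] * n                                         # f(t) = t
alternating = [1 if t % 2 == 0 else -1 for t in range(n)]  # f flips between 1 and 0

v_monotone = variability(monotone)        # harmonic number H_n, roughly ln n
v_alternating = variability(alternating)  # every term equals 1, so the sum is n

print(f"monotone:    v = {v_monotone:.3f}  (ln n = {math.log(n):.3f})")
print(f"alternating: v = {v_alternating:.0f}  (n = {n})")
```

For the monotone stream each term is 1/t, so the sum is the harmonic number H_n, which grows like ln n and is consistent with the O(log f(n)) claim. The alternating stream keeps f(t) at 0 or 1, so every update contributes a full unit and v grows linearly in n, matching the intuition that highly variable streams are the hard case.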

Rafail Ostrovsky | David Felber
