Fully decentralized computation of aggregates over data streams

In several emerging applications, data is collected in massive streams at several distributed points of observation. A basic and challenging task is to allow every node to monitor a neighbourhood of interest by issuing continuous aggregate queries on the streams observed in its vicinity. This class of algorithms is fully decentralized and diffusive in nature: collecting all data at few central nodes of the network is unfeasible in networks of low capability devices or in the presence of massive data sets. The main difficulty in designing diffusive algorithms is to cope with duplicate detections. These arise both from the observation of the same event at several nodes of the network and/or receipt of the same aggregated information along multiple paths of diffusion. In this paper, we consider fully decentralized algorithms that answer locally continuous aggregate queries on the number of distinct events, total number of events and the second frequency moment in the scenario outlined above. The proposed algorithms use in the worst case or on realistic distributions sublinear space at every node. We also propose strategies that minimize the communication needed to update the aggregates when new events are observed. We experimentally evaluate for the efficiency and accuracy of our algorithms on realistic simulated scenarios.

[1]  Luca Becchetti,et al.  Efficient semi-streaming algorithms for local triangle counting in massive graphs , 2008, KDD.

[2]  Minos N. Garofalakis,et al.  Streaming in a connected world: querying and tracking distributed data streams , 2006, VLDB.

[3]  Rajeev Rastogi,et al.  Efficient gossip-based aggregate computation , 2006, PODS.

[4]  Graham Cormode,et al.  What’s Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[5]  Eran Shir,et al.  DIMES: let the internet measure itself , 2005, CCRV.

[6]  Graham Cormode,et al.  Space efficient mining of multigraph streams , 2005, PODS.

[7]  Marios Hadjieleftheriou,et al.  Robust Sketching and Aggregation of Distributed Data Streams , 2005 .

[8]  Srinivasan Seshan,et al.  Synopsis diffusion for robust aggregation in sensor networks , 2004, SenSys '04.

[9]  Sanjeev Khanna,et al.  Power-conserving computation of order-statistics over sensor networks , 2004, PODS.

[10]  Graham Cormode,et al.  An improved data stream summary: the count-min sketch and its applications , 2004, J. Algorithms.

[11]  Jeffrey Considine,et al.  Approximate aggregation techniques for sensor databases , 2004, Proceedings. 20th International Conference on Data Engineering.

[12]  Johannes Gehrke,et al.  Gossip-based computation of aggregate information , 2003, 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings..

[13]  Christopher Olston,et al.  Distributed top-k monitoring , 2003, SIGMOD '03.

[14]  Sanjoy Dasgupta,et al.  An elementary proof of a theorem of Johnson and Lindenstrauss , 2003, Random Struct. Algorithms.

[15]  Luca Trevisan,et al.  Counting Distinct Elements in a Data Stream , 2002, RANDOM.

[16]  Phillip B. Gibbons Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports , 2001, VLDB.

[17]  P. Indyk Stable distributions, pseudorandom generators, embeddings and data stream computation , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[18]  Martin Arlitt,et al.  A workload characterization study of the 1998 World Cup Web site , 2000, IEEE Netw..

[19]  Jessica H. Fong,et al.  An Approximate Lp Difference Algorithm for Massive Data Streams , 1999, Discret. Math. Theor. Comput. Sci..

[20]  Phillip B. Gibbons,et al.  Tracking join and self-join sizes in limited storage , 1999, J. Comput. Syst. Sci..

[21]  Philippe Flajolet,et al.  Probabilistic Counting Algorithms for Data Base Applications , 1985, J. Comput. Syst. Sci..

[22]  Arne Kemper On Graph Problems in a Semi-streaming Model , 2003 .