Sketch-based Geometric Monitoring of Distributed Stream Queries

Emerging large-scale monitoring applications rely on continuous tracking of complex data-analysis queries over collections of massive, physically-distributed data streams. Thus, in addition to the space- and time-efficiency requirements of conventional stream processing (at each remote monitor site), effective solutions also need to guarantee communication efficiency (over the underlying communication network). The complexity of the monitored query adds to the difficulty of the problem -- this is especially true for nonlinear queries (e.g., joins), where no obvious solutions exist for distributing the monitor condition across sites. The recently proposed geometric method offers a generic methodology for splitting an arbitrary (non-linear) global threshold-monitoring task into a collection of local site constraints; still, the approach relies on maintaining the complete stream(s) at each site, thus raising serious efficiency concerns for massive data streams. In this paper, we propose novel algorithms for efficiently tracking a broad class of complex aggregate queries in such distributed-streams settings. Our tracking schemes rely on a novel combination of the geometric method with compact sketch summaries of local data streams, and maintain approximate answers with provable error guarantees, while optimizing space and processing costs at each remote site and communication cost across the network. One of our key technical insights for the effective use of the geometric method lies in exploiting a much lower-dimensional space for monitoring the sketch-based estimation query. Due to the complex, highly nonlinear nature of these estimates, efficiently monitoring the local geometric constraints poses challenging algorithmic issues for which we propose novel solutions. Experimental results on real-life data streams verify the effectiveness of our approach.

[1]  Wei Hong,et al.  Model-Driven Data Acquisition in Sensor Networks , 2004, VLDB.

[2]  Assaf Schuster,et al.  Shape Sensitive Geometric Monitoring , 2008, IEEE Transactions on Knowledge and Data Engineering.

[3]  Jennifer Widom,et al.  Adaptive filters for continuous queries over distributed data streams , 2003, SIGMOD '03.

[4]  Graham Cormode,et al.  Communication-efficient distributed monitoring of thresholded counts , 2006, SIGMOD Conference.

[5]  S. Muthukrishnan,et al.  One-Pass Wavelet Decompositions of Data Streams , 2003, IEEE Trans. Knowl. Data Eng..

[6]  Graham Cormode,et al.  An improved data stream summary: the count-min sketch and its applications , 2004, J. Algorithms.

[7]  S. Muthukrishnan,et al.  How to Summarize the Universe: Dynamic Maintenance of Quantiles , 2002, VLDB.

[8]  Assaf Schuster,et al.  A Geometric Approach to Monitoring Threshold Functions over Distributed Data Streams , 2010, Ubiquitous Knowledge Discovery.

[9]  Rajeev Rastogi,et al.  Processing set expressions over continuous update streams , 2003, SIGMOD '03.

[10]  Rajeev Rastogi,et al.  Processing complex aggregate queries over data streams , 2002, SIGMOD '02.

[11]  Phillip B. Gibbons Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports , 2001, VLDB.

[12]  Graham Cormode,et al.  What's hot and what's not: tracking most frequent items dynamically , 2003, PODS '03.

[13]  Sanjeev Khanna,et al.  Power-conserving computation of order-statistics over sensor networks , 2004, PODS '04.

[14]  Dimitris Sacharidis,et al.  Fast Approximate Wavelet Tracking on Streams , 2006, EDBT.

[15]  Sudipto Guha,et al.  Dynamic multidimensional histograms , 2002, SIGMOD '02.

[16]  Theodore Johnson,et al.  Gigascope: a stream database for network applications , 2003, SIGMOD '03.

[17]  Assaf Schuster,et al.  Prediction-based geometric monitoring over distributed data streams , 2012, SIGMOD Conference.

[18]  Christopher Olston,et al.  Finding (recently) frequent items in distributed data streams , 2005, 21st International Conference on Data Engineering (ICDE'05).

[19]  Wei Hong,et al.  Approximate Data Collection in Sensor Networks using Probabilistic Models , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[20]  Graham Cormode,et al.  Large-Scale Distributed Computation (NII Shonan Meeting 2012-1) , 2012, NII Shonan Meet. Rep..

[21]  Graham Cormode,et al.  Approximate continuous querying over distributed streams , 2008, TODS.

[22]  Noga Alon,et al.  Tracking join and self-join sizes in limited storage , 1999, PODS '99.

[23]  Graham Cormode,et al.  Holistic aggregates in a networked world: distributed tracking of approximate quantiles , 2005, SIGMOD '05.

[24]  Sanjeev Khanna,et al.  Space-efficient online computation of quantile summaries , 2001, SIGMOD '01.

[25]  Noga Alon,et al.  The space complexity of approximating the frequency moments , 1996, STOC '96.

[26]  Wei Hong,et al.  The design of an acquisitional query processor for sensor networks , 2003, SIGMOD '03.

[27]  Abhinandan Das,et al.  Distributed Set Expression Cardinality Estimation , 2004, VLDB.

[28]  Graham Cormode,et al.  Streaming in a connected world: querying and tracking distributed data streams , 2008, EDBT '08.

[29]  Edward Y. Chang,et al.  Adaptive stream resource management using Kalman Filters , 2004, SIGMOD '04.

[30]  Moses Charikar,et al.  Finding frequent items in data streams , 2002, Theor. Comput. Sci..

[31]  Christopher Olston,et al.  Distributed top-k monitoring , 2003, SIGMOD '03.