A geometric approach to monitoring threshold functions over distributed data streams

Monitoring data streams in a distributed system has been the focus of much research in recent years. Most of the proposed schemes, however, deal with monitoring simple aggregated values, such as the frequency of appearance of items in the streams. More involved challenges, such as the important task of feature selection (e.g., by monitoring the information gain of various features), still incur very high communication overhead when handled by naive, centralized algorithms. We present a novel geometric approach that reduces monitoring the value of a function (vis-à-vis a threshold) to a set of constraints applied locally on each of the streams. The constraints are used to locally filter out data increments that do not affect the monitoring outcome, thus avoiding unnecessary communication. As a result, our approach enables efficient monitoring of arbitrary threshold functions over distributed data streams. We present experimental results on real-world data which demonstrate that our algorithms are highly scalable and considerably reduce communication load in comparison to centralized algorithms.
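To make the idea of a local constraint concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not spell out the constraint, so the sketch assumes one geometric form it could take: each node tests the ball whose diameter runs from a shared estimate vector to the node's own drift vector, and stays silent as long as the monitored function cannot cross the threshold anywhere on that ball. The monitored function is assumed here to be the Euclidean norm, chosen only because its range over a ball has a closed form. All names (`ball_from`, `norm_range_on_ball`, `must_report`) are hypothetical.

```python
import math


def ball_from(estimate, drift):
    """Ball whose diameter is the segment from the shared estimate
    to this node's drift vector (assumed form of the local constraint)."""
    center = [(e + u) / 2 for e, u in zip(estimate, drift)]
    radius = math.dist(estimate, drift) / 2
    return center, radius


def norm_range_on_ball(center, radius):
    """Exact min and max of the Euclidean norm over a ball."""
    c = math.hypot(*center)
    return max(0.0, c - radius), c + radius


def must_report(estimate, drift, threshold):
    """True if the norm could cross the threshold somewhere on the ball,
    i.e. the node can no longer filter the update locally."""
    center, radius = ball_from(estimate, drift)
    lo, hi = norm_range_on_ball(center, radius)
    return lo <= threshold <= hi


# Illustrative values only (not from the paper).
estimate = [1.0, 1.0]
threshold = 3.0
small_change = must_report(estimate, [1.2, 0.9], threshold)
large_change = must_report(estimate, [3.0, 3.0], threshold)
```

In this sketch a small local change keeps the whole ball on one side of the threshold, so nothing is sent, while a large change makes the ball straddle the threshold and forces the node to communicate. For a general function the range over the ball would have to be bounded by other means, which the sketch does not attempt.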
