State Monitoring in Cloud Datacenters

Monitoring the global state of a distributed cloud application is a critical function of cloud datacenter management. State monitoring must meet two demanding objectives: a high level of correctness, meaning a zero or low error rate, and high communication efficiency, meaning minimal communication cost in detecting state updates. Most existing work follows an instantaneous model, which triggers a state alert whenever a constraint is violated. This model may cause frequent and unnecessary alerts due to momentary value bursts and outliers, and the countermeasures taken in response to such alerts may in turn cause problematic operations. In this paper, we present a WIndow-based StatE monitoring (WISE) framework for efficiently managing cloud applications. Window-based state monitoring reports an alert only when a state violation is continuous within a time window. We show that it is not only more resilient to value bursts and outliers, but also able to save considerable communication when implemented in a distributed manner, based on four technical contributions. First, we present the architectural design and deployment options for window-based state monitoring with centralized parameter tuning. Second, we develop a new distributed parameter tuning scheme that enables WISE to scale to many more monitoring nodes, as each node tunes its monitoring parameters reactively without global information. Third, we introduce two optimization techniques, including their design rationale, correctness, and usage model, to further reduce the communication cost. Finally, we provide an in-depth empirical study of the scalability of WISE and evaluate the improvement brought by the distributed tuning scheme and the two performance optimizations. Our results show that WISE reduces communication by 50-90 percent compared with instantaneous monitoring approaches, and that the improved WISE gains a clear scalability advantage over its centralized version.
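The difference between the instantaneous model and the window-based model can be illustrated with a minimal sketch. It assumes a single, already aggregated series of state values sampled at fixed intervals, and it takes "continuous within a time window" to mean that the value exceeds the threshold for a given number of consecutive samples. The function names, the sample data, and the parameters `threshold` and `window` are hypothetical and are not part of WISE itself; the sketch does not cover the distributed protocol or parameter tuning.

```python
from typing import List, Sequence


def instantaneous_alerts(values: Sequence[float], threshold: float) -> List[int]:
    """Return every sample index at which the value exceeds the threshold."""
    return [t for t, v in enumerate(values) if v > threshold]


def window_alerts(values: Sequence[float], threshold: float, window: int) -> List[int]:
    """Return sample indices at which the value has exceeded the threshold
    for at least `window` consecutive samples."""
    alerts = []
    run = 0  # length of the current run of consecutive violations
    for t, v in enumerate(values):
        run = run + 1 if v > threshold else 0
        if run >= window:
            alerts.append(t)
    return alerts


# Hypothetical samples: a one-sample burst at index 2 and a
# sustained violation over indices 5 to 8.
samples = [10, 12, 95, 11, 13, 90, 92, 91, 93, 12]

print(instantaneous_alerts(samples, threshold=80))     # burst and sustained violation
print(window_alerts(samples, threshold=80, window=3))  # sustained violation only
```

In this sketch the momentary burst at index 2 raises an alert under the instantaneous model but is ignored by the window-based check, which only fires once the violation has persisted for three consecutive samples.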
