Resilient Self-Compressive Monitoring for Large-Scale Hosting Infrastructures

Large-scale hosting infrastructures have become the fundamental platforms for many real-world systems such as cloud computing infrastructures, enterprise data centers, and massive data processing systems. However, it is a challenging task to achieve both scalability and high precision while monitoring a large number of intranode and internode attributes (e.g., CPU usage, free memory, free disk, internode network delay). In this paper, we present the design and implementation of a Resilient self-Compressive Monitoring (RCM) system for large-scale hosting infrastructures. RCM achieves scalable distributed monitoring by performing online data compression to reduce remote data collection cost. RCM provides failure resilience to achieve robust monitoring for dynamic distributed systems where host and network failures are common. We have conducted extensive experiments using a set of real monitoring data from NCSU's virtual computing lab (VCL), PlanetLab, a Google cluster, and real Internet traffic matrices. The experimental results show that RCM can achieve up to 200 percent higher compression ratio and several orders of magnitude less overhead than the existing approaches.

[1]  Shan Lu,et al.  Flight data recorder: monitoring persistent-state interactions to improve systems management , 2006, OSDI '06.

[2]  Abraham Lempel,et al.  Compression of individual sequences via variable-rate coding , 1978, IEEE Trans. Inf. Theory.

[3]  Martin Burtscher,et al.  VPC3: a fast and effective trace-compression algorithm , 2004, SIGMETRICS '04/Performance '04.

[4]  Yin Zhang,et al.  STAR: Self-Tuning Aggregation for Scalable Monitoring , 2007, VLDB.

[5]  Ling Huang,et al.  Communication-Efficient Online Detection of Network-Wide Anomalies , 2007, IEEE INFOCOM 2007 - 26th IEEE International Conference on Computer Communications.

[6]  Michael Dahlin,et al.  A scalable distributed information management system , 2004, SIGCOMM.

[7]  Abhishek Chandra,et al.  Resource Bundles: Using Aggregation for Statistical Wide-Area Resource Discovery and Allocation , 2008, 2008 The 28th International Conference on Distributed Computing Systems.

[8]  Haixun Wang,et al.  Online Anomaly Prediction for Robust Cluster Systems , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[9]  WillingerWalter,et al.  Spatio-temporal compressive sensing and internet traffic matrices , 2009 .

[10]  Zhenhuan Gong,et al.  Self-correlating predictive information tracking for large-scale production systems , 2009, ICAC '09.

[11]  David E. Culler,et al.  The ganglia distributed monitoring system: design, implementation, and experience , 2004, Parallel Comput..

[12]  Xiaohui Gu,et al.  OLIC: OnLine Information Compression for scalable hosting infrastructure monitoring , 2011, 2011 IEEE Nineteenth IEEE International Workshop on Quality of Service.

[13]  Brian D. Noble,et al.  Exploiting Availability Prediction in Distributed Systems , 2006, NSDI.

[14]  David E. Culler,et al.  A blueprint for introducing disruptive technology into the Internet , 2003, CCRV.

[15]  Wei Hong,et al.  Model-Driven Data Acquisition in Sensor Networks , 2004, VLDB.

[16]  Marcos K. Aguilera,et al.  Detecting failures in distributed systems with the Falcon spy network , 2011, SOSP.

[17]  Klara Nahrstedt,et al.  Self-Configuring Information Management for Large-Scale Service Overlays , 2007, IEEE INFOCOM 2007 - 26th IEEE International Conference on Computer Communications.

[18]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[19]  Jeffrey S. Chase,et al.  Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control , 2004, OSDI.

[20]  Xiaohui Gu,et al.  On Predictability of System Anomalies in Real World , 2010, 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

[21]  Amin Vahdat,et al.  Design and implementation tradeoffs for wide-area resource discovery , 2005, HPDC-14. Proceedings. 14th IEEE International Symposium on High Performance Distributed Computing, 2005..

[22]  Xin-She Yang,et al.  Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.

[23]  Sang Hyuk Son,et al.  RESTORE: A Real-Time Event Correlation and Storage Service for Sensor Networks , 2006 .

[24]  Robbert van Renesse,et al.  Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining , 2003, TOCS.

[25]  Salim Hariri,et al.  Autonomic Computing: An Overview , 2004, UPP.

[26]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[27]  Zhenhuan Gong,et al.  PRESS: PRedictive Elastic ReSource Scaling for cloud systems , 2010, 2010 International Conference on Network and Service Management.

[28]  I.F. Akyildiz,et al.  Spatial correlation-based collaborative medium access control in wireless sensor networks , 2006, IEEE/ACM Transactions on Networking.

[29]  Walter Willinger,et al.  Spatio-temporal compressive sensing and internet traffic matrices , 2009, SIGCOMM '09.

[30]  Kai-Kuang Ma,et al.  A new diamond search algorithm for fast block-matching motion estimation , 2000, IEEE Trans. Image Process..

[31]  Edward Y. Chang,et al.  Adaptive stream resource management using Kalman Filters , 2004, SIGMOD '04.

[32]  Navendu Jain,et al.  Adaptive Control of Extreme-scale Stream Processing Systems , 2006, 26th IEEE International Conference on Distributed Computing Systems (ICDCS'06).

[33]  Zhenhuan Gong,et al.  PAC: Pattern-driven Application Consolidation for Efficient Cloud Computing , 2010, 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

[34]  Thomas H. Cormen,et al.  Introduction to algorithms [2nd ed.] , 2001 .

[35]  Thomas Sikora,et al.  Trends and Perspectives in Image and Video Coding , 2005, Proceedings of the IEEE.

[36]  Steve Uhlig,et al.  Providing public intradomain traffic matrices to the research community , 2006, CCRV.

[37]  Srinivasan Seshan,et al.  Mercury: supporting scalable multi-attribute range queries , 2004, SIGCOMM '04.