RFDMon: A Real-time and Fault-tolerant Distributed System Monitoring Approach

One of the main requirements for building an autonomic system is a robust monitoring framework. This paper presents RFDMon, a systematic distributed event-based (DEB) system monitoring framework, for measuring system variables (CPU utilization, memory utilization, disk utilization, network utilization, etc.), system health (temperature and voltage of the motherboard and CPU), application performance variables (application response time, queue size, and throughput), and scientific application data structures (PBS information and MPI variables) accurately, with minimum latency, at a specified rate, and with controllable resource utilization. The framework is designed to be tolerant to faults in the monitoring framework, self-configuring (it can start and stop monitoring nodes and configure monitors with threshold values/changes for publishing measurements), aware of the framework's execution on multiple nodes through HEARTBEAT messages, extensive (it monitors multiple parameters through periodic and aperiodic sensors), resource constrainable (the computational resources available to monitors can be limited), and expandable (extra monitors can be added on the fly). Because RFDMon uses Data Distribution Service (DDS) middleware, it can be deployed in systems with heterogeneous nodes. Additionally, it can cap the resources consumed by monitoring processes, which reduces their effect on the availability of resources for applications.
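
As a rough illustration of two ideas mentioned above (publishing a measurement only when it changes by a configured threshold, and announcing liveness through HEARTBEAT messages), the following is a minimal Python sketch. It is not RFDMon's implementation and does not use DDS: the names `ThresholdSensor`, `Node`, `poll`, and `tick` are hypothetical, and a plain list of tuples stands in for the middleware's publish step.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ThresholdSensor:
    """Hypothetical periodic sensor that publishes only on significant change."""
    name: str
    read: Callable[[], float]      # assumed callback returning the current reading
    threshold: float               # minimum absolute change that triggers a publish
    last_published: Optional[float] = None

    def poll(self) -> Optional[Tuple[str, float]]:
        """Read once; return (name, value) if it should be published, else None."""
        value = self.read()
        first = self.last_published is None
        if first or abs(value - self.last_published) >= self.threshold:
            self.last_published = value
            return (self.name, value)
        return None


@dataclass
class Node:
    """Hypothetical monitored node; a list of tuples stands in for DDS topics."""
    node_id: str
    sensors: List[ThresholdSensor] = field(default_factory=list)

    def tick(self) -> List[tuple]:
        """One monitoring period: always emit a heartbeat, then any changed readings."""
        messages: List[tuple] = [("HEARTBEAT", self.node_id, None)]
        for sensor in self.sensors:
            sample = sensor.poll()
            if sample is not None:
                messages.append(("MEASUREMENT", self.node_id, sample))
        return messages
```

For example, a sensor with `threshold=2.0` fed the readings 10.0, 10.5, and 13.0 would publish on the first and third ticks only, while a heartbeat is emitted on every tick. Suppressing small changes is one simple way to keep monitoring traffic bounded; the actual framework may use different rules.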
