GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems

Gossip protocols have proven to be an effective means of detecting failures asynchronously in large distributed systems, without the limitations associated with reliable multicasting for group communication. In this paper, we discuss the development and features of the Gossip-Enabled Monitoring Service (GEMS), a highly responsive and scalable resource monitoring service that tracks health and performance information in heterogeneous distributed systems. GEMS offers several novel and essential features, such as detection of network partitions and dynamic insertion of new nodes into the service. Easily extensible, GEMS also incorporates facilities for distributing arbitrary system and application-specific data. We present experiments and analytical projections that demonstrate scalability, fast response times, and low resource utilization, making GEMS a potent solution for resource monitoring in distributed computing.
