Wide area cluster monitoring with Ganglia

In this paper, we present a structure for monitoring a large set of computational clusters. We illustrate methods for scaling a monitor network composed of many clusters while keeping processing requirements low. We provide a design for presenting high-level Web-based summaries of the monitor network, along with a generalization to a distributed, multiple-resolution monitoring tree. We emphasize scalability, fast query response, fault tolerance, and grid compatibility, and we present experimental evidence that demonstrates the performance of our design.
