Job-centric cluster monitoring

This paper describes a system for monitoring jobs on large computational clusters. The aim is to extract the information most useful for understanding the complete life-cycle of a job by combining and organising data from multiple sources. Information is taken from the batch scheduler and from collectors running on each node. The collectors gather information about the processes associated with each job, as well as general operating system and device statistics. Heuristics are applied to extract information that could help a client tune its job submission strategy for better throughput on the cluster, and to determine how effectively the provisioned resources are being utilised. Data is stored for post-mortem analysis and data-mining by other tools. Ways of utilising this service in a grid computing environment are also discussed.
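As an illustration of combining scheduler data with per-node collector data, the following minimal sketch computes one possible utilisation heuristic: the fraction of provisioned core time that a job actually consumed. It is not the system described in the paper. The record types `JobRecord` and `ProcessSample`, their fields, and the function `cpu_efficiency` are hypothetical names chosen for this example, and the sketch assumes collectors report cumulative CPU seconds per job per node.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobRecord:
    """Hypothetical record as it might be reported by a batch scheduler."""
    job_id: str
    nodes: List[str]
    cores_per_node: int
    start: float  # wall-clock start time, in seconds
    end: float    # wall-clock end time, in seconds


@dataclass
class ProcessSample:
    """Hypothetical sample as it might be reported by a per-node collector."""
    job_id: str
    node: str
    cpu_seconds: float  # cumulative CPU time of the job's processes on this node


def cpu_efficiency(job: JobRecord, samples: List[ProcessSample]) -> float:
    """Return consumed CPU time divided by provisioned core time for one job."""
    # Keep the largest cumulative reading seen for each node of this job.
    latest: Dict[str, float] = {}
    for s in samples:
        if s.job_id == job.job_id and s.node in job.nodes:
            latest[s.node] = max(latest.get(s.node, 0.0), s.cpu_seconds)

    used = sum(latest.values())
    provisioned = (job.end - job.start) * job.cores_per_node * len(job.nodes)
    return used / provisioned if provisioned > 0 else 0.0


job = JobRecord("42", ["n1", "n2"], cores_per_node=4, start=0.0, end=100.0)
samples = [
    ProcessSample("42", "n1", 100.0),
    ProcessSample("42", "n1", 300.0),
    ProcessSample("42", "n2", 100.0),
    ProcessSample("7", "n1", 999.0),  # belongs to a different job, ignored
]
efficiency = cpu_efficiency(job, samples)
```

A low value from a heuristic of this kind would suggest that a job requested more cores than it used, which is the sort of signal a client could use when tuning a submission strategy.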
