Analysis of the Jobs Resource Utilization on a Production System

In the HPC community, the System Utilization metric is used to determine whether the resources of a cluster are efficiently used by the batch scheduler. This metric assumes that all allocated resources (memory, disk, processors, etc.) are utilized full-time. To optimize system performance, we have to consider the effective physical consumption of jobs relative to their resource allocations. This information gives insight into whether the cluster resources are efficiently used by the jobs. In this work we propose an analysis of production clusters based on job resource utilization. The principle is to simultaneously collect traces from the job scheduler (provided by logs) and the resource consumption of jobs. The latter is achieved with a job monitoring tool that we developed, whose measured impact on the system is lightweight (0.35% slowdown). The key point is to statistically analyze both traces to detect and explain underutilization of resources. This could make it possible to detect abnormal behavior and bottlenecks in the cluster that lead to poor scalability, and to justify optimizations such as gang scheduling or best-effort scheduling. This method has been applied to two medium-sized production clusters over a period of eight months.
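As an illustration of the principle, the following minimal Python sketch joins a scheduler trace with per-job consumption samples and computes, for each job, the ratio of CPU time consumed to CPU time allocated. It is not the monitoring tool described above. The record layouts (`SchedulerRecord`, `UsageSample`), the function names, and the 0.5 threshold are all hypothetical, chosen only to make the idea concrete for the CPU resource.

```python
from dataclasses import dataclass


@dataclass
class SchedulerRecord:
    """One job as seen by the batch scheduler (hypothetical layout)."""
    job_id: str
    cores: int      # number of cores allocated to the job
    start: float    # start time, in seconds
    end: float      # end time, in seconds


@dataclass
class UsageSample:
    """One monitoring sample for a job (hypothetical layout)."""
    job_id: str
    cpu_seconds: float  # CPU time consumed since the previous sample, summed over cores


def cpu_utilization(records, samples):
    """Return {job_id: consumed CPU time / allocated CPU time}."""
    consumed = {}
    for s in samples:
        consumed[s.job_id] = consumed.get(s.job_id, 0.0) + s.cpu_seconds

    ratios = {}
    for r in records:
        allocated = r.cores * (r.end - r.start)
        if allocated <= 0:
            continue  # skip malformed or zero-length records
        ratios[r.job_id] = consumed.get(r.job_id, 0.0) / allocated
    return ratios


def underutilized(ratios, threshold=0.5):
    """Return the sorted ids of jobs whose utilization is below the (assumed) threshold."""
    return sorted(job for job, u in ratios.items() if u < threshold)


if __name__ == "__main__":
    records = [
        SchedulerRecord("a", cores=4, start=0.0, end=100.0),
        SchedulerRecord("b", cores=2, start=0.0, end=50.0),
    ]
    samples = [
        UsageSample("a", 100.0),
        UsageSample("a", 60.0),
        UsageSample("b", 95.0),
    ]
    ratios = cpu_utilization(records, samples)
    print(ratios)
    print(underutilized(ratios))
```

In this toy input, job `a` consumes 160 CPU-seconds out of 400 allocated, so it would be counted as fully utilized by an allocation-based metric while actually using well under half of its allocation.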
