Comprehensive job level resource usage measurement and analysis for XSEDE HPC systems

This paper presents a methodology for comprehensive job-level resource usage measurement and analysis, applications of the analyses to planning for HPC systems, and a case study applying the methodology to the XSEDE Ranger and Lonestar4 systems at the University of Texas. The methodology comprises system-wide collection of resource usage and performance statistics at the job and node levels, followed by mapping and storage of the resulting per-job data in a relational database, which simplifies transforming the data into the formats required by specific statistical and analytical algorithms. Analyses can then be carried out at different levels of granularity: per job, per user, or system-wide. Measurements are based on TACC_Stats [1], a novel lightweight job-centric measurement tool that gathers a comprehensive set of metrics on all compute nodes. The data mapping and analysis tools will be provided as an extension to the XDMoD project [2] for the XSEDE community. The paper also reports preliminary results from the analysis of data measured on the Texas Advanced Computing Center's Lonestar4 and Ranger supercomputers. The case studies presented indicate the level of detailed information that will be available for all resources once TACC_Stats is deployed throughout the XSEDE system. The methodology can be applied to any system that runs the TACC_Stats measurement tool.
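To make the data flow concrete, the sketch below (Python with SQLite) illustrates the kind of pipeline the abstract describes: per-node, per-job metric summaries are mapped into a relational schema and then aggregated at job-level and user-level granularity. The table layout, column names, and sample values are illustrative assumptions for this sketch only; they are not the actual TACC_Stats output format or the XDMoD database schema.

```python
# Illustrative sketch (not the real TACC_Stats/XDMoD schema): map per-node,
# per-job metric summaries into a relational store and aggregate them at
# different levels of granularity.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (
    job_id    TEXT PRIMARY KEY,
    user_name TEXT,
    nodes     INTEGER,
    wall_secs REAL
);
CREATE TABLE node_metrics (            -- one row per (job, node) summary
    job_id     TEXT REFERENCES jobs(job_id),
    node_name  TEXT,
    flops      REAL,                   -- floating-point operations on the node
    mem_gb_max REAL,                   -- peak memory use on the node (GB)
    lustre_mb  REAL                    -- parallel file system traffic (MB)
);
""")

# Hypothetical records, such as a TACC_Stats post-processing step might emit.
conn.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?)",
                 [("1001", "alice", 2, 3600.0),
                  ("1002", "bob",   1, 1800.0)])
conn.executemany("INSERT INTO node_metrics VALUES (?, ?, ?, ?, ?)",
                 [("1001", "c401-001", 2.0e12, 28.5, 512.0),
                  ("1001", "c401-002", 1.9e12, 27.9, 498.0),
                  ("1002", "c402-101", 4.0e11, 10.2,  64.0)])

# Job-level granularity: aggregate node summaries to one row per job.
per_job = conn.execute("""
    SELECT j.job_id, j.user_name,
           SUM(m.flops)      AS total_flops,
           MAX(m.mem_gb_max) AS peak_mem_gb,
           SUM(m.lustre_mb)  AS io_mb
    FROM jobs j JOIN node_metrics m USING (job_id)
    GROUP BY j.job_id, j.user_name
""").fetchall()

# User-level granularity: roll jobs up to one row per user (node-seconds consumed).
per_user = conn.execute("""
    SELECT user_name, COUNT(*) AS job_count, SUM(wall_secs * nodes) AS node_secs
    FROM jobs GROUP BY user_name
""").fetchall()

print(per_job)
print(per_user)
```

A system-wide roll-up follows the same pattern, aggregating over all jobs rather than grouping by job or user.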

[1] Jianpeng Ma, et al. CHARMM: The biomolecular simulation program, 2009, J. Comput. Chem.

[2] Allen D. Malony, et al. Knowledge support and automation for performance analysis with PerfExplorer 2.0, 2008, Sci. Program.

[3] Allen D. Malony, et al. The Tau Parallel Performance System, 2006, Int. J. High Perform. Comput. Appl.

[4] Tjerk P. Straatsma, et al. NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations, 2010, Comput. Phys. Commun.

[5] D. Sánchez-Portal, et al. The SIESTA method for ab initio order-N materials simulation, 2001, cond-mat/0111138.

[6] J. Simonoff. Multivariate Density Estimation, 1996.

[7] Markus Geimer, et al. Further Improving the Scalability of the Scalasca Toolset, 2010, PARA.

[8] Si Liu, et al. System-level monitoring of floating-point performance to improve effective system utilization, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[9] Lars Koesterke, et al. PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[10] Nathan R. Tallent, et al. HPCToolkit: performance tools for scientific computing, 2008.

[11] Gregor von Laszewski, et al. Performance metrics and auditing framework using application kernels for high-performance computer systems, 2013, Concurr. Comput. Pract. Exp.