Enabling comprehensive data-driven system management for large computational facilities

This paper presents a tool chain, based on the open source tool TACC_Stats, for systematic and comprehensive job level resource use measurement for large cluster computers, and its incorporation into XDMoD, a reporting and analytics framework for resource management that targets meeting the information needs of users, application developers, systems administrators, systems management and funding managers. Accounting, scheduler and event logs are integrated with system performance data from TACC_Stats. TACC_Stats periodically records resource use including many hardware counters for each job running on each node. Furthermore, system level metrics are obtained through aggregation of the node (job) level data. Analysis of this data generates many types of standard and custom reports and even a limited predictive capability that has not previously been available for open-source, Linux-based software systems. This paper presents case studies of information that can be applied for effective resource management. We believe this system to be the first fully comprehensive system for supporting the information needs of all stakeholders in open-source software based HPC systems.

[1]  Nathan R. Tallent,et al.  HPCToolkit: performance tools for scientific computing , 2008 .

[2]  Gregor von Laszewski,et al.  Performance metrics and auditing framework using application kernels for high‐performance computer systems , 2013, Concurr. Comput. Pract. Exp..

[3]  Markus Geimer,et al.  Further Improving the Scalability of the Scalasca Toolset , 2010, PARA.

[4]  Allen D. Malony,et al.  Knowledge support and automation for performance analysis with PerfExplorer 2.0 , 2008 .

[5]  Lars Koesterke,et al.  PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[6]  Arshad Jhumka,et al.  Linking Resource Usage Anomalies with System Failures from Cluster Log Data , 2013, 2013 IEEE 32nd International Symposium on Reliable Distributed Systems.

[7]  Si Liu,et al.  System-level monitoring of floating-point performance to improve effective system utilization , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[8]  Tommy Minyard,et al.  End-to-end framework for fault management for open source clusters: Ranger , 2010, TG.

[9]  Barton P. Miller,et al.  The Paradyn Parallel Performance Measurement Tool , 1995, Computer.

[10]  J. Simonoff Multivariate Density Estimation , 1996 .

[11]  Allen D. Malony,et al.  The Tau Parallel Performance System , 2006, Int. J. High Perform. Comput. Appl..

[12]  Allen D. Malony,et al.  Knowledge support and automation for performance analysis with PerfExplorer 2.0 , 2008, Sci. Program..