Comprehensive Resource Use Monitoring for HPC Systems with TACC Stats

This paper reports on TACC Stats, a comprehensive, fully automated resource use monitoring package that enables consultants, users, and other stakeholders in an HPC system to systematically and actively identify jobs and applications that could benefit from expert support, and that aids in the diagnosis of software and hardware issues. TACC Stats continuously collects and analyzes resource usage data for every job run on a system. It differs significantly from conventional profilers because it requires no action on the part of users or consultants: it is always collecting data on every node for every job. TACC Stats is open source and downloadable, configurable and compatible with general Linux-based computing platforms, and extensible to new CPU architectures and hardware devices. It is meant to provide a comprehensive resource usage monitoring solution. In addition to describing TACC Stats, the paper illustrates its application to identifying production jobs that have inefficient resource use characteristics.
