Understanding Application and System Performance Through System-Wide Monitoring

TACC Stats is a continuous monitoring tool for HPC systems that collects data at the core and process level for every job executing on a monitored system. That data can be aggregated at the system, group, user, application, job, node, or core level. TACC Stats has been in production use for about 5 years and is now used by numerous HPC systems around the world. This paper reports on a major new version of TACC Stats and the additional analyses that it makes possible. The data collected now covers a comprehensive range of metrics spanning all system resources, including energy consumption, vectorization, I/O activity, and network activity, as well as a full set of computationally oriented metrics. TACC Stats also includes a new capability that enables online monitoring of the resource-use data that is gathered. TACC Stats automatically customizes itself for different chip architectures and has been extended to execute on Cray systems. In addition to describing the new capabilities, we describe several analyses, some of which incorporate the new data, such as I/O behavior. These analyses and reports can provide insight for identifying performance issues with jobs and applications, diagnosing system and job errors, and understanding the resource needs of users.
