Cray System Monitoring Successes, Requirements, and Priorities

Effective HPC system operations and utilization require unprecedented insight into system state, applications’ demands for resources, contention for shared resources, and system demands on center power and cooling. Monitoring can provide such insights when the necessary fundamental capabilities for data availability and usability are provided. In this paper, multiple Cray sites seek to motivate monitoring as a core capability in HPC design by presenting success stories in which monitoring and analysis led to enhanced understanding and improved performance and/or operations. We present the utility and limitations of, and the gaps in, the data necessary to enable the required insights. The capabilities developed to achieve these successes drive our identification and prioritization of monitoring system requirements. Ultimately, we seek to engage all HPC stakeholders to drive community and vendor progress on these priorities.
