A Review of Supercomputer Performance Monitoring Systems

High Performance Computing (HPC) is now one of the emerging fields in computer science and its applications. Top HPC facilities, supercomputers, offer great opportunities for modeling diverse processes, allowing more and better products to be created without full-scale experiments. Current supercomputers and the applications that run on them are very complex and are therefore hard to use efficiently. Performance monitoring systems are tools that help in understanding the efficiency of supercomputing applications and the overall functioning of a supercomputer. These systems collect data on what happens on a supercomputer (performance data, performance metrics) and present it in a way that allows conclusions to be drawn about performance issues in programs running on the supercomputer. In this paper we give an overview of existing performance monitoring systems designed for or used on supercomputers. We compare the performance monitoring systems found in the literature, describe problems that emerge in monitoring large-scale HPC systems, and outline our vision of the future direction of HPC monitoring system development.
