Enabling Continuous Testing of HPC Systems Using ReFrame

Regression testing of HPC systems is crucial for ensuring the quality of service offered to end users. At the same time, it is a great challenge for systems and application engineers to continuously maintain regression tests that cover as many aspects of the user experience as possible. In this paper, we briefly present ReFrame, a framework for writing regression tests for HPC systems, and describe how CSCS, NERSC and OSC use it to continuously test their systems. ReFrame is designed to abstract away the complexity of the interactions with the system and to separate the logic of a regression test from the low-level details that pertain to the system configuration and setup. Regression tests in ReFrame are simple Python classes that specify the basic parameters of the test plus any additional logic. The framework loads the test and sends it down a well-defined pipeline that takes care of its execution. ReFrame can be easily set up on any cluster, and its straightforward invocation allows it to be integrated with common continuous integration/deployment (CI/CD) tools in order to perform continuous testing of an HPC system. Finally, its ability to feed the collected performance data to well-known log channels, such as Syslog, Graylog or, simply, parsable log files, also makes it a powerful tool for continuously monitoring the health of the system from the user's perspective.
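
To make the separation between test logic and execution concrete, the following is a minimal sketch in plain Python, using only the standard library. It is not ReFrame's API: the names `EchoTest`, `run_pipeline`, and the three-stage pipeline are hypothetical and chosen only to illustrate the idea of a test class that declares its parameters while a framework-side function drives it through fixed stages.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class EchoTest:
    """Hypothetical test: declares parameters and a check, no execution logic."""
    name: str = "echo_test"
    executable: str = sys.executable  # assumption: reuse the running interpreter
    executable_opts: list = field(
        default_factory=lambda: ["-c", "print('Hello, HPC')"]
    )
    expected: str = "Hello, HPC"

    def check(self, stdout: str) -> bool:
        # Test-specific logic: the expected string must appear in the output.
        return self.expected in stdout


def run_pipeline(test) -> dict:
    """Hypothetical framework side: setup -> run -> check, same for every test."""
    cmd = [test.executable, *test.executable_opts]                 # setup
    result = subprocess.run(cmd, capture_output=True, text=True)   # run
    passed = result.returncode == 0 and test.check(result.stdout)  # check
    return {"test": test.name, "passed": passed}
```

Calling `run_pipeline(EchoTest())` returns a small dictionary that could be serialized to a log file; the test class itself never needs to know how the command is launched.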
