Towards Management of Energy Consumption in HPC Systems with Fault Tolerance

High-performance computing (HPC) systems continue to grow in computing power and energy efficiency. Total energy consumption nevertheless continues to rise, and finding ways to limit or reduce it is a central concern of current research. High-performance MPI applications can use fault tolerance methods based on rollback recovery, such as uncoordinated checkpointing. With these methods, only some processes roll back when a failure occurs, while the remaining processes continue to run. In this article, we focus on the processes that continue execution and propose a series of strategies to manage energy consumption when a failure occurs and uncoordinated checkpoints are used. We present an energy model to evaluate the strategies and, through simulation, analyze the behavior of an application under different configurations and failure times. The results show that it is feasible to improve energy efficiency in HPC systems in the presence of a failure.

Keywords: Energy consumption, energy saving, power management, fault tolerance, uncoordinated checkpoint, HPC, distributed memory, MPI, DVFS, ACPI
