To checkpoint or not to checkpoint: Understanding energy-performance-I/O tradeoffs in HPC checkpointing

As the scale of high-performance computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as two serious design concerns that are expected to become more challenging in future Exascale systems. Efficiently running systems at such large scales therefore requires an in-depth understanding of the performance and energy costs associated with different fault tolerance techniques. The most commonly used fault tolerance method is checkpoint/restart. Over the years, checkpoint scheduling policies have traditionally been optimized and analyzed from a performance perspective; the energy profile of these policies, and how to optimize them for energy savings rather than performance, remain poorly understood. In this paper, we provide an extensive analysis of the energy/performance tradeoffs associated with an array of checkpoint scheduling policies, including policies that we propose as well as a few existing ones from the literature. We estimate the energy overhead of a given checkpointing policy and provide simple formulas to optimize checkpoint scheduling for energy savings, with or without a bound on runtime. We then evaluate and compare the runtime-optimized and energy-optimized versions of the different methods using trace-driven simulations based on failure logs from 10 production HPC clusters. Our results show ample room for achieving high energy savings with low runtime overhead when using non-constant (adaptive) checkpointing methods that exploit the characteristics of HPC failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem, identify the policies best suited for I/O savings, and study how to optimize for energy under a bound on I/O time.
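To illustrate the kind of "simple formula" the abstract refers to, the sketch below computes Young's classic first-order optimal checkpoint interval, tau = sqrt(2*C*M), where C is the checkpoint cost and M the mean time between failures. The energy-weighted variant is purely hypothetical (it is not the paper's formula): it assumes one can rescale the checkpoint cost by the ratio of power drawn while checkpointing to power drawn while computing and then apply the same square-root rule.

```python
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's (1974) first-order optimal checkpoint interval:
    tau = sqrt(2 * C * M), with C the time to write one checkpoint
    and M the mean time between failures, both in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

def energy_weighted_interval(checkpoint_cost_s: float, mtbf_s: float,
                             p_checkpoint_w: float, p_compute_w: float) -> float:
    """Hypothetical energy-oriented variant (illustrative only, not the
    paper's derivation): weight the checkpoint cost by the ratio of
    checkpointing power (mostly I/O) to compute power, then reuse
    Young's square-root rule on the effective cost."""
    effective_cost_s = checkpoint_cost_s * (p_checkpoint_w / p_compute_w)
    return math.sqrt(2.0 * effective_cost_s * mtbf_s)

# Example: 10-minute checkpoints against a 24-hour MTBF gives an
# interval of roughly 10182 s (~2.8 hours) between checkpoints.
tau = young_interval(600, 24 * 3600)
```

If checkpointing draws less power than computation (p_checkpoint_w < p_compute_w), the energy-weighted rule yields a shorter interval than the runtime-optimal one, i.e. more frequent checkpoints, which matches the intuition that cheap-in-energy checkpoints should be taken more often when optimizing for energy.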
