Orchestrating Fault Prediction with Live Migration and Checkpointing

Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure uncertainty cause significant application overhead. This paper develops an adaptive multi-level C/R model that incorporates a failure prediction and analysis model and orchestrates failure prediction, checkpointing, checkpoint frequency, and proactive live migration, with the additional benefit of Burst Buffers (BBs). The model reduces the overheads due to failures, checkpointing, and recovery. Simulation results for the Summit supercomputer show a reduction of approximately 20% to 86% in application overhead due to BBs, orchestrated failure prediction, and migration. We also observe a decrease of approximately 29% in checkpoint writes to BBs, which can increase the longevity of the BB storage devices.
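
To make the role of checkpoint frequency concrete, the sketch below computes a checkpoint interval using Young's classic first-order approximation, interval = sqrt(2 * C * M), where C is the time to write one checkpoint and M is the mean time between failures. This is not the adaptive model developed in the paper. It is a well-known baseline, and the function name and all numeric values are hypothetical, chosen only to show why a faster checkpoint target such as a burst buffer changes the interval.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds (C).
    mtbf_s: mean time between failures, in seconds (M).
    Returns sqrt(2 * C * M), in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical values, for illustration only (not taken from the paper).
MTBF_S = 24 * 3600.0      # assumed mean time between failures: one day
PFS_WRITE_S = 600.0       # assumed checkpoint write time to the PFS
BB_WRITE_S = 60.0         # assumed checkpoint write time to a burst buffer

pfs_interval = young_interval(PFS_WRITE_S, MTBF_S)
bb_interval = young_interval(BB_WRITE_S, MTBF_S)

print(f"PFS checkpoint interval: {pfs_interval:.0f} s")
print(f"BB checkpoint interval:  {bb_interval:.0f} s")
```

Under this baseline, a cheaper checkpoint yields a shorter interval, so less work is lost per failure. A static formula of this kind does not account for failure prediction or live migration, which is the gap an adaptive model such as the one described above aims to address.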
