When is multi-version checkpointing needed?

The scaling of semiconductor technology and increasing power concerns, combined with system scale, make fault management a growing concern in high-performance computing systems. A greater variety of errors, higher error rates, longer detection intervals, and "silent" errors are all expected. Traditional checkpointing models and systems assume that error detection is nearly immediate, so preserving a single checkpoint is sufficient for resilience. We define a richer model for future systems that captures the reality of latent errors, i.e., errors that go undetected for some time, and use it to derive optimal checkpoint intervals for systems with latent errors. With that model, we explore the importance of multi-version checkpoint systems. Our results highlight the limits of single-checkpoint systems, showing that two to more than a dozen checkpoints may be needed to achieve acceptable error coverage. Further, to achieve reasonable system efficiency, multiple versions (two to seventeen) may be needed. We study several specific exascale machine scenarios; the results show that two checkpoints are always beneficial, and that as many as three are beneficial when checkpoint overheads are reduced.
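To make the contrast between immediate and latent error detection concrete, the following is a minimal Python sketch. It is not the model defined in this work. It uses Young's classical first-order approximation for the checkpoint interval, sqrt(2 × checkpoint cost × mean time between failures), which assumes immediate detection, and then applies a deliberately simplified worst-case count of how many checkpoint versions must be retained when an error can stay undetected for up to a fixed latency. The function names, the fixed-latency bound, and the periodic-checkpoint assumption are all illustrative assumptions.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimum checkpoint interval.

    Both arguments use the same time unit (e.g. seconds). This classical
    formula assumes errors are detected immediately.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def versions_needed(max_detection_latency: float, interval: float) -> int:
    """Worst-case number of retained checkpoints (illustrative assumption).

    Assumes checkpoints are taken every `interval` time units and that an
    error is detected at most `max_detection_latency` after it occurs.
    Up to ceil(latency / interval) checkpoints can be taken between the
    error and its detection, and all of those may be corrupted, so one
    additional older version is required for a clean rollback.
    """
    if max_detection_latency < 0 or interval <= 0:
        raise ValueError("latency must be >= 0 and interval must be > 0")
    return math.ceil(max_detection_latency / interval) + 1


if __name__ == "__main__":
    interval = young_interval(checkpoint_cost=2.0, mtbf=100.0)
    print("interval:", interval)
    for latency in (0.0, 20.0, 50.0):
        print("latency", latency, "-> versions", versions_needed(latency, interval))
```

With zero latency the count reduces to the traditional single checkpoint; as the latency bound grows relative to the interval, the required number of versions grows with it, which is the qualitative effect the abstract describes.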
