Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell-to-Rank Remapping

Application resilience must be addressed before exascale systems can be realized. Some programming models, such as task-DAG (directed acyclic graph) architectures, already embed resilience features, whereas traditional SPMD (single program, multiple data) and message-passing models do not. Because a large part of the community's code base follows the latter models, it remains necessary to exploit application characteristics to minimize the overhead of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower cost in faulty environments. In particular, this paper targets enabling online, semi-transparent local recovery for stencil computations on current leadership-class systems and presents the required programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduct...
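The paper's runtime mechanisms are not reproduced here, but the ghost (halo) cell pattern that ghost region expansion builds on can be sketched. In the minimal 1-D Jacobi kernel below, the outer `ghost` cells hold copies of neighboring ranks' boundary data; widening that region lets a rank take several sweeps between halo exchanges (and, in the paper's setting, gives surviving ranks data to continue with while a failed neighbor recovers). The function names and the averaging kernel are illustrative assumptions, not taken from the paper.

```python
def jacobi_sweep(u, ghost=1):
    """One Jacobi relaxation sweep over the interior of a 1-D domain.

    u     : list of cell values, including `ghost` halo cells at each end.
            In a distributed run the halo cells mirror neighbor ranks'
            boundary data and are refreshed by a halo exchange.
    ghost : width of the halo region; interior cells are updated, halo
            cells are left untouched.
    """
    new = list(u)
    for i in range(ghost, len(u) - ghost):
        # Each interior cell becomes the average of its two neighbors.
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def sweeps_without_exchange(u, ghost):
    """Take `ghost` sweeps with no communication.

    Each sweep consumes one layer of halo validity, so after `ghost`
    sweeps the deep interior is still correct -- the basic trade-off
    that ghost region expansion exploits: wider halos buy more
    communication-free (or failure-masked) steps.
    """
    for _ in range(ghost):
        u = jacobi_sweep(u, 1)  # the stencil only touches immediate neighbors
    return u
```

A linear profile is a fixed point of this kernel (every cell already equals the average of its neighbors), which makes the sweep easy to sanity-check: `jacobi_sweep([0, 1, 2, 3, 4])` returns the same values.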
