Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping
Manish Parashar | Michael A. Heroux | Keita Teranishi | Jackson Mayo | Jacqueline Chen | Hemanth Kolla | Marc Gamell