An Evaluation of User-Level Failure Mitigation Support in MPI
Thomas Hérault | George Bosilca | Jack J. Dongarra | Aurelien Bouteiller | Joshua Hursey | Wesley Bland