Post-failure recovery of MPI communication capability
[1] J. Duell. The design and implementation of Berkeley Lab's Linux checkpoint/restart, 2005.
[2] Hui Liu,et al. High performance linpack benchmark: a fault tolerant implementation without checkpointing , 2011, ICS '11.
[3] Thomas Hérault,et al. Algorithm-based fault tolerance for dense matrix factorizations , 2012, PPoPP '12.
[4] G.V. Kopcsay,et al. Creating the BlueGene/L supercomputer from low-power SoC ASICs , 2005, ISSCC. 2005 IEEE International Digest of Technical Papers. Solid-State Circuits Conference, 2005.
[5] James H. Laros,et al. Evaluating the viability of process replication reliability for exascale systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[6] Thomas Hérault,et al. Correlated Set Coordination in Fault Tolerant Message Logging Protocols , 2011, Euro-Par.
[7] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[8] Chao Wang,et al. Proactive process-level live migration in HPC environments , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[9] Laxmikant V. Kalé,et al. Proactive Fault Tolerance in MPI Applications Via Task Migration , 2006, HiPC.
[10] Thomas Hérault,et al. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols , 2008, Future Gener. Comput. Syst..
[11] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1994.
[12] Jack Dongarra,et al. Algorithm-based fault tolerance for dense matrix factorizations, 2012.
[13] George Bosilca,et al. Redesigning the message logging model for high performance , 2010, Concurr. Comput. Pract. Exp..
[14] Franck Cappello,et al. HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[15] John Shalf,et al. The International Exascale Software Project roadmap , 2011, Int. J. High Perform. Comput. Appl..
[16] Greg Bronevetsky,et al. Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance , 2011, EuroMPI.
[17] Luís Moura Silva,et al. System-level versus user-defined checkpointing , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).
[18] Bianca Schroeder,et al. Understanding failures in petascale computers, 2007.
[19] Andrew Lumsdaine,et al. The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[20] William Gropp,et al. Fault Tolerance in Message Passing Interface Programs , 2004, Int. J. High Perform. Comput. Appl..
[21] Jason Duell,et al. The design and implementation of Berkeley Lab's Linux checkpoint/restart, 2005.
[22] Sam Toueg,et al. Unreliable failure detectors for reliable distributed systems , 1996, JACM.
[23] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1994.
[24] Thomas Hérault,et al. An Evaluation of User-Level Failure Mitigation Support in MPI , 2012, EuroMPI.
[25] F. Al-Shamali,et al. Author Biographies. , 2015, Journal of social work in disability & rehabilitation.
[26] Wu-chun Feng,et al. Performance Evaluation of the Quadrics Interconnection Network , 2001, IPDPS.
[27] Thomas Hérault,et al. A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI , 2012, Euro-Par.
[28] Franck Cappello,et al. Fault Tolerance in Petascale/ Exascale Systems: Current Knowledge, Challenges and Research Opportunities , 2009, Int. J. High Perform. Comput. Appl..
[29] L. Alvisi,et al. A Survey of Rollback-Recovery Protocols, 2002.
[30] Laxmikant V. Kalé,et al. Team-Based Message Logging: Preliminary Results , 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.
[31] Fabrizio Petrini,et al. Performance Evaluation of the Quadrics Interconnection Network , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.
[32] Jack J. Dongarra,et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.
[33] Lorenzo Alvisi,et al. Message logging: pessimistic, optimistic, and causal , 1995, Proceedings of 15th International Conference on Distributed Computing Systems.
[34] Thomas Hérault,et al. An evaluation of User-Level Failure Mitigation support in MPI , 2012, Computing.