HADAB: Enabling Fault Tolerance in Parallel Applications Running in Distributed Environments
[1] Raja Nassar,et al. Availability modeling and analysis on high performance cluster computing systems , 2006, First International Conference on Availability, Reliability and Security (ARES'06).
[2] Christian Engelmann,et al. Super-Scalable Algorithms for Computing on 100,000 Processors , 2005, International Conference on Computational Science.
[3] E. Hung. ELEC 6062 Project Report Fault Tolerance and Checkpointing Schemes for Clusters of Workstations , 1999 .
[4] Chao Wang,et al. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[5] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..
[6] Jack Dongarra,et al. Building fault survivable MPI programs with FT-MPI using diskless checkpointing , 2005 .
[7] Ahmed Al-Nazer,et al. On Disk-based and Diskless Checkpointing for Parallel and Distributed Systems: An Empirical Analysis , 2005 .
[8] Luís Moura Silva,et al. The performance of coordinated and independent checkpointing , 1999, Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SPDP 1999.
[9] Lui Sha,et al. Process resurrection: a fast recovery mechanism for real-time embedded systems , 2005, 11th IEEE Real Time and Embedded Technology and Applications Symposium.
[10] Christian Engelmann,et al. Development of Naturally Fault Tolerant Algorithms for Computing on 100,000 Processors , 2002 .
[11] Victoria E. Howle,et al. Fault Tolerance in Large-Scale Scientific Computing , 2006, Parallel Processing for Scientific Computing.
[12] Almerico Murli,et al. Monitoring and Migration of a PETSc-based Parallel Application for Medical Imaging in a Grid computing PSE , 2006, Grid-Based Problem Solving Environments.
[13] Sathish S. Vadhiyar,et al. SRS: A Framework for Developing Malleable and Migratable Parallel Applications for Distributed Systems , 2003, Parallel Process. Lett..
[14] George Bosilca,et al. Algorithm-based fault tolerance applied to high performance computing , 2009, J. Parallel Distributed Comput..
[15] George Bosilca,et al. Algorithmic Based Fault Tolerance Applied to High Performance Computing , 2008, ArXiv.
[16] George Bosilca,et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation , 2004, PVM/MPI.