NR-MPI: A Non-stop and Fault Resilient MPI
Xiangke Liao | Yutong Lu | Min Xie | Guang Suo | Hongjia Cao