Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand
Dhabaleswar K. Panda | Wei Huang | Qi Gao | Weikuan Yu
[1] Brian Randell. System structure for software fault tolerance , 1975 .
[2] Brian Randell. System Structure for Software Fault Tolerance , 1975, IEEE Trans. Software Eng..
[3] Yuval Tamir,et al. Error Recovery in Multicomputers Using Global Checkpoints , 1984 .
[4] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[5] Message Passing Interface Forum. MPI: A message-passing interface standard , 1994 .
[6] Georg Stellner,et al. CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.
[7] Miron Livny,et al. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System , 1997 .
[8] Kai Li,et al. CLIP: A Checkpointing Tool for Message Passing Parallel Programs , 1997, ACM/IEEE SC 1997 Conference (SC'97).
[9] Henri E. Bal,et al. User-Level Network Interface Protocols , 1998, Computer.
[10] James Arthur Kohl,et al. HARNESS: a next generation distributed virtual machine , 1999, Future Gener. Comput. Syst..
[11] Roy Friedman,et al. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).
[12] Harrick M. Vin,et al. Egida: an extensible toolkit for low-overhead fault-tolerance , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).
[13] Remzi H. Arpaci-Dusseau,et al. Architectural Requirements and Scalability of the NAS Parallel Benchmarks , 1999, ACM/IEEE SC 1999 Conference (SC'99).
[14] William Gropp,et al. Components and interfaces of a process management system for parallel programs , 2001, Parallel Comput..
[15] L. Alvisi,et al. A Survey of Rollback-Recovery Protocols , 2002 .
[16] Thomas Hérault,et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[17] Dhabaleswar K. Panda,et al. High Performance RDMA-Based MPI Implementation over InfiniBand , 2003, ICS '03.
[18] Dhabaleswar K. Panda,et al. High performance RDMA-based MPI implementation over InfiniBand , 2003, ICS.
[19] B. Bouteiller,et al. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging , 2003, ACM/IEEE SC 2003 Conference (SC'03).
[20] Dhabaleswar K. Panda,et al. Design and implementation of MPICH2 over InfiniBand with RDMA support , 2003, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..
[21] Jack Dongarra,et al. Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems , 2004 .
[22] Eleanor Chu,et al. Minimizing Communication Penalty of Triangular Solvers by Runtime Mesh Configuration and Workload Redistribution , 2004, The Journal of Supercomputing.
[23] Mark A. Taylor,et al. Architecture of LA-MPI, a network-fault-tolerant MPI , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..
[24] Dhabaleswar K. Panda,et al. Fast and Scalable Startup of MPI Programs in InfiniBand Clusters , 2004, HiPC.
[25] Gerrit Groenhof,et al. GROMACS: Fast, flexible, and free , 2005, J. Comput. Chem..
[26] Heon Young Yeom,et al. Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M^3) , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[27] J. Duell. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .
[28] Jason Duell,et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..
[29] Jason Duell,et al. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .
[30] Song Jiang,et al. Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[31] Dhabaleswar K. Panda,et al. Adaptive connection management for scalable MPI over InfiniBand , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.