The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing
Jason Duell | Brian W. Barrett | Sriram Sankaran | Andrew Lumsdaine | Jeffrey M. Squyres | Vishal Sahay | Eric Roman | Paul H. Hargrove
[1] Brian Randell. System structure for software fault tolerance , 1975 .
[2] Brian Randell. System Structure for Software Fault Tolerance , 1975, IEEE Trans. Software Eng..
[3] David L. Russell,et al. State Restoration in Systems of Communicating Processes , 1980, IEEE Transactions on Software Engineering.
[4] Yuval Tamir,et al. Error Recovery in Multicomputers Using Global Checkpoints , 1984 .
[5] Augusto Ciuffoletti,et al. A Distributed Domino-Effect Free Recovery Algorithm , 1984, Symposium on Reliability in Distributed Software and Database Systems.
[6] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[7] Miron Livny,et al. Condor-a hunter of idle workstations , 1988, [1988] Proceedings. The 8th International Conference on Distributed.
[8] Taesoon Park,et al. Checkpointing and rollback-recovery in distributed systems , 1989 .
[9] David H. Bailey,et al. The NAS Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..
[10] Willy Zwaenepoel,et al. The performance of consistent checkpointing , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.
[11] Richard Y. Kain,et al. Rollback Recovery in Distributed Systems Using Loosely Synchronized Clocks , 1992, IEEE Trans. Parallel Distributed Syst..
[12] The MPI Forum. MPI: a message passing interface , 1993, Supercomputing '93.
[13] William Gropp,et al. Using MPI: Portable Parallel Programming with the Message-Passing Interface , 1994 .
[14] MPI Forum. MPI: A Message-Passing Interface , 1994 .
[15] Anthony Skjellum,et al. Extending the message passing interface (MPI) , 1994, Proceedings Scalable Parallel Libraries Conference.
[16] Kai Li,et al. Libckpt: Transparent Checkpointing under UNIX , 1995, USENIX.
[17] W. Kent Fuchs,et al. Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems , 1995, IEEE Trans. Parallel Distributed Syst..
[18] William Gropp,et al. MPI-2: Extending the Message-Passing Interface , 1996, Euro-Par, Vol. I.
[19] Georg Stellner,et al. CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.
[20] Hua Zhong,et al. CRAK: Linux Checkpoint/Restart As a Kernel Module , 1996 .
[21] Jack Dongarra,et al. MPI: The Complete Reference , 1996 .
[22] William Gropp,et al. User's Guide for MPICH, a Portable Implementation of MPI , 1996 .
[23] Anthony Skjellum,et al. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard , 1996, Parallel Comput..
[24] Miron Livny,et al. Managing Checkpoints for Parallel Programs , 1996, JSSPP.
[25] Jyh-Jong Tsay,et al. Checkpointing Message-Passing Interface (MPI) parallel programs , 1997, Proceedings Pacific Rim International Symposium on Fault-Tolerant Systems.
[26] Miron Livny,et al. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System , 1997 .
[27] Kai Li,et al. CLIP: A Checkpointing Tool for Message Passing Parallel Programs , 1997, ACM/IEEE SC 1997 Conference (SC'97).
[28] William Gropp,et al. MPI - The Complete Reference: Volume 2, the MPI Extensions , 1998 .
[29] William Gropp,et al. MPI: The Complete Reference , Vol. 2 - The MPI-2 Extensions , 1998 .
[30] William Gropp. The MPI-2 extensions , 1998 .
[31] William R. Dieter,et al. A user-level checkpointing library for POSIX threads programs , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).
[32] Marvin Solomon,et al. The evolution of Condor checkpointing , 1999 .
[33] Fred Douglis,et al. Mobility: Processes, Computers, and Agents , 1999 .
[34] Leonid Oliker,et al. System Utilization Benchmark on the Cray T3E and IBM SP , 2000, JSSPP.
[35] Jonathan D. Trent,et al. Astrobiology Technology Branch, NASA Ames Research Center, Moffett Field CA , 2000 .
[36] Jack J. Dongarra,et al. HARNESS and fault tolerant MPI , 2001, Parallel Comput..
[37] Carsten Franke,et al. Job Scheduling Strategies for Parallel Processing , 2002, Lecture Notes in Computer Science.
[38] Greg Burns,et al. LAM: An Open Cluster Environment for MPI , 2002 .
[39] David F. Heidel,et al. An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[40] Dhiraj K. Pradhan,et al. Roll-Forward and Rollback Recovery: Performance-Reliability Trade-Off , 1997, IEEE Trans. Computers.
[41] An Overview of the BlueGene/L Supercomputer , 2002 .
[42] Steven J. Deitz,et al. Compiler support for automatic checkpointing , 2002, Proceedings 16th Annual International Symposium on High Performance Computing Systems and Applications.
[43] Brian W. Barrett,et al. The system services interface (SSI) to LAM/MPI , 2003 .
[44] Brian W. Barrett,et al. Request progression interface (RPI) system services interface (SSI) modules for LAM/MPI , 2003 .
[45] Robert B. Ross,et al. Using MPI-2: Advanced Features of the Message Passing Interface , 2003, CLUSTER.
[46] Andrew Lumsdaine,et al. A Component Architecture for LAM/MPI , 2003, PVM/MPI.
[47] George Bosilca,et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation , 2004, PVM/MPI.
[48] J. Duell. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .
[49] Jason Duell,et al. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .