Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M^3)
暂无分享,去创建一个
[1] Nuno Neves,et al. RENEW: a tool for fast and efficient implementation of checkpoint protocols , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).
[2] Harrick M. Vin,et al. Egida: an extensible toolkit for low-overhead fault-tolerance , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).
[3] Ian T. Foster,et al. Condor-G: A Computation Management Agent for Multi-Institutional Grids , 2004, Cluster Computing.
[4] Heon Young Yeom,et al. Design and Implementation of Dynamic Process Management for Grid-Enabled MPICH , 2003, PVM/MPI.
[5] Georg Stellner,et al. CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.
[6] Lorenzo Alvisi. Understanding the message logging paradigm for masking process crashes , 1996 .
[7] Jonathan Robinson,et al. The Hector Distributed Run-Time Environment , 1998, IEEE Trans. Parallel Distributed Syst..
[8] Adrianos Lachanas,et al. MPI-FT: Portable Fault Tolerance Scheme for MPI , 2000, Parallel Process. Lett..
[9] Thomas Hérault,et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[10] Dhabaleswar K. Panda,et al. High Performance RDMA-Based MPI Implementation over InfiniBand , 2003, ICS '03.
[11] Heon Young Yeom,et al. MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes , 2004, IEICE Trans. Inf. Syst..
[12] Anthony Skjellum,et al. MPI/FT/sup TM/: architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing , 2001, Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid.
[13] Roy Friedman,et al. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).
[14] B. Bouteiller,et al. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging , 2003, ACM/IEEE SC 2003 Conference (SC'03).
[15] Jyh-Jong Tsay,et al. Checkpointing Message-Passing Interface (MPI) parallel programs , 1997, Proceedings Pacific Rim International Symposium on Fault-Tolerant Systems.
[16] Dhabaleswar K. Panda,et al. High performance RDMA-based MPI implementation over InfiniBand , 2003, ICS.
[17] Lorenzo Alvisi,et al. Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure, recovery , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.
[18] Jack J. Dongarra,et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.
[19] Heon Y. Yeom,et al. MPICH-GF: Providing Fault Tolerance on Grid Environments , 2003 .
[20] Andrew S. Grimshaw,et al. Integrating fault-tolerance techniques in grid applications , 2000 .