Fault-tolerant solutions for a MPI compute intensive application
[1] Rohit Mathur, et al. The STEM-II regional-scale acid deposition and photochemical oxidant model. III: A study of mesoscale acid deposition in the lower Ohio River Valley, 1989.
[2] Roy Friedman, et al. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations, 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).
[3] Heon Y. Yeom, et al. MPICH-GF: Providing Fault Tolerance on Grid Environments, 2003.
[4] Gabriel Rodríguez, et al. Controller/Precompiler for Portable Checkpointing, 2006, IEICE Trans. Inf. Syst.
[5] Daniel Marques, et al. Automated application-level checkpointing of MPI programs, 2003, PPoPP '03.
[6] Gregory R. Carmichael, et al. The STEM-II regional scale acid deposition and photochemical oxidant model—I. An overview of model development and applications, 1991.
[7] Jonathan Robinson, et al. The Hector Distributed Run-Time Environment, 1998, IEEE Trans. Parallel Distributed Syst.
[8] Adrianos Lachanas, et al. MPI-FT: Portable Fault Tolerance Scheme for MPI, 2000, Parallel Process. Lett.
[9] Thomas Hérault, et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes, 2002, ACM/IEEE SC 2002 Conference (SC'02).
[10] J. Carlos Mouri, et al. High Performance Air Quality Simulation in the European CrossGrid Project, 2006.
[11] Harrick M. Vin, et al. Egida: an extensible toolkit for low-overhead fault-tolerance, 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).
[12] Georg Stellner, et al. CoCheck: checkpointing and process migration for MPI, 1996, Proceedings of International Conference on Parallel Processing.
[13] L. Alvisi, et al. A Survey of Rollback-Recovery Protocols, 2002.
[14] Lorenzo Alvisi, et al. Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure, recovery, 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.
[15] 15th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2007), 7-9 February 2007, Naples, Italy, 2007, PDP.
[16] Lorenzo Alvisi, et al. An analysis of communication induced checkpointing, 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).
[17] Javier D. Bruguera, et al. High performance air pollution modeling for a power plant environment, 2003, Parallel Comput.