A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI
Thomas Hérault | George Bosilca | Jack J. Dongarra | Peng Du | Aurélien Bouteiller | Wesley Bland
[1] John W. Young, et al. A first order approximation to the optimum checkpoint interval, 1974, CACM.
[2] John Shalf, et al. The International Exascale Software Project roadmap, 2011, Int. J. High Perform. Comput. Appl.
[3] Jack J. Dongarra, et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World, 2000, PVM/MPI.
[4] William Gropp, et al. Fault Tolerance in Message Passing Interface Programs, 2004, Int. J. High Perform. Comput. Appl.
[5] Franck Cappello, et al. Toward Exascale Resilience, 2009, Int. J. High Perform. Comput. Appl.
[6] James S. Plank, et al. Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, 2001, J. Parallel Distributed Comput.
[7] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1994.
[8] George Bosilca, et al. Fault tolerant high performance computing by a coding approach, 2005, PPoPP.
[9] Hui Liu, et al. High performance linpack benchmark: a fault tolerant implementation without checkpointing, 2011, ICS '11.
[10] Franklin T. Luk, et al. An Analysis of Algorithm-Based Fault Tolerance Techniques, 1988, J. Parallel Distributed Comput.
[11] Jacob A. Abraham, et al. Algorithm-Based Fault Tolerance for Matrix Operations, 1984, IEEE Transactions on Computers.
[12] Jack Dongarra, et al. ScaLAPACK user's guide, 1997.
[13] Jack Dongarra, et al. Recent Advances in Parallel Virtual Machine and Message Passing Interface, 15th European PVM/MPI Users' Group Meeting, Dublin, Ireland, September 7-10, 2008. Proceedings, 2008, PVM/MPI.
[14] Franck Cappello, et al. Preventive Migration vs. Preventive Checkpointing for Extreme Scale Supercomputers, 2011, Parallel Process. Lett.
[15] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1994.
[16] Erol Gelenbe, et al. On the Optimum Checkpoint Interval, 1979, JACM.
[17] Thomas Hérault, et al. Algorithm-based fault tolerance for dense matrix factorizations, 2012, PPoPP '12.
[18] Bianca Schroeder, et al. Understanding failures in petascale computers, 2007.
[19] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.