Disaster Survival Guide in Petascale Computing: An Algorithmic Approach
Jack Dongarra | George Bosilca | Zizhong Chen | Julien Langou
[1] Kai Li, et al. Diskless Checkpointing, 1998, IEEE Trans. Parallel Distributed Syst.
[2] Luís Moura Silva, et al. An experimental study about diskless checkpointing, 1998, Proceedings. 24th EUROMICRO Conference (Cat. No.98EX204).
[3] George Bosilca, et al. Recovery Patterns for Iterative Methods in a Parallel Unstable Environment, 2007, SIAM J. Sci. Comput.
[4] Erol Gelenbe, et al. On the Optimum Checkpoint Interval, 1979, JACM.
[5] James S. Plank. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems, 1997.
[6] Jack Dongarra, et al. Top500 Supercomputer Sites - 13th edition, 1998.
[7] Zizhong Chen, et al. Self-adapting software for numerical linear algebra and LAPACK for clusters, 2003, Parallel Comput.
[8] David F. Heidel, et al. An Overview of the BlueGene/L Supercomputer, 2002, ACM/IEEE SC 2002 Conference (SC'02).
[9] Message Passing Interface Forum. MPI: A message-passing interface standard, 1994.
[10] Kai Li, et al. Faster checkpointing with N+1 parity, 1994, Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing.
[11] Jack Dongarra, et al. Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems, 2004.
[12] Ian Foster, et al. The Grid 2 - Blueprint for a New Computing Infrastructure, Second Edition, 1998, The Grid 2, 2nd Edition.
[13] Christian Engelmann, et al. Development of Naturally Fault Tolerant Algorithms for Computing on 100,000 Processors, 2002.
[14] John W. Young, et al. A first order approximation to the optimum checkpoint interval, 1974, CACM.
[15] Anthony Skjellum, et al. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard, 1996, Parallel Comput.
[16] Zizhong Chen, et al. Condition Numbers of Gaussian Random Matrices, 2005, SIAM J. Matrix Anal. Appl.
[17] James S. Plank, et al. Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, 2001, J. Parallel Distributed Comput.
[18] Ami Marowka, et al. The GRID: Blueprint for a New Computing Infrastructure, 2000, Parallel Distributed Comput. Pract.
[19] Jack J. Dongarra, et al. Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing, 1997, J. Parallel Distributed Comput.
[20] Nitin H. Vaidya, et al. A Case for Two-Level Recovery Schemes, 1998, IEEE Trans. Computers.
[21] Tzi-cker Chiueh, et al. Evaluation of checkpoint mechanisms for massively parallel machines, 1996, Proceedings of Annual Symposium on Fault Tolerant Computing.
[22] A. Edelman. Eigenvalues and condition numbers of random matrices, 1988.
[23] Jack J. Dongarra, et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World, 2000, PVM/MPI.
[24] Richard Barrett, et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 1994, Other Titles in Applied Mathematics.
[25] Jack Dongarra, et al. Fault-tolerant matrix operations for parallel and distributed systems, 1996.
[26] Christian Engelmann, et al. Super-Scalable Algorithms for Computing on 100,000 Processors, 2005, International Conference on Computational Science.
[27] Zizhong Chen, et al. Process Fault Tolerance: Semantics, Design and Applications for High Performance Computing, 2005, Int. J. High Perform. Comput. Appl.