Scalable techniques for fault tolerant high performance computing
[1] Suku Nair,et al. Efficient Techniques for the Analysis of Algorithm-Based Fault Tolerance (ABFT) Schemes , 1996, IEEE Trans. Computers.
[2] Luís Moura Silva,et al. An experimental study about diskless checkpointing , 1998, Proceedings. 24th EUROMICRO Conference (Cat. No.98EX204).
[3] Daniel Marques,et al. Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs , 2004, Proceedings of the ACM/IEEE SC2004 Conference.
[4] Suku Nair,et al. Real-Number Codes for Fault-Tolerant Matrix Operations On Processor Arrays , 1990, IEEE Trans. Computers.
[5] Ian Foster,et al. The Globus toolkit , 1998 .
[6] Christian Engelmann,et al. Development of Naturally Fault Tolerant Algorithms for Computing on 100,000 Processors , 2002 .
[7] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[8] Micah Beck,et al. Compiler-Assisted Memory Exclusion for Fast Checkpointing , 1995 .
[9] Jack J. Dongarra,et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.
[10] James S. Plank,et al. Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems , 2001, J. Parallel Distributed Comput..
[11] J.M.N. Vieira,et al. Stable DFT codes and frames , 2003, IEEE Signal Processing Letters.
[12] Erol Gelenbe,et al. On the Optimum Checkpoint Interval , 1979, JACM.
[13] Nitin H. Vaidya,et al. A Case for Two-Level Recovery Schemes , 1998, IEEE Trans. Computers.
[14] Ian Foster,et al. The Grid 2 - Blueprint for a New Computing Infrastructure, Second Edition , 1998, The Grid 2, 2nd Edition.
[15] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..
[16] P.J.S.G. Ferreira. Stability issues in error control coding in the complex field, interpolation, and frame bounds , 2000, IEEE Signal Processing Letters.
[17] Jack Dongarra,et al. Top500 Supercomputer Sites - 13th edition , 1998 .
[18] Tzi-cker Chiueh,et al. Evaluation of checkpoint mechanisms for massively parallel machines , 1996, Proceedings of Annual Symposium on Fault Tolerant Computing.
[19] A. James. Distributions of Matrix Variates and Latent Roots Derived from Normal Samples , 1964 .
[20] Kai Li,et al. Faster checkpointing with N+1 parity , 1994, Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing.
[21] Jack Dongarra,et al. Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems , 2004 .
[22] Alan Edelman,et al. Tails of Condition Number Distributions , 2005, SIAM J. Matrix Anal. Appl..
[23] Vaidy S. Sunderam,et al. PVM: A Framework for Parallel Distributed Computing , 1990, Concurr. Pract. Exp..
[24] Gene H. Golub,et al. Matrix computations , 1983 .
[25] Jean-Marc Azaïs,et al. Upper and Lower Bounds for the Tails of the Distribution of the Condition Number of a Gaussian Matrix , 2005, SIAM J. Matrix Anal. Appl..
[26] Lorenzo Alvisi,et al. Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure, recovery , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.
[27] George Bosilca,et al. Fault tolerant high performance computing by a coding approach , 2005, PPoPP.
[28] James Demmel,et al. ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance , 1995, PARA.
[29] Ronald L. Graham,et al. Concrete Mathematics, a Foundation for Computer Science , 1991, The Mathematical Gazette.
[30] Ami Marowka,et al. The GRID: Blueprint for a New Computing Infrastructure , 2000, Parallel Distributed Comput. Pract..
[31] Jack J. Dongarra,et al. Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing , 1997, J. Parallel Distributed Comput..
[32] George Bosilca,et al. Recovery Patterns for Iterative Methods in a Parallel Unstable Environment , 2007, SIAM J. Sci. Comput..
[33] David F. Heidel,et al. An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[34] Zizhong Chen,et al. Self-adapting software for numerical linear algebra and LAPACK for clusters , 2003, Parallel Comput..
[35] Erik Seligman,et al. Application Level Fault Tolerance in Heterogeneous Networks of Workstations , 1997, J. Parallel Distributed Comput..
[36] Richard Barrett,et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods , 1994, Other Titles in Applied Mathematics.
[37] James S. Plank,et al. Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques , 1996, Proceedings 15th Symposium on Reliable Distributed Systems.
[38] Stanislaw J. Szarek,et al. Condition numbers of random matrices , 1991, J. Complex..
[39] Richard Koo,et al. Checkpointing and Rollback-Recovery for Distributed Systems , 1986, IEEE Transactions on Software Engineering.
[40] Franklin T. Luk,et al. Algorithmic Fault Tolerance Using the Lanczos Method , 1992, SIAM J. Matrix Anal. Appl..
[41] S. Smale. On the efficiency of algorithms of analysis , 1985 .
[42] Anthony Skjellum,et al. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard , 1996, Parallel Comput..
[43] P. D. Hough,et al. Algorithm-dependent fault tolerance for distributed computing , 2000 .
[44] Jack Dongarra,et al. Fault-tolerant matrix operations for parallel and distributed systems , 1996 .
[45] Christoforos N. Hadjicostis,et al. Coding approaches to fault tolerance in linear dynamic systems , 2005, IEEE Transactions on Information Theory.
[46] James S. Plank. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems , 1997 .
[47] George Bosilca,et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation , 2004, PVM/MPI.
[48] David B. Johnson,et al. Distributed system fault tolerance using message logging and checkpointing , 1990 .
[49] A. Edelman. On the distribution of a scaled condition number , 1992 .
[50] James S. Plank. Efficient checkpointing on MIMD architectures , 1993 .
[51] Zizhong Chen,et al. Process Fault Tolerance: Semantics, Design and Applications for High Performance Computing , 2005, Int. J. High Perform. Comput. Appl..
[52] Werner Henkel. Multiple Error Correction with Analog Codes , 1988, AAECC.
[53] Franklin T. Luk,et al. An Analysis of Algorithm-Based Fault Tolerance Techniques , 1988, J. Parallel Distributed Comput..
[54] Daniel Marques,et al. C3: A System for Automating Application-Level Checkpointing of MPI Programs , 2003, LCPC.
[55] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[56] A. Edelman. Eigenvalues and condition numbers of random matrices , 1988 .
[57] Sathish S. Vadhiyar,et al. SRS: A Framework for Developing Malleable and Migratable Parallel Applications for Distributed Systems , 2003, Parallel Process. Lett..
[58] Peter Sanders,et al. A bandwidth latency tradeoff for broadcast and reduction , 2003, Inf. Process. Lett..