FTPA: Supporting Fault-Tolerant Parallel Computing through Parallel Recomputing
Xuejun Yang | Jia Jia | Yunfei Du | Hongyi Fu | Panfeng Wang
[1] Adam Beguelin,et al. Application Level Fault Tolerance in Heterogeneous Networks of Workstations , 1997 .
[2] George Bosilca,et al. Fault tolerant high performance computing by a coding approach , 2005, PPoPP.
[3] Jack J. Dongarra,et al. Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing , 1997, J. Parallel Distributed Comput..
[4] Xuejun Yang,et al. The Fault Tolerant Parallel Algorithm: the Parallel Recomputing Based Failure Recovery , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).
[5] George Bosilca,et al. Recovery Patterns for Iterative Methods in a Parallel Unstable Environment , 2007, SIAM J. Sci. Comput..
[6] Thomas Hérault,et al. MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI , 2006, Int. J. High Perform. Comput. Appl..
[7] George Bosilca,et al. Network Fault Tolerance in Open MPI , 2007, Euro-Par.
[8] L.M. Ni,et al. Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers , 1993, IEEE Trans. Parallel Distributed Syst..
[9] Charng-Da Lu,et al. Reliability challenges in large systems , 2006, Future Gener. Comput. Syst..
[10] Tamara G. Kolda,et al. Asynchronous Parallel Pattern Search for Nonlinear Optimization , 2001, SIAM J. Sci. Comput..
[11] Daniel Marques,et al. Automated application-level checkpointing of MPI programs , 2003, PPoPP '03.
[12] Ronald Minnich,et al. A Network-Failure-Tolerant Message-Passing System for Terascale Clusters , 2002, ICS '02.
[13] Mary Lou Soffa,et al. Efficient computation of interprocedural definition-use chains , 1994, TOPL.
[14] E. N. Elnozahy,et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.
[15] Marina C. Chen,et al. Generating explicit communication from shared-memory program references , 1990, Proceedings SUPERCOMPUTING '90.
[16] Thomas Hérault,et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[17] Xuejun Yang,et al. Compiler-Assisted Application-Level Checkpointing for MPI Programs , 2008, 2008 The 28th International Conference on Distributed Computing Systems.
[18] Laxmikant V. Kalé,et al. A Fault Tolerance Protocol with Fast Fault Recovery , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[19] L. Alvisi,et al. A Survey of Rollback-Recovery Protocols , 2002 .
[20] Alfred V. Aho,et al. Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.
[21] Christian Engelmann,et al. Super-Scalable Algorithms for Computing on 100,000 Processors , 2005, International Conference on Computational Science.
[22] Sung-Eun Choi,et al. Compiler-generated staggered checkpointing , 2004 .
[23] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..
[24] Paul D. Hovland,et al. Data-Flow Analysis for MPI Programs , 2006, 2006 International Conference on Parallel Processing (ICPP'06).