Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities
[1] Edsger W. Dijkstra,et al. Self-stabilizing systems in spite of distributed control , 1974, CACM.
[2] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[3] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[4] Georg Stellner,et al. CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.
[5] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst.
[6] Kai Li,et al. Memory Exclusion: Optimizing the Performance of Checkpointing Systems , 1999, Softw. Pract. Exp.
[7] Christian Engelmann,et al. Development of Naturally Fault Tolerant Algorithms for Computing on 100,000 Processors , 2002 .
[8] L. Alvisi,et al. A Survey of Rollback-Recovery Protocols , 2002 .
[9] R. Vilalta,et al. Providing Persistent and Consistent Resources through Event Log Analysis and Predictions for Large-scale Computing Systems , 2002 .
[10] Thomas Hérault,et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[11] Daniel Marques,et al. C3: A System for Automating Application-Level Checkpointing of MPI Programs , 2003, LCPC.
[12] G. R. Liu,et al. Mesh Free Methods: Moving beyond the Finite Element Method , 2003 .
[13] Charng-da Lu,et al. Scalable Diskless Checkpointing for Large Parallel Systems , 2005 .
[14] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[15] Stéphane Genaud,et al. P2P-MPI: A Peer-to-Peer Framework for Robust Execution of Message Passing Parallel Programs on Grids , 2007, Journal of Grid Computing.
[16] Laxmikant V. Kalé,et al. Proactive Fault Tolerance in MPI Applications Via Task Migration , 2006, HiPC.
[17] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[18] Bianca Schroeder,et al. Understanding failures in petascale computers , 2007 .
[19] Zhiling Lan,et al. Fault-Driven Re-Scheduling For Improving System-level Fault Resilience , 2007, 2007 International Conference on Parallel Processing (ICPP 2007).
[20] Zizhong Chen. Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[21] F. Mueller,et al. Proactive process-level live migration in HPC environments , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[22] Fatiha Bouabache,et al. Hierarchical Replication Techniques to Ensure Checkpoint Storage Reliability in Grid Environment , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[23] Chao Wang,et al. A tunable holistic resiliency approach for high-performance computing systems , 2009, PPoPP '09.
[24] George Bosilca,et al. Redesigning the message logging model for high performance , 2010, Concurr. Comput. Pract. Exp.