ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability
[1] Laxmikant V. Kalé, et al. Overcoming scaling challenges in biomolecular simulations across multiple platforms, 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[2] Nithin Nakka, et al. Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs, 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[3] Fabrizio Petrini, et al. System-level fault-tolerance in large-scale parallel machines with buffered coscheduling, 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[4] Robert Jacob, et al. Toward an ultra-high resolution community climate system model for the BlueGene platform, 2007.
[5] Leonid Oliker, et al. Scientific Application Performance on Candidate PetaScale Platforms, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[6] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2006, IEEE Transactions on Dependable and Secure Computing.
[7] Franck Cappello, et al. Checkpointing vs. Migration for Post-Petascale Supercomputers, 2010, 2010 39th International Conference on Parallel Processing.
[8] Rajeev Thakur, et al. A Meta-Learning Failure Predictor for Blue Gene/L Systems, 2007, 2007 International Conference on Parallel Processing (ICPP 2007).
[9] Laxmikant V. Kalé, et al. Architectural Constraints to Attain 1 Exaflop/s for Three Scientific Application Classes, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[10] Chao Wang, et al. Proactive process-level live migration in HPC environments, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[11] David F. Heidel, et al. An Overview of the BlueGene/L Supercomputer, 2002, ACM/IEEE SC 2002 Conference (SC'02).
[12] Becky Verastegui, et al. Proceedings of the 2007 ACM/IEEE conference on Supercomputing, 2007, HiPC 2007.
[13] Sathish S. Vadhiyar, et al. SRS: A Framework for Developing Malleable and Migratable Parallel Applications for Distributed Systems, 2003, Parallel Process. Lett.
[14] Zhiling Lan, et al. Adaptive Fault Management of Parallel Applications for High-Performance Computing, 2008, IEEE Transactions on Computers.
[15] Laxmikant V. Kalé, et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[16] James S. Plank, et al. An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance, 1997.
[17] Gene Cooperman, et al. DMTCP: Transparent checkpointing for cluster computations and the desktop, 2007, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[18] Nicholas J. Wright, et al. WRF nature run, 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[20] L. Kalé, et al. Towards Petascale Cosmological Simulations with ChaNGa, 2007.