An Application Level Approach for Proactive Process Migration in MPI Applications
Gabriel Rodríguez | Patricia González | María J. Martín | Iván Cores
[1] Roy Friedman,et al. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).
[2] Alexandru Iosup,et al. On the dynamic resource availability in grids , 2007, 2007 8th IEEE/ACM International Conference on Grid Computing.
[3] L. Alvisi,et al. A Survey of Rollback-Recovery Protocols , 2002.
[4] Gabriel Rodríguez,et al. A Heuristic Approach for the Automatic Insertion of Checkpoints in Message-Passing Codes , 2009, J. Univers. Comput. Sci..
[5] Rajendra Singh,et al. Performance Driven Partial Checkpoint/Migrate for LAM-MPI , 2008, 2008 22nd International Symposium on High Performance Computing Systems and Applications.
[6] Franck Cappello,et al. Checkpointing vs. Migration for Post-Petascale Supercomputers , 2010, 2010 39th International Conference on Parallel Processing.
[7] Daniel Marques,et al. Automated application-level checkpointing of MPI programs , 2003, PPoPP '03.
[8] Laxmikant V. Kalé,et al. Proactive Fault Tolerance in MPI Applications Via Task Migration , 2006, HiPC.
[9] John Paul Walters,et al. Application-Level Checkpointing Techniques for Parallel Programs , 2006, ICDCIT.
[10] E. N. Elnozahy,et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.
[11] Anand Sivasubramaniam,et al. Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.
[12] Kai Li,et al. CLIP: A Checkpointing Tool for Message Passing Parallel Programs , 1997, ACM/IEEE SC 1997 Conference (SC'97).
[13] Christian Engelmann,et al. Proactive Fault Tolerance Using Preemptive Migration , 2009, 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing.
[14] B. Bouteiller,et al. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging , 2003, ACM/IEEE SC 2003 Conference (SC'03).
[15] Anand Sivasubramaniam,et al. Fault-aware job scheduling for BlueGene/L systems , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[16] Gabriel Rodríguez,et al. CPPC: a compiler-assisted tool for portable checkpointing of message-passing applications , 2010.
[17] Erich Strohmaier,et al. Linearly scaling 3D fragment method for large-scale electronic structure calculations , 2008, HiPC 2008.
[18] Christian Engelmann,et al. Proactive fault tolerance for HPC with Xen virtualization , 2007, ICS '07.
[19] Andrew Lumsdaine,et al. The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[20] Cong Du,et al. MPI-Mitten: Enabling Migration Technology in MPI , 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).
[21] Christian Engelmann,et al. Proactive process-level live migration in HPC environments , 2008, HiPC 2008.
[22] Gabriel Rodríguez,et al. CPPC: a compiler‐assisted tool for portable checkpointing of message‐passing applications , 2010, Concurr. Comput. Pract. Exp..
[23] Hui Xiong,et al. Failure Prediction in IBM BlueGene/L Event Logs , 2007, ICDM.