Proactive process-level live migration in HPC environments
[1] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[2] Kai Li,et al. Libckpt: Transparent Checkpointing under UNIX , 1995, USENIX.
[3] Christian Engelmann,et al. Proactive fault tolerance for HPC with Xen virtualization , 2007, ICS '07.
[4] Chokchai Leangsuksun,et al. On the Survivability of Standard MPI Applications , 2006 .
[5] Eleanor Chu,et al. Minimizing Communication Penalty of Triangular Solvers by Runtime Mesh Configuration and Workload Redistribution , 2004, The Journal of Supercomputing.
[6] Willy Zwaenepoel,et al. Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit , 1992, IEEE Trans. Computers.
[7] Anand Sivasubramaniam,et al. Fault-aware job scheduling for BlueGene/L systems , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[8] Rajeev Thakur,et al. A Meta-Learning Failure Predictor for Blue Gene/L Systems , 2007, 2007 International Conference on Parallel Processing (ICPP 2007).
[9] Song Jiang,et al. Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[10] Andrew Lumsdaine,et al. A Component Architecture for LAM/MPI , 2003, PVM/MPI.
[11] Remzi H. Arpaci-Dusseau,et al. Architectural Requirements and Scalability of the NAS Parallel Benchmarks , 1999, ACM/IEEE SC 1999 Conference (SC'99).
[12] Zhiling Lan,et al. Fault-Driven Re-Scheduling For Improving System-level Fault Resilience , 2007, 2007 International Conference on Parallel Processing (ICPP 2007).
[13] Anand Sivasubramaniam,et al. Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.
[14] Barton P. Miller,et al. Process migration in DEMOS/MP , 1983, SOSP '83.
[15] Jason Duell,et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..
[16] Cong Du,et al. HPCM: a pre-compiler aided middleware for the mobility of legacy code , 2003, 2003 Proceedings IEEE International Conference on Cluster Computing.
[17] Cong Du,et al. MPI-Mitten: Enabling Migration Technology in MPI , 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).
[18] Thomas Hérault,et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[19] Howard Gobioff,et al. The Google file system , 2003, SOSP '03.
[20] Andrew Warfield,et al. Live migration of virtual machines , 2005, NSDI.
[21] Laxmikant V. Kale,et al. Proactive Fault Tolerance in Large Systems , 2004 .
[22] Heather M. Quinn,et al. Terrestrial-based radiation upsets: a cautionary tale , 2005, 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'05).
[23] Dejan S. Milojicic,et al. Process migration , 1999, ACM Comput. Surv..
[24] Cong Du,et al. Dynamic Scheduling with Process Migration , 2007, Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07).
[25] Amnon Barak,et al. MOSIX: an integrated multiprocessor UNIX , 1999 .
[26] Christian Engelmann,et al. A Framework for Proactive Fault Tolerance , 2008, 2008 Third International Conference on Availability, Reliability and Security.
[27] Miron Livny,et al. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System , 1997 .
[28] Fred Douglis,et al. Transparent process migration: Design alternatives and the sprite implementation , 1991, Softw. Pract. Exp..
[29] Raja Nassar,et al. Availability modeling and analysis on high performance cluster computing systems , 2006, First International Conference on Availability, Reliability and Security (ARES'06).
[30] Alan L. Cox,et al. Optimizing network virtualization in Xen , 2006 .
[31] Dhabaleswar K. Panda,et al. High Performance VMM-Bypass I/O in Virtual Machines , 2006, USENIX Annual Technical Conference, General Track.
[32] Chao Wang,et al. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[33] Mark A. Taylor,et al. Architecture of LA-MPI, a network-fault-tolerant MPI , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[34] Christine Morin,et al. Ghost Process: a Sound Basis to Implement Process Duplication, Migration and Checkpoint/Restart in Linux Clusters , 2005, The 4th International Symposium on Parallel and Distributed Computing (ISPDC'05).
[35] Minyi Guo,et al. Process migration for MPI applications based on coordinated checkpoint , 2005, 11th International Conference on Parallel and Distributed Systems (ICPADS'05).
[36] Jack J. Dongarra,et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.
[37] Andrew P. Black,et al. Fine-grained mobility in the Emerald system , 1987, TOCS.
[38] Andrew Lumsdaine,et al. The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[39] Jason Duell,et al. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .
[40] Laxmikant V. Kalé,et al. A Fault Tolerance Protocol with Fast Fault Recovery , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[41] Stephen L. Scott,et al. Evaluation of fault-tolerant policies using simulation , 2007, 2007 IEEE International Conference on Cluster Computing.
[42] Philip S. Yu,et al. Toward Predictive Failure Management for Distributed Stream Processing Systems , 2008, 2008 The 28th International Conference on Distributed Computing Systems.
[43] M. Litzkow. Remote Unix: Turning Idle Workstations into Cycle Servers , 1992 .
[44] Keith A. Lantz,et al. Preemptable remote execution facilities for the V-system , 1985, SOSP 1985.
[45] Laxmikant V. Kalé,et al. Proactive Fault Tolerance in MPI Applications Via Task Migration , 2006, HiPC.
[46] Wu-chun Feng,et al. A Power-Aware Run-Time System for High-Performance Computing , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[47] Georg Stellner,et al. CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.
[48] Heinz W. Schmidt,et al. An agent oriented proactive fault-tolerant framework for grid computing , 2005, First International Conference on e-Science and Grid Computing (e-Science'05).