Paper information - Proactive process-level live migration in HPC environments

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive FT with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1 to 6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13 to 24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
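The closing claim about halving the number of checkpoints can be sanity-checked with a short calculation. The sketch below assumes Young's first-order approximation for the optimum checkpoint interval, sqrt(2 * C * MTBF), and further assumes that handling a fraction p of faults proactively stretches the effective mean time between failures seen by the reactive scheme by 1 / (1 - p). The abstract does not state that the figure was derived this way. The function names and the numeric inputs are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order optimum checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_ratio(proactive_fraction):
    """Checkpoint count with proactive FT divided by the count without it.

    Assumption: only the faults that are NOT handled proactively still force
    a rollback, so the effective MTBF grows by 1 / (1 - p). The checkpoint
    count over a fixed run is inversely proportional to the interval, which
    gives a ratio of sqrt(1 - p).
    """
    return math.sqrt(1.0 - proactive_fraction)


# Hypothetical inputs, not taken from the paper.
checkpoint_cost_s = 60.0      # time to write one checkpoint
mtbf_s = 12 * 3600.0          # system mean time between failures
p = 0.7                       # fraction of faults handled proactively

base = young_interval(checkpoint_cost_s, mtbf_s)
improved = young_interval(checkpoint_cost_s, mtbf_s / (1.0 - p))

# Both expressions give the same ratio, independent of C and MTBF.
print(round(base / improved, 3), round(checkpoint_ratio(p), 3))
```

Under these assumptions the ratio is about 0.55 for p = 0.7, regardless of the checkpoint cost or the failure rate, which is consistent with "nearly cutting the number of checkpoints in half".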

Christian Engelmann | Frank Mueller | Chao Wang | Stephen L. Scott
