Proactive process-level live migration and back migration in HPC environments

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale because of its massive I/O requirements, and it relies on manual job resubmission. This work complements reactive FT with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism lets applications continue executing during much of the process migration. This scheme is integrated into an MPI execution environment to transparently withstand health-related node failures, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 s of advance warning is required to successfully trigger live process migration, while comparable operating system virtualization mechanisms require 13-24 s. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively. The work also provides a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks. Experiments indicate that the larger the amount of outstanding execution, the greater the benefit of back migration.
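
The statement that checkpoints are nearly halved when 70% of faults are handled proactively is consistent with a simple back-of-the-envelope model. The sketch below assumes Young's first-order approximation for the checkpoint interval, T = sqrt(2 * C * M), and further assumes that proactively handled faults no longer count toward the failure rate seen by the reactive scheme. The checkpoint cost and MTBF values are hypothetical placeholders, and this is not necessarily the exact model used in the work.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimum checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_count_ratio(proactive_fraction: float,
                           checkpoint_cost_s: float,
                           mtbf_s: float) -> float:
    """Ratio of checkpoints taken with proactive FT to checkpoints without it.

    Assumption: a fraction p of faults is avoided by migration, so the
    reactive scheme sees an effective MTBF of mtbf / (1 - p). For a run of
    fixed length, the number of checkpoints is inversely proportional to
    the checkpoint interval.
    """
    baseline = young_interval(checkpoint_cost_s, mtbf_s)
    effective_mtbf = mtbf_s / (1.0 - proactive_fraction)
    with_proactive = young_interval(checkpoint_cost_s, effective_mtbf)
    return baseline / with_proactive


# Hypothetical inputs: 60 s per checkpoint, 6 h system MTBF.
ratio = checkpoint_count_ratio(0.70, checkpoint_cost_s=60.0, mtbf_s=6 * 3600.0)
print(f"{ratio:.3f}")  # prints 0.548
```

Under these assumptions the ratio reduces to sqrt(1 - p) and does not depend on the chosen checkpoint cost or MTBF, so handling 70% of faults proactively leaves roughly 55% of the original checkpoints.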
