Checkpoint-restart for a network of virtual machines

The ability to easily deploy parallel computations on the Cloud is becoming ever more important. The first uniform mechanism for checkpointing a network of virtual machines is described. This is important for the parallel versions of common productivity software. Potential examples of parallelism include Simulink for MATLAB, parallel R for the R statistical modelling language, parallel blast.py for the BLAST bioinformatics software, IPython.parallel for Python, and GNU parallel for parallel shells. The checkpoint mechanism is implemented as a plugin in the DMTCP checkpoint-restart package. It operates on KVM/QEMU, and has also been adapted to Lguest and pure user-space QEMU. The plugin is surprisingly compact, comprising just 400 lines of code to checkpoint a single virtual machine, and 200 lines of code for a plugin to support saving and restoring network state. Incremental checkpoints of the associated virtual filesystem are accommodated through the Btrfs filesystem. Experiments demonstrate checkpoint times of a fraction of a second by using forked checkpointing, mmap-based restart, and incremental Btrfs-based snapshots.

[1]  Josef Bacik,et al.  BTRFS: The Linux B-Tree Filesystem , 2013, TOS.

[2]  Brian E. Granger,et al.  IPython: A System for Interactive Scientific Computing , 2007, Computing in Science & Engineering.

[3]  Thomas Hérault,et al.  MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[4]  Jason Duell,et al.  Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters , 2006 .

[5]  Jason Duell,et al.  The Lam/Mpi Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..

[6]  Gene Cooperman,et al.  DMTCP: Transparent checkpointing for cluster computations and the desktop , 2007, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[7]  Daisuke Takahashi,et al.  The HPC Challenge (HPCC) benchmark suite , 2006, SC.

[8]  Kasidit Chanchio,et al.  Thread-Based Live Checkpointing of Virtual Machines , 2011, 2011 IEEE 10th International Symposium on Network Computing and Applications.

[9]  Gabriel Antoniu,et al.  BlobSeer: Next-generation data management for large scale infrastructures , 2011, J. Parallel Distributed Comput..

[10]  Franck Cappello,et al.  BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[11]  Na Li,et al.  Snow: A Parallel Computing Framework for the R System , 2009, International Journal of Parallel Programming.

[12]  Gene Cooperman,et al.  A Generic Checkpoint-Restart Mechanism for Virtual Machines , 2012, ArXiv.

[13]  Geoffroy Vallée,et al.  Checkpoint/Restart of Virtual Machines Based on Xen , 2006 .

[14]  Mike Hibler,et al.  Transparent checkpoints of closed distributed systems in Emulab , 2009, EuroSys '09.

[15]  П. Довгалюк,et al.  Два способа организации механизма полносистемного детерминированного воспроизведения в симуляторе QEMU , 2012 .

[16]  Dutch T. Meyer,et al.  Remus: High Availability via Asynchronous Virtual Machine Replication. (Best Paper) , 2008, NSDI.

[17]  Andrew Lumsdaine,et al.  The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[18]  Ravishankar K. Iyer,et al.  Checkpointing virtual machines against transient errors , 2010, 2010 IEEE 16th International On-Line Testing Symposium.