Paper information - Checkpoint-restart for a network of virtual machines

Checkpoint-restart for a network of virtual machines

The ability to easily deploy parallel computations on the Cloud is becoming ever more important. The first uniform mechanism for checkpointing a network of virtual machines is described. This is important for the parallel versions of common productivity software. Potential examples of parallelism include Simulink for MATLAB, parallel R for the R statistical modelling language, parallel blast.py for the BLAST bioinformatics software, IPython.parallel for Python, and GNU parallel for parallel shells. The checkpoint mechanism is implemented as a plugin in the DMTCP checkpoint-restart package. It operates on KVM/QEMU, and has also been adapted to Lguest and pure user-space QEMU. The plugin is surprisingly compact, comprising just 400 lines of code to checkpoint a single virtual machine, and 200 lines of code for a plugin to support saving and restoring network state. Incremental checkpoints of the associated virtual filesystem are accommodated through the Btrfs filesystem. Experiments demonstrate checkpoint times of a fraction of a second by using forked checkpointing, mmap-based restart, and incremental Btrfs-based snapshots.
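The abstract names two techniques, forked checkpointing and mmap-based restart, that can be illustrated at toy scale. The sketch below is not the DMTCP plugin's code (DMTCP operates on whole processes, not Python objects); it is a minimal analogy assuming a POSIX system with `os.fork`, and the names `forked_checkpoint` and `mmap_restart` are hypothetical. The parent forks a child that writes a copy-on-write snapshot of the state to disk while the parent keeps running, and restart maps the saved image read-only instead of reading it through ordinary buffered I/O.

```python
import mmap
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Fork a child that writes `state` to `path`; return the child's pid.

    Hypothetical helper. The child holds a copy-on-write snapshot of the
    parent's memory as of the fork, so the parent may keep mutating `state`
    while the child is still writing.
    """
    pid = os.fork()
    if pid == 0:
        # Child: serialize the snapshot, then exit without parent cleanup.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)  # atomic rename: readers never see a partial image
            status = 0
        finally:
            os._exit(status)
    return pid


def mmap_restart(path):
    """Map the checkpoint image read-only and rebuild the state from it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as image:
            return pickle.loads(image)


if __name__ == "__main__":
    state = {"step": 1, "data": list(range(5))}
    path = os.path.join(tempfile.mkdtemp(), "ckpt.img")
    pid = forked_checkpoint(state, path)
    state["step"] = 2       # the parent continues; the snapshot is unaffected
    os.waitpid(pid, 0)      # wait for the child to finish writing
    print(mmap_restart(path))
```

In this sketch the parent is paused only for the duration of the `fork` call, which is the general reason forked checkpointing can keep the visible checkpoint time short; the disk write proceeds in the child in the background.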

Zhengping Jin | Gene Cooperman | Rohan Garg | Komal Sodha

[1] Josef Bacik, et al. BTRFS: The Linux B-Tree Filesystem, 2013, TOS.

[2] Brian E. Granger, et al. IPython: A System for Interactive Scientific Computing, 2007, Computing in Science & Engineering.

[3] Thomas Hérault, et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes, 2002, ACM/IEEE SC 2002 Conference (SC'02).

[4] Jason Duell, et al. Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters, 2006.

[5] Jason Duell, et al. The Lam/Mpi Checkpoint/Restart Framework: System-Initiated Checkpointing, 2005, Int. J. High Perform. Comput. Appl.

[6] Gene Cooperman, et al. DMTCP: Transparent checkpointing for cluster computations and the desktop, 2007, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[7] Daisuke Takahashi, et al. The HPC Challenge (HPCC) benchmark suite, 2006, SC.

[8] Kasidit Chanchio, et al. Thread-Based Live Checkpointing of Virtual Machines, 2011, 2011 IEEE 10th International Symposium on Network Computing and Applications.

[9] Gabriel Antoniu, et al. BlobSeer: Next-generation data management for large scale infrastructures, 2011, J. Parallel Distributed Comput.

[10] Franck Cappello, et al. BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[11] Na Li, et al. Snow: A Parallel Computing Framework for the R System, 2009, International Journal of Parallel Programming.

[12] Gene Cooperman, et al. A Generic Checkpoint-Restart Mechanism for Virtual Machines, 2012, ArXiv.

[13] Geoffroy Vallée, et al. Checkpoint/Restart of Virtual Machines Based on Xen, 2006.

[14] Mike Hibler, et al. Transparent checkpoints of closed distributed systems in Emulab, 2009, EuroSys '09.

[15] P. Dovgalyuk, et al. Two ways of organizing a full-system deterministic replay mechanism in the QEMU simulator, 2012.

[16] Dutch T. Meyer, et al. Remus: High Availability via Asynchronous Virtual Machine Replication, 2008, NSDI.

[17] Andrew Lumsdaine, et al. The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[18] Ravishankar K. Iyer, et al. Checkpointing virtual machines against transient errors, 2010, 2010 IEEE 16th International On-Line Testing Symposium.