Thread-Based Live Checkpointing of Virtual Machines

Virtual machine checkpointing is the mechanism that saves virtual machine state to a file for later recovery. Traditional checkpointing mechanisms can suffer long delays and cause long service disruptions, since they must stop the virtual machine while saving its state, which can be large. This study proposes a novel Thread-based Live Checkpointing (TLC) mechanism. TLC builds on the pre-copy live migration mechanism and introduces a checkpoint thread, which is responsible for the majority of the checkpointing activities. While the checkpoint thread saves the virtual machine state to persistent storage, the virtual machine thread continues normal execution. However, the virtual machine thread is periodically interrupted to incrementally copy dirty memory pages to a hash table; these interruptions continue until the final stage of checkpointing is reached. The approach is implemented in KVM and evaluated using the NAS Parallel Benchmarks. Experiments show that it can provide high levels of virtual machine responsiveness during checkpointing. It can also reduce checkpointing overhead to as low as 0.53 times that of the traditional approach when the virtual machine runs memory-intensive workloads.
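The interplay between the two threads can be illustrated with a toy simulation. The sketch below is not the KVM implementation: ToyVM, run_slice, and tlc_checkpoint are hypothetical names, a Python dict stands in for persistent storage, and the final-stage step that merges the hash table into the saved image is an assumption, since the abstract does not spell out that step.

```python
import random
import threading


class ToyVM:
    """Hypothetical stand-in for a guest: a list of pages plus a dirty set."""

    def __init__(self, page_count, seed=0):
        self.memory = [bytes(16) for _ in range(page_count)]
        self.dirty = set()
        self.lock = threading.Lock()
        self.rng = random.Random(seed)

    def run_slice(self, writes=8):
        """Normal guest execution: overwrite a few random pages."""
        for _ in range(writes):
            idx = self.rng.randrange(len(self.memory))
            value = bytes([self.rng.randrange(256)]) * 16
            with self.lock:
                self.memory[idx] = value
                self.dirty.add(idx)


def tlc_checkpoint(vm, rounds=5):
    """Return a page-index -> contents image matching memory at the final stage."""
    storage = {}      # stands in for the checkpoint file (assumption)
    dirty_table = {}  # hash table of pages dirtied during checkpointing

    def checkpoint_thread():
        # Bulk save of every page while the VM thread keeps running.
        for idx in range(len(vm.memory)):
            with vm.lock:
                storage[idx] = vm.memory[idx]

    def copy_dirty_pages():
        # Copy the current contents of dirty pages into the hash table.
        with vm.lock:
            for idx in vm.dirty:
                dirty_table[idx] = vm.memory[idx]
            vm.dirty.clear()

    with vm.lock:
        vm.dirty.clear()  # start dirty tracking from a clean slate
    saver = threading.Thread(target=checkpoint_thread)
    saver.start()

    for _ in range(rounds):
        vm.run_slice()      # VM thread makes progress
        copy_dirty_pages()  # periodic interruption of the VM thread

    saver.join()
    # Final stage: the VM no longer runs, so the remaining dirty pages
    # are collected and the hash table overrides stale saved pages
    # (assumed merge step).
    copy_dirty_pages()
    storage.update(dirty_table)
    return storage


vm = ToyVM(page_count=64)
image = tlc_checkpoint(vm, rounds=5)
print(len(image), "pages saved")
```

In this sketch any page written after dirty tracking starts ends up in the hash table with its latest contents, so overlaying the table on the bulk-saved pages yields the memory state at the final stage, while the guest is only paused for the short dirty-page copies.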
