Live Migration Ate My VM: Recovering a Virtual Machine after Failure of Post-Copy Live Migration

Post-copy is one of the two key techniques for live migration of virtual machines in data centers, the other being pre-copy. Post-copy provides a deterministic total migration time and low downtime for write-intensive VMs. However, if a post-copy migration fails for any reason, the migrating VM is lost, because during migration its latest consistent state is split between the source and destination nodes. In this paper, we present PostCopyFT, a new approach that recovers a VM after a destination or network failure during post-copy live migration by using an efficient reverse incremental checkpointing mechanism. We have implemented and evaluated our approach on the KVM/QEMU platform. Our experimental results show that PostCopyFT leaves the total migration time of post-copy unchanged while keeping failover time, downtime, and application performance overhead low.
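
To make the idea of reverse incremental checkpointing concrete, the following is a minimal toy sketch in Python. It is not the PostCopyFT implementation: it models only memory pages as dictionary entries, and the function name `run_post_copy` and the parameters `checkpoint_every` and `fail_at` are hypothetical names chosen for illustration. The destination fetches pages from the source on demand, tracks which pages it has modified, and periodically sends only those modified pages back to the source, so that the source can resume the VM from the last checkpoint if the destination is lost.

```python
def run_post_copy(source_pages, writes, checkpoint_every, fail_at=None):
    """Toy model of post-copy migration with reverse incremental checkpoints.

    source_pages: dict mapping page number -> contents held at the source.
    writes: list of (page, value) writes executed at the destination.
    checkpoint_every: send a reverse checkpoint after this many writes
                      (an assumed policy, purely for illustration).
    fail_at: index of the write at which the destination fails, or None.

    Returns (pages, committed): the surviving memory image and the number
    of writes that image reflects.
    """
    source = dict(source_pages)  # the source keeps its copy during migration
    dest = {}                    # the destination starts with no pages
    dirty = set()                # pages modified since the last checkpoint
    committed = 0                # writes already reflected at the source

    for step, (page, value) in enumerate(writes):
        if fail_at is not None and step == fail_at:
            # Destination failure: recover from the source, which holds
            # the state as of the last reverse checkpoint.
            return source, committed

        if page not in dest:
            # Demand-fetch the page from the source (models a page fault).
            dest[page] = source.get(page)
        dest[page] = value
        dirty.add(page)

        if (step + 1) % checkpoint_every == 0:
            # Reverse incremental checkpoint: ship only the dirty pages
            # from the destination back to the source.
            source.update({p: dest[p] for p in dirty})
            dirty.clear()
            committed = step + 1

    # Migration completed without failure: the destination's pages
    # overlay whatever the source still holds.
    return {**source, **dest}, len(writes)


if __name__ == "__main__":
    pages = {0: "a", 1: "b", 2: "c"}
    writes = [(0, "A"), (1, "B"), (2, "C"), (0, "Z")]
    recovered, committed = run_post_copy(pages, writes, checkpoint_every=2, fail_at=3)
    print(recovered, committed)
```

In this run the destination fails before its fourth write, so the source recovers the state as of the checkpoint taken after the second write; the third write, which was never checkpointed, is not part of the recovered image. A real system would also have to capture non-memory state and keep externally visible output consistent with the checkpoint, which this sketch does not attempt to model.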
