Improving the performance of hypervisor-based fault tolerance

Hypervisor-based fault tolerance (HBFT), a checkpoint-recovery mechanism, is an emerging approach to sustaining mission-critical applications. Based on virtualization technology, HBFT provides an economical and transparent solution. However, these advantages currently come at the cost of substantial overhead during failure-free execution, especially for memory-intensive applications. This paper presents an in-depth examination of HBFT and options to improve its performance. Based on the behavior of memory accesses across checkpointing epochs, we introduce two optimizations for the memory tracking mechanism: read fault reduction and write fault prediction. These two optimizations improve the mechanism by 31.1% and 21.4%, respectively, for some applications. We then present software superpage, which efficiently maps large memory regions between virtual machines (VMs). With the above optimizations, HBFT is improved by a factor of 1.4 to 2.2 and achieves about 60% of the performance of the native VM.
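
The idea behind write fault prediction can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the algorithm from the paper: it assumes every page is write-protected at the start of an epoch, that the first write to a protected page costs one fault, and that the predictor simply pre-grants write access to the pages dirtied in the previous epoch. It also assumes the tracker can still tell at checkpoint time which pre-granted pages were actually written (for example via a hardware dirty bit). The function name `count_write_faults` and the page-number traces are hypothetical.

```python
from typing import List, Set, Tuple


def count_write_faults(
    epochs: List[List[int]], predict: bool
) -> Tuple[int, List[Set[int]]]:
    """Simulate dirty-page tracking over checkpoint epochs.

    epochs:  one list of written page numbers per epoch (hypothetical trace).
    predict: if True, pages dirtied in the previous epoch start the next
             epoch already writable (an assumed, simplistic predictor).
    Returns the total number of write faults and the dirty set per epoch.
    """
    faults = 0
    prev_dirty: Set[int] = set()
    dirty_sets: List[Set[int]] = []

    for writes in epochs:
        # All pages are write-protected at epoch start, except predicted ones.
        writable: Set[int] = set(prev_dirty) if predict else set()
        dirty: Set[int] = set()

        for page in writes:
            if page not in writable:
                # First write to a protected page traps into the hypervisor.
                faults += 1
                writable.add(page)
            # Assumption: actual writes are observable even without a fault.
            dirty.add(page)

        dirty_sets.append(dirty)
        prev_dirty = dirty

    return faults, dirty_sets


if __name__ == "__main__":
    trace = [[1, 2, 3], [1, 2, 4], [1, 2, 4]]
    print(count_write_faults(trace, predict=False)[0])
    print(count_write_faults(trace, predict=True)[0])
```

On the trace `[[1, 2, 3], [1, 2, 4], [1, 2, 4]]`, this sketch takes 9 write faults without prediction and 4 with it, while the dirty set recorded for each epoch is identical in both modes. The saving depends entirely on how much the written pages overlap between consecutive epochs.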
