The Design and Evaluation of a Practical System for Fault-Tolerant Virtual Machines

We have implemented a commercial enterprise-grade system for providing fault-tolerant virtual machines, based on the approach of replicating the execution of a primary virtual machine (VM) via a backup virtual machine on another server. We have designed a complete system in VMware vSphere 4.0 that is easy to use, runs on commodity servers, and typically reduces performance of real applications by less than 10%. Our method for replicating VM execution is similar to that described in Bressoud [3], but we have made a number of significant design changes that greatly improve performance. In addition, an easy-to-use, commercial system that automatically restores redundancy after failure requires many additional components beyond replicated VM execution. We have designed and implemented these extra components and addressed many practical issues encountered in supporting VMs running enterprise applications. In this paper, we describe our basic design, discuss alternate design choices and a number of the implementation details, and provide an evaluation of our

[1]  Harrick M. Vin, et al. A fault-tolerant Java virtual machine, 2003, International Conference on Dependable Systems and Networks.

[2]  J. D. Day, et al. A principle for resilient sharing of distributed resources, 1976, ICSE '76.

[3]  Min Xu. ReTrace: Collecting Execution Trace with Virtual Machine Deterministic Replay, 2007.

[4]  Jason Flinn, et al. Rethink the sync, 2006, OSDI '06.

[5]  Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial, 1990.

[6]  Samuel T. King, et al. ReVirt: enabling intrusion analysis through virtual-machine logging and replay, 2002, OPSR.

[7]  Thomas C. Bressoud, et al. TFT: a software system for application-transparent fault tolerance, 1998, Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing.

[8]  Fred B. Schneider, et al. Hypervisor-based fault tolerance, 1996, TOCS.

[9]  Richard D. Schlichting, et al. Fail-stop processors: an approach to designing fault-tolerant computing systems, 1981, TOCS.

[10]  Fred B. Schneider, et al. Implementing fault-tolerant services using the state machine approach: a tutorial, 1990, CSUR.

[11]  Roy Friedman, et al. Transparent fault-tolerant Java virtual machine, 2003, 22nd International Symposium on Reliable Distributed Systems.

[12]  Dutch T. Meyer, et al. Remus: High Availability via Asynchronous Virtual Machine Replication (Best Paper), 2008, NSDI.