Radiata: Enabling Whole System Hot-mirroring via Continual State Replication

Checkpoint-recovery based on system virtualization is an attractive approach for providing transparent and economical fault tolerance in virtualized environments. Previous approaches, however, either incur significant performance degradation or are complex to implement. In this work, we propose Radiata, a whole-system hot-mirroring platform that provides fault tolerance for any type of service by encapsulating the service instance in a virtual machine and hot-mirroring the virtual machine's state changes through continual state replication. Our approach exploits three key optimizations to further reduce the performance overhead: asynchronous state replication, copy-on-write (COW) based memory checkpointing, and dirty page prediction. We have implemented a prototype system on the KVM platform. Comprehensive evaluations under a variety of workloads demonstrate that Radiata effectively supports rapid and transparent fail-over upon unexpected hardware failure, and that it outperforms existing mechanisms in terms of performance degradation under failure-free conditions.
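
As a rough illustration of continual state replication, the toy sketch below tracks dirty pages per epoch and ships each checkpoint to a backup while the primary keeps running. It is not Radiata's implementation: the names `PrimaryVM`, `BackupVM`, and `replicate` are hypothetical, guest memory is modeled as a plain list, an eager copy of the dirty pages stands in for a COW-based checkpoint, and dirty page prediction is omitted.

```python
from collections import deque


class PrimaryVM:
    """Toy guest memory: a list of page values plus a dirty-page set."""

    def __init__(self, num_pages):
        self.pages = [0] * num_pages
        self.dirty = set()

    def write(self, page, value):
        self.pages[page] = value
        self.dirty.add(page)

    def checkpoint(self):
        # Capture only the pages dirtied in this epoch. This toy copies
        # the values eagerly; a COW scheme would defer each copy until
        # the guest next writes to the page.
        snapshot = {p: self.pages[p] for p in self.dirty}
        self.dirty.clear()
        return snapshot


class BackupVM:
    """Toy backup that applies checkpoints in the order they were taken."""

    def __init__(self, num_pages):
        self.pages = [0] * num_pages

    def apply(self, snapshot):
        for page, value in snapshot.items():
            self.pages[page] = value


def replicate(primary, backup, epochs):
    """Run each epoch's writes, then replicate checkpoints asynchronously."""
    pending = deque()  # checkpoints taken but not yet applied on the backup
    for writes in epochs:
        for page, value in writes:
            primary.write(page, value)
        pending.append(primary.checkpoint())  # primary resumes immediately
        # The backup lags by one checkpoint while the primary runs ahead.
        if len(pending) > 1:
            backup.apply(pending.popleft())
    while pending:  # drain whatever is still in flight
        backup.apply(pending.popleft())


primary = PrimaryVM(4)
backup = BackupVM(4)
replicate(primary, backup, [[(0, 1), (1, 2)], [(1, 5)], [(3, 9)]])
```

Once the queue drains, the backup's pages match the primary's. In this sketch, a failure before that point would leave the backup at the last checkpoint it had applied.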
