Efficient Inter-cloud Replication for High-Availability Services

Amazon's recent service disruption, and investigations into the underlying causes of similar major outages, indicate that future cloud outages cannot be ruled out. This paper investigates tolerating outages by inter-cloud replication, i.e., replicating a service on multiple, fail-independent clouds. The challenge in realizing this idea is to minimize the performance degradation that inevitably arises when replicas on multiple clouds have to be kept mutually consistent over the Internet. We address it by developing a new order protocol that makes the most use of high-bandwidth communication within a cloud and keeps Internet communication to the minimum necessary. The protocol also copes with cloud outages and with widely differing rates at which service requests arrive at replicas in different clouds. Experiments confirm that the protocol considerably reduces ordering latencies and also improves throughput.
