Scalable, near-zero loss disaster recovery for distributed data stores

This paper presents a new Disaster Recovery (DR) system, called Slogger, that differs from prior work in two principal ways: (i) Slogger enables DR for a linearizable distributed data store, and (ii) Slogger adopts a continuous backup approach that strives to keep the backup site only a tiny lag behind the primary site, thereby restricting the window of data loss due to a disaster to milliseconds. These goals pose significant challenges related to the consistency of the backup site's state, failures, and scalability. Slogger addresses them with a combination of asynchronous log replication, intra-data-center synchronized clocks, pipelining, batching, and a novel watermark service. Furthermore, Slogger is designed to be deployable as an "add-on" module in an existing distributed data store with few modifications to the original code base. Our evaluation, conducted on Slogger extensions to a 32-shard version of LogCabin, an open-source key-value store, shows that Slogger maintains a very small data loss window of 14.2 milliseconds, which is near optimal in our evaluation setup. Moreover, Slogger reduces the data loss window by 50% compared to an incremental snapshotting technique, without imposing any performance penalty on the primary data store. Finally, our experiments demonstrate that Slogger achieves our other goals of scalability, fault tolerance, and efficient failover to the backup data store when a disaster is declared at the primary site.
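The abstract describes the watermark service only at a high level; the sketch below illustrates one plausible reading, assuming each shard asynchronously replicates log entries tagged with synchronized intra-data-center clock timestamps, and the backup site applies only the consistent prefix below the minimum timestamp durably received across all shards. All names here (ShardBackupLog, watermark, apply_up_to) are hypothetical illustrations, not Slogger's actual API.

    # Hypothetical sketch of watermark-based consistent backup application.
    # Assumes per-shard logs of (timestamp_ms, key, value) entries shipped
    # asynchronously from the primary, in timestamp order.

    from dataclasses import dataclass, field

    @dataclass
    class ShardBackupLog:
        """Per-shard log replica on the backup site."""
        shard_id: int
        entries: list = field(default_factory=list)  # (ts, key, value), sorted by ts
        applied_upto: float = 0.0                    # highest timestamp already applied

        def highest_received(self) -> float:
            return self.entries[-1][0] if self.entries else 0.0

    def watermark(shards: list[ShardBackupLog]) -> float:
        # A consistent cut: every shard has received all entries with
        # timestamps <= this value, so applying up to it cannot expose a
        # state that never existed on the primary.
        return min(s.highest_received() for s in shards)

    def apply_up_to(shard: ShardBackupLog, cut: float, store: dict) -> None:
        # Apply only the not-yet-applied entries at or below the watermark.
        for ts, key, value in shard.entries:
            if shard.applied_upto < ts <= cut:
                store[key] = value
        shard.applied_upto = max(shard.applied_upto, cut)

    if __name__ == "__main__":
        s0 = ShardBackupLog(0, [(10.0, "a", 1), (12.5, "b", 2)])
        s1 = ShardBackupLog(1, [(11.0, "c", 3)])
        backup_state: dict = {}
        cut = watermark([s0, s1])     # 11.0: shard 1 lags behind shard 0
        for s in (s0, s1):
            apply_up_to(s, cut, backup_state)
        print(cut, backup_state)      # 11.0 {'a': 1, 'c': 3}

Under this reading, the backup's lag (the data loss window) is bounded by how far the slowest shard's replication stream trails the synchronized clock, which is why pipelining and batching the log shipment matter for keeping it in the millisecond range.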
