High performance deferred update replication

Replication is a well-known approach to implementing storage systems that tolerate failures. In a replicated storage system, the state is kept at several replicas, and a replication protocol ensures that the failure of a replica is masked by the rest of the system, transparently to its users. Replicated storage systems are among the most important building blocks of large-scale applications. Such applications are often deployed on commodity hardware, store vast amounts of data, and serve a large number of users. The larger the system, the higher its vulnerability to failures.

Tolerating failures is not the only desirable property of a replicated system. Storage systems must also be efficient, so that they can serve requests from a large user base with low response times. In that respect, replication can leverage multiple replicas to parallelize the execution of user requests.

This thesis focuses on Deferred Update Replication (DUR), a well-established database replication approach. DUR provides high availability in that every replica can execute client transactions. In terms of performance, it improves on other replication techniques in that only one replica executes a given transaction, while the other replicas only apply the resulting state changes. However, DUR has a drawback: each replica stores a full copy of the database, which has two consequences for performance.

The first consequence is that DUR cannot take advantage of the aggregate memory available to the replicas. Our first contribution is a distributed caching mechanism that addresses this problem. It makes efficient use of the main memory of an entire cluster of machines while guaranteeing strong consistency.

The second consequence is that DUR cannot scale with the number of replicas: the throughput of a fully replicated system is inherently limited by the number of transactions that a single replica can apply to its local storage. We propose a scalable version of DUR in which the system state is partitioned across smaller replica sets, so that transactions that access disjoint partitions can be processed in parallel.
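
The execution model summarized above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The names `Transaction`, `Replica`, `deliver`, and `can_run_in_parallel` are hypothetical, the version-based certification test follows the way DUR is commonly described rather than anything stated in this abstract, and a simple loop stands in for the ordered broadcast a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """Hypothetical transaction record built up during local execution."""
    read_set: dict = field(default_factory=dict)   # key -> version observed
    write_set: dict = field(default_factory=dict)  # key -> new value


class Replica:
    """One full copy of the database: key -> (value, version)."""

    def __init__(self):
        self.store = {}

    def read(self, txn, key):
        # Executed only at the replica that runs the transaction.
        value, version = self.store.get(key, (None, 0))
        txn.read_set[key] = version
        return value

    def write(self, txn, key, value):
        # Writes are buffered (deferred) until commit time.
        txn.write_set[key] = value

    def certify_and_apply(self, txn):
        # Assumed certification rule: abort if any key read has since changed.
        for key, seen in txn.read_set.items():
            if self.store.get(key, (None, 0))[1] != seen:
                return False
        # Other replicas never re-execute; they only apply the state changes.
        for key, value in txn.write_set.items():
            version = self.store.get(key, (None, 0))[1]
            self.store[key] = (value, version + 1)
        return True


def deliver(replicas, txn):
    """Stand-in for ordered delivery: all replicas process txn in the same order."""
    outcomes = [r.certify_and_apply(txn) for r in replicas]
    assert len(set(outcomes)) == 1  # deterministic check, same outcome everywhere
    return outcomes[0]


def can_run_in_parallel(t1, t2, partition_of):
    """True if the two transactions touch disjoint partitions.

    partition_of is a hypothetical key -> partition id mapping.
    """
    def touched(txn):
        keys = set(txn.read_set) | set(txn.write_set)
        return {partition_of[k] for k in keys}

    return touched(t1).isdisjoint(touched(t2))
```

In this sketch, two transactions that read the same key at different replicas and then both try to update it are delivered one after the other: the first commits at every replica and the second fails certification at every replica. The cost of full replication is also visible, since every replica must apply every committed write set, which is the bottleneck that partitioning is meant to remove.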
