Efficient Snapshot Isolation in Paxos-Replicated Database Systems

Modern database systems are increasingly deployed on clusters of commodity machines with Paxos-based replication to offer better performance, higher availability, and fault tolerance. In the widely adopted implementation, one database replica is elected leader and is responsible for handling transaction requests. After a transaction finishes executing, the leader generates its transaction log and does not commit the transaction until the log has been replicated to a majority of replicas. The state of the leader is always ahead of that of the follower replicas, since the leader commits transactions first and only notifies the other replicas of the latest committed log entries in later communication. Because a follower replica cannot immediately provide the latest snapshot, both read-write and read-only transactions are executed at the leader to guarantee strong snapshot isolation semantics. In this work, we design and implement an efficient snapshot isolation scheme. The scheme uses adaptive timestamp allocation to avoid frequently requesting the leader to assign transaction timestamps. Furthermore, we design an early log replay mechanism for follower replicas, which allows a follower replica to execute a read operation without waiting to replay the log to generate the required snapshot. Compared with the conventional implementation, we show experimentally that the optimized snapshot isolation scheme for Paxos-replicated database systems achieves better scalability and throughput.
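
The follower lag described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `Replica`, `Leader`, `append`, and `commit` are hypothetical, and the model assumes a failure-free cluster with synchronous, in-process messaging (no elections, retries, or networking). It only shows that a follower learns the commit point of an entry from a later message, so its committed state trails the leader's.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical follower: stores log entries and a known commit index."""
    log: list = field(default_factory=list)
    commit_index: int = 0  # number of log entries known to be committed

    def append(self, entry, leader_commit):
        # Store the entry and learn the leader's commit index as it was
        # when the message was sent (before this entry was committed).
        self.log.append(entry)
        self.commit_index = min(leader_commit, len(self.log))
        return True  # acknowledgement to the leader


@dataclass
class Leader(Replica):
    """Hypothetical leader: commits once a majority holds the entry."""
    followers: list = field(default_factory=list)

    def commit(self, entry):
        self.log.append(entry)
        acks = 1  # the leader's own copy counts toward the majority
        for follower in self.followers:
            if follower.append(entry, self.commit_index):
                acks += 1
        cluster_size = len(self.followers) + 1
        if acks > cluster_size // 2:
            # Majority reached: the leader commits first; followers
            # hear about this commit only in a later message.
            self.commit_index = len(self.log)
        return self.commit_index


f1, f2 = Replica(), Replica()
leader = Leader(followers=[f1, f2])

leader.commit("txn-1")
print(leader.commit_index, f1.commit_index)  # leader is one entry ahead

leader.commit("txn-2")
print(leader.commit_index, f1.commit_index)  # follower still trails by one
```

In this simplified model, a read served by `f1` right after `txn-2` commits would see only `txn-1` as committed, which is why a conventional design routes reads to the leader.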
