Tashkent: uniting durability with transaction ordering for high-performance scalable database replication

In stand-alone databases, the functions of ordering the transaction commits and making the effects of transactions durable are performed in one single action, namely the writing of the commit record to disk. For efficiency many of these writes are grouped into a single disk operation. In replicated databases in which all replicas agree on the commit order of update transactions, these two functions are typically separated. Specifically, the replication middleware determines the global commit order, while the database replicas make the transactions durable.The contribution of this paper is to demonstrate that this separation causes a significant scalability bottleneck. It forces some of the commit records to be written to disk serially, where in a standalone system they could have been grouped together in a single disk write. Two solutions are possible: (1) move durability from the database to the replication middleware, or (2) keep durability in the database and pass the global commit order from the replication middleware to the database.We implement these two solutions. Tashkent-MW is a pure middleware solution that combines durability and ordering in the middleware, and treats an unmodified database as a black box. In Tashkent-API, we modify the database API so that the middleware can specify the commit order to the database, thus, combining ordering and durability inside the database. We compare both Tashkent systems to an otherwise identical replicated system, called Base, in which ordering and durability remain separated. Under high update transaction loads both Tashkent systems greatly outperform Base in throughput and response time.

[1]  Robbert van Renesse,et al.  Chain Replication for Supporting High Throughput and Availability , 2004, OSDI.

[2]  Christos H. Papadimitriou,et al.  The Theory of Database Concurrency Control , 1986 .

[3]  Fernando Pedone,et al.  Database replication using generalized snapshot isolation , 2005, 24th IEEE Symposium on Reliable Distributed Systems (SRDS'05).

[4]  Lei Gao,et al.  Application specific data replication for edge services , 2003, WWW '03.

[5]  Alan Fekete,et al.  Allocating isolation levels to transactions , 2005, PODS '05.

[6]  Dennis Shasha,et al.  The dangers of replication and a solution , 1996, SIGMOD '96.

[7]  Leslie Lamport,et al.  The part-time parliament , 1998, TOCS.

[8]  Ricardo Jiménez-Peris,et al.  Middleware based data replication providing snapshot isolation , 2005, SIGMOD '05.

[9]  Gustavo Alonso,et al.  Understanding replication in databases and distributed systems , 2000, Proceedings 20th IEEE International Conference on Distributed Computing Systems.

[10]  Bettina Kemme,et al.  Postgres-R(SI): combining replica control with concurrency control based on snapshot isolation , 2005, 21st International Conference on Data Engineering (ICDE'05).

[11]  Jim Gray,et al.  A critique of ANSI SQL isolation levels , 1995, SIGMOD '95.

[12]  Fred B. Schneider,et al.  Implementing fault-tolerant services using the state machine approach: a tutorial , 1990, CSUR.

[13]  Gustavo Alonso,et al.  Ganymed: Scalable Replication for Transactional Web Applications , 2004, Middleware.

[14]  Calton Pu,et al.  Replica control in distributed systems: as asynchronous approach , 1991, SIGMOD '91.

[15]  Lars Frank Evaluation of the basic remote backup and replication methods for high availability databases , 1999 .

[16]  Gustavo Alonso,et al.  Don't Be Lazy, Be Consistent: Postgres-R, A New Way to Implement Database Replication , 2000, VLDB.

[17]  Dennis Shasha,et al.  Making snapshot isolation serializable , 2005, TODS.

[18]  Gustavo Alonso,et al.  A suite of database replication protocols based on group communication primitives , 1998, Proceedings. 18th International Conference on Distributed Computing Systems (Cat. No.98CB36183).

[19]  Lars Frank,et al.  Evaluation of the basic remote backup and replication methods for high availability databases , 1999, Softw. Pract. Exp..