Understanding Non-Blocking Atomic Commitment

In distributed database systems, an atomic commitment protocol ensures that transactions terminate consistently at all participating sites, even in the presence of failures. An atomic commitment protocol is said to be non-blocking if it permits transaction termination to proceed at correct participants despite the failure of others. In large-scale distributed database systems, where failures may be frequent, protocols with this property are particularly desirable because they limit the time intervals during which transactions may hold valuable resources. In this paper, we show how non-blocking atomic commitment protocols can be obtained through slight modifications of the well-known Two-Phase Commit (2PC) protocol, which is known to be blocking. Our approach is modular in the sense that both the protocols and their proofs of correctness are obtained by plugging the appropriate reliable broadcast algorithms into the original 2PC protocol as its basic communication primitives. The resulting protocols are not only conceptually simple but also efficient in terms of time and message complexity.
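To make the modular idea concrete, the following is a minimal sketch, not the paper's protocol. It assumes a toy, single-process simulation in which the coordinator may crash partway through sending the decision. All names (`simple_broadcast`, `relaying_broadcast`, `two_phase_commit`, `crash_after`) are hypothetical, and the relaying broadcast is a deliberately simplified stand-in for whatever reliable broadcast algorithm is plugged in.

```python
from typing import Callable, Dict, List, Optional

Decision = str  # "COMMIT" or "ABORT"
# A broadcast primitive maps (decision, participants, crash_after) to the
# decision each participant ends up holding (None means still undecided).
Broadcast = Callable[[Decision, List[str], Optional[int]],
                     Dict[str, Optional[Decision]]]


def simple_broadcast(decision: Decision, participants: List[str],
                     crash_after: Optional[int] = None
                     ) -> Dict[str, Optional[Decision]]:
    """Point-to-point sends; the coordinator may crash after `crash_after` sends."""
    delivered: Dict[str, Optional[Decision]] = {p: None for p in participants}
    for i, p in enumerate(participants):
        if crash_after is not None and i >= crash_after:
            break  # coordinator crashed; the remaining participants hear nothing
        delivered[p] = decision
    return delivered


def relaying_broadcast(decision: Decision, participants: List[str],
                       crash_after: Optional[int] = None
                       ) -> Dict[str, Optional[Decision]]:
    """Every receiver relays the decision to all participants before delivering."""
    delivered = simple_broadcast(decision, participants, crash_after)
    if any(d is not None for d in delivered.values()):
        # Assumption for this sketch: relays are reliable and receivers stay up,
        # so one successful send is enough for everyone to learn the decision.
        delivered = {p: decision for p in participants}
    return delivered


def two_phase_commit(votes: Dict[str, str], broadcast: Broadcast,
                     crash_after: Optional[int] = None
                     ) -> Dict[str, Optional[Decision]]:
    """Phase 1: collect votes. Phase 2: disseminate the decision via `broadcast`."""
    decision = "COMMIT" if all(v == "YES" for v in votes.values()) else "ABORT"
    return broadcast(decision, list(votes), crash_after)


votes = {"p1": "YES", "p2": "YES", "p3": "YES"}
blocking = two_phase_commit(votes, simple_broadcast, crash_after=1)
non_blocking = two_phase_commit(votes, relaying_broadcast, crash_after=1)
```

In this sketch, `blocking` leaves `p2` and `p3` undecided after the coordinator crashes, while `non_blocking` lets every participant decide, even though only the broadcast primitive was swapped. The case in which the coordinator crashes before any send succeeds is not modeled; handling it would need an additional rule, such as a timeout, that the sketch omits.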
