Open commit protocols tolerating commission failures

To ensure the atomicity of transactions in distributed systems, so-called 2-phase commit (2PC) protocols have been proposed. These protocols assume that the processing nodes involved in a transaction are “sane,” i.e., they fail only by omission and eventually recover from failures. Unfortunately, this assumption is not realistic for so-called Open Distributed Systems (ODSs), in which nodes may have totally different reliability characteristics. In ODSs, nodes can be classified into trusted nodes (e.g., a banking server) and nontrusted nodes (e.g., a home PC requesting a remote banking service). While trusted nodes are assumed to be sane, nontrusted nodes may fail permanently and may even exhibit commission failures. In this paper, we propose a family of 2PC protocols that tolerate any number of omission failures at trusted nodes and any number of commission and omission failures at nontrusted nodes. The proposed protocols ensure that (at least) the trusted nodes participating in a transaction eventually terminate the transaction in a consistent manner. Unlike Byzantine commit protocols, our protocols do not incorporate mechanisms for achieving Byzantine agreement, which keeps their complexity low: their message complexity is the same as, or only slightly higher than, that of traditional 2PC protocols.
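For orientation, the decision rule of a traditional 2PC coordinator can be sketched as follows. This is a minimal illustration of the baseline under the "sane node" assumption only, not the protocols proposed in the paper: it does not model commission failures or the trusted/nontrusted distinction. The names `Vote`, `decide`, and `two_phase_commit` are hypothetical, and a missing vote (an omission failure) is represented as `None`.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


def decide(votes):
    """Return "commit" only if every participant voted YES.

    `votes` maps a participant name to a Vote, or to None when no vote
    arrived before the coordinator's timeout (an omission failure).
    An empty vote set is treated as "abort".
    """
    if votes and all(v is Vote.YES for v in votes.values()):
        return "commit"
    return "abort"


def two_phase_commit(participants):
    """Run one round of baseline 2PC over in-process participants.

    `participants` maps a name to a zero-argument callable that returns
    the participant's vote (hypothetical stand-in for a PREPARE message).
    """
    # Phase 1: ask every participant to prepare and collect the votes.
    votes = {name: prepare() for name, prepare in participants.items()}
    decision = decide(votes)
    # Phase 2: every participant is sent the same decision.
    outcome = {name: decision for name in participants}
    return decision, outcome
```

For example, `two_phase_commit({"bank": lambda: Vote.YES, "pc": lambda: None})` yields an abort for both participants, because the vote from `pc` never arrived.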
