Two Approaches for High Concurrency in Multicast-Based Object Replication

This report presents a replica control protocol for atomic objects. The protocol is derived from an atomic broadcast primitive, and places constraints on the delivery of messages to provide a consistent message order among sites. Several heuristic techniques are proposed to reduce the latency of message delivery, for two types of orders. Messages are delivered either in the same order at all sites, or in an order semantically equivalent to this unique ordering. The equivalence relation is based on the commutativity of operations on objects; for example, two deposit operations commute. The protocol uses a reliable causal multicast primitive, and is fully distributed. The first set of heuristics is based on a voting scheme, and delivers messages in a unique order. Totally ordered atomic multicast can be built on top of a reliable causal multicast by waiting until each processor in the group has multicast a message, inserting these messages in a causal graph, and then delivering the roots of this graph. The latency of this approach can be reduced with a voting scheme that allows messages to be inserted in the total order without waiting for a message from all group members. Dolev, Kramer and Malki proposed such a protocol, but it is more restrictive than it needs to be, and can be extended to a wider range of operating conditions. In particular, we show how to deliver a message even if it is acknowledged by fewer than half of the processors. If some messages remain undelivered (not yet placed in the total order), a second set of heuristics is available to a processor. If messages correspond to operations on objects, the processor can use local and type-specific information to minimize delivery constraints. More specifically, if operations commute, it is possible to initiate some of them early, provided that the execution is equivalent to a linear sequence of operation invocations that is the same for all sites.
The commutativity of operations depends on the state of the data, which allows more concurrency than would otherwise be possible. We present a general algorithm for ordering operations provided that messages are causally ordered, and give two heuristics to make it practical. Finally, we present a performance evaluation of the protocols based on discrete-event simulation. Our multicast primitives perform well, and adapt better than previous work to various network configurations. They are also more scalable.
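
To make state-dependent commutativity concrete, the following is a minimal sketch and not the report's algorithm. It assumes a hypothetical bank-account object with deposit and withdraw operations, where a withdrawal is refused when the balance is too low. The names Op, apply and commute are illustrative. Two operations are treated as commuting in a given state if both execution orders produce the same final balance and the same response for each operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str    # "deposit" or "withdraw" (hypothetical operation names)
    amount: int


def apply(balance, op):
    """Apply one operation; return (new_balance, response)."""
    if op.kind == "deposit":
        return balance + op.amount, "ok"
    if op.kind == "withdraw":
        if op.amount <= balance:
            return balance - op.amount, "ok"
        # Assumed semantics: a refused withdrawal leaves the balance unchanged.
        return balance, "insufficient funds"
    raise ValueError(f"unknown operation: {op.kind}")


def commute(balance, a, b):
    """True if a and b yield the same final balance and the same
    per-operation responses in either order, starting from balance."""
    s_ab, resp_a1 = apply(balance, a)
    s_ab, resp_b1 = apply(s_ab, b)
    s_ba, resp_b2 = apply(balance, b)
    s_ba, resp_a2 = apply(s_ba, a)
    return (s_ab, resp_a1, resp_b1) == (s_ba, resp_a2, resp_b2)
```

Under these assumptions, two deposits commute in every state, while two withdrawals commute only when the balance covers both (or neither). A replica that knows its local balance could therefore start more operations early than one relying on a state-independent commutativity table.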

[1]  Leslie Lamport,et al.  The Byzantine generals , 1987 .

[2]  Michael Stonebraker,et al.  Readings in Database Systems , 1988 .

[3]  Divyakant Agrawal,et al.  The generalized tree quorum protocol: an efficient approach for managing replicated data , 1992, TODS.

[4]  Shivakant Mishra,et al.  Consul: a communication substrate for fault-tolerant distributed programs , 1993, Distributed Syst. Eng..

[5]  Piotr Berman,et al.  Voting as the Optimal Static Pessimistic Scheme for Managing Replicated Data , 1994, IEEE Trans. Parallel Distributed Syst..

[6]  Jo-Mei Chang,et al.  Reliable broadcast protocols , 1984, TOCS.

[7]  Sushil Jajodia,et al.  Dynamic voting algorithms for maintaining the consistency of a replicated database , 1990, TODS.

[8]  Maurice Herlihy,et al.  Linearizability: a correctness condition for concurrent objects , 1990, TOPL.

[9]  A. Fleischmann Distributed Systems , 1994, Springer Berlin Heidelberg.

[10]  Leslie Lamport,et al.  Time, clocks, and the ordering of events in a distributed system , 1978, CACM.

[11]  Shivakant Mishra,et al.  A Membership Protocol Based on Partial Order , 1992 .

[12]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.

[13]  Flaviu Cristian,et al.  Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement , 1995, Inf. Comput..

[14]  Hector Garcia-Molina,et al.  Elections in a Distributed Computing System , 1982, IEEE Transactions on Computers.

[15]  Philip A. Bernstein,et al.  An algorithm for concurrency control and recovery in replicated distributed databases , 1984, TODS.

[16]  André Schiper,et al.  Lightweight causal and atomic group multicast , 1991, TOCS.

[17]  Derek L. Eager,et al.  Achieving robustness in distributed database systems , 1983, TODS.

[18]  Kenneth P. Birman,et al.  The process group approach to reliable distributed computing , 1992, CACM.

[19]  Hector Garcia-Molina,et al.  The vulnerability of vote assignments , 1986, TOCS.

[20]  Scott Shenker,et al.  Epidemic algorithms for replicated database maintenance , 1988, OPSR.

[21]  Hervé Moulin Fairness and strategy in voting , 1985 .

[22]  Akhil Kumar,et al.  Hierarchical Quorum Consensus: A New Algorithm for Managing Replicated Data , 1991, IEEE Trans. Computers.

[23]  Barbara Liskov,et al.  Viewstamped Replication: A General Primary Copy , 1988, PODC.

[24]  Flaviu Cristian,et al.  Clock Synchronization in the Presence of Omission and Performance Faults, and Processor Joins , 1986 .

[25]  D. McCue,et al.  Fault-Tolerance in the Advanced Automation System , 1991, OPSR.

[26]  Hector Garcia-Molina,et al.  How to assign votes in a distributed system , 1985, JACM.

[27]  Vijay Kumar,et al.  Performance Measurement of Main Memory Database Recovery Algorithms Based on Update-in-Place and Shadow Approaches , 1992, IEEE Trans. Knowl. Data Eng..

[28]  R.K. Guy,et al.  On numbers and games , 1978, Proceedings of the IEEE.

[29]  Vivek Agrawala,et al.  Asynchronous Fault-Tolerant Total Ordering Algorithms , 1993, SIAM J. Comput..

[30]  Mostafa H. Ammar,et al.  The Grid Protocol: A High Performance Scheme for Maintaining Replicated Data , 1992, IEEE Trans. Knowl. Data Eng..

[31]  William E. Weihl,et al.  Local atomicity properties: modular concurrency control for abstract data types , 1989, TOPL.

[32]  Danny Dolev,et al.  Early delivery totally ordered multicast in asynchronous environments , 1993, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.

[33]  M. Herlihy A quorum-consensus replication method for abstract data types , 1986, TOCS.

[34]  Philip D. Straffin,et al.  Topics in the theory of voting , 1980 .

[35]  Mostafa H. Ammar,et al.  Performance Characterization of Quorum-Consensus Algorithms for Replicated Data , 1989, IEEE Trans. Software Eng..

[36]  P.M. Melliar-Smith,et al.  Fault-tolerant distributed systems based on broadcast communication , 1989, [1989] Proceedings. The 9th International Conference on Distributed Computing Systems.

[37]  Robert H. Thomas,et al.  A Majority consensus approach to concurrency control for multiple copy databases , 1979, ACM Trans. Database Syst..

[38]  Virgil D. Gligor,et al.  A fault-tolerant protocol for atomic broadcast , 1988, Proceedings [1988] Seventh Symposium on Reliable Distributed Systems.

[39]  Robbert van Renesse,et al.  Voting with ghosts , 1988, [1988] Proceedings. The 8th International Conference on Distributed.

[40]  Maurice Herlihy,et al.  Concurrency and availability as dual properties of replicated atomic data , 1990, JACM.

[41]  Louise E. Moser,et al.  Fast message ordering and membership using a logical token-passing ring , 1993, [1993] Proceedings. The 13th International Conference on Distributed Computing Systems.

[42]  Philip A. Bernstein,et al.  Concurrency Control in Distributed Database Systems , 1986, CSUR.

[43]  Akhil Kumar,et al.  Cost and availability tradeoffs in replicated data concurrency control , 1993, TODS.

[44]  Yair Amir,et al.  Membership Algorithms for Multicast Communication Groups , 1992, WDAG.

[45]  Andrew S. Tanenbaum,et al.  Efficient Reliable Group Communication for Distributed Systems , 1992 .

[46]  Louise E. Moser,et al.  Broadcast Protocols for Distributed Systems , 1990, IEEE Trans. Parallel Distributed Syst..

[47]  Louise E. Moser,et al.  Membership algorithms for asynchronous distributed systems , 1991, [1991] Proceedings. 11th International Conference on Distributed Computing Systems.

[48]  Richard D. Schlichting,et al.  Preserving and using context information in interprocess communication , 1989, TOCS.

[49]  William E. Weihl Linguistic support for atomic data types , 1990, TOPL.

[50]  Ambuj K. Singh,et al.  Consistency and orderability: semantics-based correctness criteria for databases , 1993, TODS.

[51]  Paul A. Fishwick,et al.  Simulation model design and execution - building digital worlds , 1995 .

[52]  Shivakant Mishra,et al.  Implementing fault-tolerant replicated objects using Psync , 1989, Proceedings of the Eighth Symposium on Reliable Distributed Systems.

[53]  Maurice Herlihy,et al.  Dynamic quorum adjustment for partitioned data , 1987, TODS.

[54]  Jon M. Peha,et al.  OSCAR: a system for weak-consistency replication , 1990, [1990] Proceedings. Workshop on the Management of Replicated Data.

[55]  K. M. Chandy,et al.  Incremental Recovery In Main Memory Database Systems , 1992 .

[56]  H. Moulin Axioms of Cooperative Decision Making , 1988 .

[57]  K. Mani Chandy,et al.  Parallel program design - a foundation , 1988 .

[58]  David K. Gifford,et al.  Weighted voting for replicated data , 1979, SOSP '79.

[59]  Yair Amir,et al.  Transis: A Communication Sub-system for High Availability , 1992 .

[60]  Sam Toueg,et al.  Unreliable failure detectors for reliable distributed systems , 1996, JACM.

[61]  Mostafa H. Ammar,et al.  Optimizing vote and quorum assignments for reading and writing replicated data , 1989, [1989] Proceedings. Fifth International Conference on Data Engineering.

[62]  Danny Dolev,et al.  Total Ordering of Messages in Broadcast Domains , 1992 .

[63]  Calton Pu,et al.  Regeneration of replicated objects: A technique and its Eden implementation , 1986, 1986 IEEE Second International Conference on Data Engineering.

[64]  Hector Garcia-Molina,et al.  System M: A Transaction Processing Testbed for Memory Resident Data , 1990, IEEE Trans. Knowl. Data Eng..

[65]  Michael J. Fischer,et al.  Efficiency of Synchronous Versus Asynchronous Distributed Systems , 1983 .

[66]  Robbert van Renesse,et al.  Reliable Multicast between Micro-Kernels , 1992, USENIX Workshop on Microkernels and Other Kernel Architectures.

[67]  Stephen E. Deering,et al.  Host extensions for IP multicasting , 1986, RFC.

[68]  Nancy A. Lynch,et al.  Impossibility of distributed consensus with one faulty process , 1983, PODS '83.

[69]  Hector Garcia-Molina,et al.  Protocols for dynamic vote reassignment , 1986, PODC '86.

[70]  William E. Weihl,et al.  Commutativity-based concurrency control for abstract data types , 1988, [1988] Proceedings of the Twenty-First Annual Hawaii International Conference on System Sciences. Volume II: Software track.

[71]  Fred B. Schneider,et al.  Implementing fault-tolerant services using the state machine approach: a tutorial , 1990, CSUR.

[72]  Liuba Shrira,et al.  Providing high availability using lazy replication , 1992, TOCS.

[73]  William E. Weihl The impact of recovery on concurrency control , 1989, PODS '89.

[74]  Willy Zwaenepoel,et al.  Distributed process groups in the V Kernel , 1985, TOCS.

[75]  Hector Garcia-Molina,et al.  The Reliability of Voting Mechanisms , 1987, IEEE Transactions on Computers.