Paper information - The management of replication in a distributed system


The field of consistency control protocols for replicated data objects has existed for about ten years. Its birth coincides with the advent of distributed databases and the communications technology required to support them. When data objects are replicated around a computer network, a protocol must be chosen to ensure a consistent view to an accessing process. The replicas of the data object are then said to be mutually consistent. The protocols used to ensure mutual consistency are known as replica control or consistency control protocols. A distributed system has several advantages over a single-processor system. Among these are increased computing power and the ability to tolerate partial failures due to the malfunction of individual components. The redundancy present in a distributed system has been the focus of much research in the area of distributed database systems. Another benefit of this natural redundancy, together with the relatively independent failure modes of the processors, is that the system can continue operating even after some of the processors have failed. This can be used to construct data objects that are robust in the face of partial system failures. The focus of this dissertation is the exploitation of the redundancy present in distributed systems to attain an increased level of fault tolerance for data objects. Replication is a well-known technique for increasing fault tolerance, but it introduces the additional complexity of maintaining mutual consistency among the replicas of the data object. The protocols that manage the replicated data and provide the user with a single consistent view of that data are studied, and a comprehensive analysis of the fault tolerance provided by several of the most promising protocols is presented. Several techniques are employed, including Markov analysis and discrete event simulation. Simulation is used to confirm and extend the results obtained using analytic techniques.

Darrell D. E. Long | Jehan-François Pâris
