Fault tolerant remote procedure call

A scheme is presented that makes a remote procedure call (RPC) mechanism tolerant of hardware failures. Fault tolerance is provided by replicating the procedure at a group of nodes, called a cluster. The copies in a cluster are linearly ordered. A call to a procedure is sent to the first copy in the cluster and is propagated internally to all other copies. If failures occur, the first copy in the cluster that has not failed returns the result to the caller. The scheme is transparent to the user and supports nested procedure calls. It has been implemented on a network of Sun workstations using Sun's existing RPC mechanism.
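The call path described above can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions, not the paper's implementation: it does not use Sun RPC, it models a node failure as a simple `failed` flag (a fail-stop assumption that the abstract does not spell out), and it leaves out failure detection, message passing, and nested calls. The names `Replica`, `Cluster`, and `call` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Replica:
    """One copy of the procedure, hosted on one node of the cluster."""
    rank: int                         # position in the cluster's linear order
    procedure: Callable[[int], int]   # the replicated procedure
    failed: bool = False              # assumption: a failed node simply stops
    result: Optional[int] = None      # result of the most recent call

    def execute(self, arg: int) -> None:
        self.result = self.procedure(arg)


class Cluster:
    """A linearly ordered group of copies of the same procedure."""

    def __init__(self, procedure: Callable[[int], int], size: int) -> None:
        self.replicas: List[Replica] = [
            Replica(rank, procedure) for rank in range(size)
        ]

    def call(self, arg: int) -> Tuple[int, int]:
        """Run the call on every live copy, in cluster order.

        Returns (rank of the copy that replies, result). The replying copy
        is the first one in the linear order that has not failed.
        """
        live = [r for r in self.replicas if not r.failed]
        if not live:
            raise RuntimeError("all copies in the cluster have failed")
        # Stand-in for internal propagation: each live copy executes the call.
        for replica in live:
            replica.execute(arg)
        first_live = live[0]
        return first_live.rank, first_live.result
```

With a three-copy cluster built as `Cluster(lambda x: x * x, 3)`, `call(4)` is answered by the copy of rank 0. After setting `replicas[0].failed = True`, the next call is answered by the copy of rank 1, and the caller's code does not change, which is the transparency property the abstract claims.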
