Publishing: a reliable broadcast communication mechanism

Publishing is a model and mechanism for crash recovery in a distributed computing environment. Published communication works for systems connected via a broadcast medium by recording every message transmitted over the network. The recovery mechanism can be completely transparent to the failed process and to all processes interacting with it. Although published communication is intended for a broadcast network such as a bus, a ring, or an Ethernet, it can be used in other environments. A recorder reliably stores all transmitted messages, as well as checkpoint and recovery information. When it detects a failure, the recorder may restart the affected processes from their checkpoints. It then resends to each restarted process all messages that were sent to it since its checkpoint was taken, and ignores the duplicate messages that the process sends while it re-executes. Message-based systems without shared memory can use published communication to recover groups of processes. Simulations show that at least 5 multi-user minicomputers can be supported on a standard Ethernet using a single recorder. The prototype implemented in DEMOS/MP demonstrates that error recovery can be transparent to user processes and can be centralized in the network.
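
The checkpoint, record, and replay cycle described in the abstract can be illustrated with a small in-memory sketch. It is not the DEMOS/MP implementation: the names `Recorder`, `Message`, `publish`, `recover`, and `accumulator` are hypothetical, and the per-sender sequence numbers used to recognize duplicates are an assumption made for this example. The sketch also assumes that each process is a deterministic function of its checkpointed state and the messages it receives.

```python
import copy
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    seq: int          # per-sender sequence number (assumed for duplicate detection)
    payload: int


class Recorder:
    """Hypothetical recorder: stores checkpoints and every published message."""

    def __init__(self):
        self.checkpoints = {}   # process id -> saved state
        self.inbound = {}       # process id -> messages received since its checkpoint
        self.last_seq = {}      # sender id -> highest sequence number already seen

    def checkpoint(self, pid, state):
        # Save a private copy of the state and discard the older message log.
        self.checkpoints[pid] = copy.deepcopy(state)
        self.inbound[pid] = []

    def publish(self, msg):
        # Record a message seen on the network; return False for a duplicate.
        if msg.seq <= self.last_seq.get(msg.sender, -1):
            return False
        self.last_seq[msg.sender] = msg.seq
        self.inbound.setdefault(msg.receiver, []).append(msg)
        return True

    def recover(self, pid, handler):
        # Restart from the checkpoint and replay the recorded inbound messages.
        state = copy.deepcopy(self.checkpoints[pid])
        suppressed = 0
        for msg in list(self.inbound.get(pid, [])):
            for out in handler(state, msg):
                if not self.publish(out):
                    suppressed += 1   # already sent before the crash
        return state, suppressed


def accumulator(state, msg):
    # Deterministic example process: adds each payload and reports the total.
    state["total"] += msg.payload
    out = Message("acc", "log", state["next_seq"], state["total"])
    state["next_seq"] += 1
    return [out]


rec = Recorder()
state = {"total": 0, "next_seq": 0}
rec.checkpoint("acc", state)

# Normal operation: two client messages arrive and are processed.
for i, value in enumerate([5, 7]):
    msg = Message("client", "acc", i, value)
    rec.publish(msg)
    for out in accumulator(state, msg):
        rec.publish(out)

# The process crashes and loses its state; the recorder rebuilds it.
recovered, suppressed = rec.recover("acc", accumulator)
print(recovered, suppressed)
```

In this sketch the recovered state matches the state before the crash, and the two messages that the process regenerates during replay are dropped rather than delivered a second time, so the `log` process sees no effect of the failure.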
