Fault-tolerant parallel programming in Argus

Fault tolerance is an issue ignored in most parallel languages. The overhead of making parallel, high-performance programs resilient to processor crashes is often considered too high, given the low probability of such events. As parallel systems grow larger, however, processor failures become increasingly likely and must be dealt with. Two approaches to this problem are feasible. First, the system can make programs fault-tolerant transparently, for example by logging messages and making checkpoints. Second, the programmer can write explicit code to handle failures in an application-specific way. The latter approach is potentially more efficient, but it also demands more work from the programmer. In this paper, we aim to gain initial insight into how difficult and how efficient explicit fault-tolerant parallel programming is. We do so by implementing four parallel applications in Argus, a language that supports both parallelism and fault tolerance. Our experience indicates that the extra effort needed for fault tolerance varies considerably between applications. Moreover, trade-offs can frequently be made between programming effort and efficiency. One lesson we learned is that fault tolerance should not be added as an afterthought, but is best taken into account from the start. We also found that the ability to integrate transparent and explicit mechanisms for fault tolerance would sometimes be highly useful.