Fault-tolerant parallel programming in Argus

Fault tolerance is an issue ignored in most parallel languages. The overhead of making parallel, high-performance programs resilient to processor crashes is often considered too high, given the low probability of such events. As parallel systems grow larger, however, processor failures become increasingly likely and must be dealt with. Two approaches to this problem are feasible. First, the system can make programs fault-tolerant transparently, for example by logging messages and making checkpoints. Second, the programmer can write explicit code to handle failures in an application-specific way. The latter approach is potentially more efficient, but it also demands more work from the programmer. In this paper, we aim to gain initial insight into how difficult and how efficient explicit fault-tolerant parallel programming is. We do so by implementing four parallel applications in Argus, a language that supports both parallelism and fault tolerance. Our experience indicates that the extra effort needed for fault tolerance varies considerably between applications. Moreover, trade-offs can frequently be made between programming effort and efficiency. One lesson we learned is that fault tolerance should not be added as an afterthought, but is best taken into account from the start. We also found that the ability to integrate transparent and explicit mechanisms for fault tolerance would sometimes be highly useful.