NAP: practical fault-tolerance for itinerant computations

One use of mobile agents is to support itinerant computation (D. Chess et al., 1995). An itinerant computation is a program that moves from host to host in a network, and the program itself determines which hosts it visits. It can follow a pre-defined itinerary or dynamically compute the next host at each host it visits; it can visit the same host repeatedly, and it can even create multiple concurrent copies of itself on a single host. Itinerant computations are susceptible to processor failures, communication failures, and crashes caused by program bugs. NAP is a protocol for supporting fault tolerance in itinerant computations. It employs a form of failure detection and recovery, and it generalizes the primary-backup approach to a new computational model. The guarantees offered by NAP and an implementation of NAP in TACOMA are discussed.

[1] Markus Straßer et al. A fault-tolerant protocol for providing the exactly-once property of mobile agents, 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).

[2] Fred B. Schneider et al. Primary-Backup Protocols: Lower Bounds and Optimal Implementations, 1992.

[3] Richard D. Schlichting et al. Fail-stop processors: an approach to designing fault-tolerant computing systems, 1983, TOCS.

[4] Richard D. Schlichting et al. Fail-Stop Processors: An Approach to Designing Computing Systems, 1983.

[5] Fred B. Schneider et al. Towards Fault-Tolerant and Secure Agentry, 1997, WDAG.

[6] Robbert van Renesse et al. Operating system support for mobile agents, 1995, Proceedings 5th Workshop on Hot Topics in Operating Systems (HotOS-V).

[7] J. D. Day et al. A principle for resilient sharing of distributed resources, 1976, ICSE '76.

[8] Robbert van Renesse et al. An introduction to the TACOMA distributed system. Version 1.0, 1995.

[9] Sam Toueg et al. Fault-tolerant broadcasts and related problems, 1993.

[10] Alberto Montresor et al. System support for partition-aware network applications, 1998, Proceedings. 18th International Conference on Distributed Computing Systems (Cat. No.98CB36183).

[11] Richard D. Schlichting et al. Fault-Tolerant Broadcasts, 1984, Sci. Comput. Program.

[12] Willy Zwaenepoel et al. On the use and implementation of message logging, 1994, Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing.

[13] Gene Tsudik et al. Itinerant Agents for Mobile Computing, 1995, IEEE Communications Surveys & Tutorials.