Dynamic snapshot algorithm and partial rollback algorithm for internet agents

This paper considers an Internet agent system in which a very large number of agents operate, frequently appearing and disappearing, and discusses a fault-tolerant algorithm for such a system. It first considers applying a snapshot algorithm, which captures the global state (a snapshot) of a distributed system, to the agent system. The snapshot algorithm of Chandy and Lamport [2] is taken as representative because of its high efficiency and simple procedure. It is not practical, however, to apply their algorithm directly to a distributed agent system in which a very large number of agents operate. This paper therefore extends the idea of Chandy and Lamport's algorithm and proposes a subsnapshot algorithm, in which a snapshot is taken only among agents that are causally related through message exchange and agent creation. An efficient rollback algorithm based on the snapshots taken by the subsnapshot algorithm is then proposed. In a general snapshot-based rollback algorithm, all agents must roll back; in the rollback algorithm proposed here, only some of the agents need to roll back. © 2005 Wiley Periodicals, Inc. Electron Comm Jpn Pt 3, 88(12): 43–57, 2005; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ecjc.20208
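The idea of restricting a snapshot and a rollback to causally related agents can be illustrated with a small sketch. This is not the algorithm from the paper, whose details are not given in the abstract. It assumes, purely for illustration, that causal relations are recorded as a list of (kind, source, destination) events, that each relation is treated as an undirected link, and that agent state is a single value per agent. The names `causal_group` and `partial_rollback` are hypothetical.

```python
from collections import defaultdict, deque


def causal_group(start, events):
    """Return the set of agents linked to `start` by message or creation events.

    `events` is a list of (kind, src, dst) tuples. Treating each event as an
    undirected link is a simplifying assumption of this sketch.
    """
    links = defaultdict(set)
    for _kind, src, dst in events:
        links[src].add(dst)
        links[dst].add(src)

    # Breadth-first search over the causal links, starting at `start`.
    seen = {start}
    queue = deque([start])
    while queue:
        agent = queue.popleft()
        for peer in links[agent]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


def partial_rollback(failed, events, state, snapshot):
    """Restore only the agents causally related to `failed`.

    Agents outside the group keep their current state. Agents in the group
    that have no snapshot entry are left unchanged (another simplification).
    """
    group = causal_group(failed, events)
    restored = dict(state)
    for agent in group:
        if agent in snapshot:
            restored[agent] = snapshot[agent]
    return restored, group


# Hypothetical run: a, b, c are causally related; d and e form a separate group.
events = [("message", "a", "b"), ("create", "b", "c"), ("message", "d", "e")]
state = {"a": 5, "b": 7, "c": 2, "d": 9, "e": 4}
snapshot = {"a": 1, "b": 2, "c": 0, "d": 3, "e": 1}
restored, group = partial_rollback("a", events, state, snapshot)
print(sorted(group))  # ['a', 'b', 'c']
print(restored)       # {'a': 1, 'b': 2, 'c': 0, 'd': 9, 'e': 4}
```

In this run, a failure at agent a rolls back only a, b, and c, while d and e continue from their current state. A rollback based on a single global snapshot would restore all five agents.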

[1]  Leslie Lamport, et al. Time, clocks, and the ordering of events in a distributed system, 1978, CACM.

[2]  Reid G. Smith, et al. The Contract Net Protocol: High-Level Communication and Control in a Distributed Problem Solver, 1980, IEEE Transactions on Computers.

[3]  Anita Borg, et al. A message system supporting fault tolerance, 1983, SOSP '83.

[4]  Leslie Lamport, et al. Distributed snapshots: determining global states of distributed systems, 1985, TOCS.

[5]  Jeffrey F. Naughton, et al. Checkpointing multicomputer applications, 1991, [1991] Proceedings Tenth Symposium on Reliable Distributed Systems.

[6]  Luís Moura Silva, et al. Global checkpointing for distributed programs, 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.

[7]  A distributed consistent global checkpoint algorithm with a minimum number of checkpoints, 1998, Proceedings Twelfth International Conference on Information Networking (ICOIN-12).

[8]  Michael B. Dillencourt, et al. An application-transparent, platform-independent approach to rollback-recovery for mobile agent systems, 2000, Proceedings 20th IEEE International Conference on Distributed Computing Systems.