Engineering Control Strategies for Replication-Based Fault-Tolerant Multi-Agent Systems

This work focuses on engineering software replication techniques for distributed cooperative applications designed as multi-agent systems. Such applications are often very dynamic: agents can join or leave, and they can change roles or strategies. The relative importance of agents may also evolve during the course of computation and cooperation. This contrasts with traditional static approaches to replication, e.g., for databases, where critical servers can be identified at design time. We therefore need to dynamically and automatically identify the most critical agents and adapt their replication strategies (e.g., active or passive, number of replicas) in order to maximize their reliability and availability. An important question is then: what kind of information can be used to estimate which agents are most critical? In this paper, we first introduce our approach and prototype architecture for adaptive replication. We then discuss various kinds of information and strategies for estimating the criticality of agents: dynamic dependencies, roles, and plans. Some preliminary measurements and future directions are also presented.
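
To make the idea of criticality-driven replication concrete, the following is a minimal illustrative sketch, not the architecture or algorithm described in the paper. It assumes that criticality is estimated as the normalized sum of the weights of the dependencies other agents have on a given agent, and that a fixed budget of replicas is shared out in proportion to criticality. The names `estimate_criticality`, `plan_replication`, `ReplicationPlan`, and the `active_threshold` parameter are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ReplicationPlan:
    replicas: int   # number of extra replicas granted to the agent
    strategy: str   # "active" or "passive"


def estimate_criticality(
    dependencies: List[Tuple[str, str, float]]
) -> Dict[str, float]:
    """Assumed heuristic: an agent's criticality is the total weight of
    the dependencies that other agents have on it, normalized to sum to 1.

    Each dependency is a (dependent, provider, weight) triple.
    """
    scores: Dict[str, float] = {}
    for dependent, provider, weight in dependencies:
        scores.setdefault(dependent, 0.0)
        scores[provider] = scores.get(provider, 0.0) + weight
    total = sum(scores.values())
    if total == 0:
        return {agent: 0.0 for agent in scores}
    return {agent: score / total for agent, score in scores.items()}


def plan_replication(
    criticality: Dict[str, float],
    budget: int,
    active_threshold: float = 0.5,  # hypothetical cut-off for active replication
) -> Dict[str, ReplicationPlan]:
    """Share `budget` replicas in proportion to criticality, using the
    largest-remainder method to hand out the replicas lost to rounding."""
    shares = {agent: c * budget for agent, c in criticality.items()}
    counts = {agent: int(share) for agent, share in shares.items()}
    if any(criticality.values()):
        leftover = budget - sum(counts.values())
        by_remainder = sorted(
            shares, key=lambda a: shares[a] - counts[a], reverse=True
        )
        for agent in by_remainder[:leftover]:
            counts[agent] += 1
    return {
        agent: ReplicationPlan(
            replicas=counts[agent],
            strategy="active" if criticality[agent] >= active_threshold else "passive",
        )
        for agent in criticality
    }


# Example: two agents depend on "broker", two on "logger".
deps = [
    ("buyer", "broker", 3.0),
    ("seller", "broker", 3.0),
    ("broker", "logger", 2.0),
    ("buyer", "logger", 2.0),
]
plan = plan_replication(estimate_criticality(deps), budget=5)
for agent, p in sorted(plan.items()):
    print(agent, p.replicas, p.strategy)
```

Re-running both functions whenever the observed dependencies change is what would make such a scheme adaptive; in this sketch the dependency weights are simply supplied by the caller.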