Cooperative behavior acquisition by asynchronous policy renewal that enables simultaneous learning in a multiagent environment

This paper presents a method for simultaneous learning in a multiagent environment that facilitates cooperative behavior acquisition. Each agent maintains one policy and one action value function: the policy executes actions based on the action value function updated in the previous stage, while the action value function is learned from episodes experienced under the current policy. As a result, all agents behave according to fixed policies, so the non-Markovian problem is avoided except during the update periods, which depend on each agent's learning progress. To avoid local maxima caused by such asynchronous renewal of action value functions, the action values are initialized optimistically, which prevents the exploration process from becoming trapped. Experimental results for a cooperative task in a dynamic multiagent environment, RoboCup, are shown and discussed.
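A minimal sketch of the scheme described above, assuming a tabular Q-learning setting; the class, method, and parameter names (e.g., AsyncAgent, optimistic_init, renewal_threshold) are illustrative assumptions, not taken from the paper.

```python
import random
from collections import defaultdict


class AsyncAgent:
    """One agent: acts on a frozen value function, learns a fresh one."""

    def __init__(self, actions, optimistic_init=10.0, alpha=0.1, gamma=0.9):
        self.actions = actions
        # Behavior values: fixed during a learning stage; the agent acts on them.
        self.q_behavior = defaultdict(lambda: optimistic_init)
        # Learning values: updated from episodes experienced under the fixed policy.
        # Optimistic initial values encourage exploration of untried actions.
        self.q_learning = defaultdict(lambda: optimistic_init)
        self.alpha, self.gamma = alpha, gamma
        self.updates = 0

    def act(self, state, epsilon=0.05):
        # Act on the previous stage's value function, so every agent's
        # behavior stays fixed while the others learn.
        if random.random() < epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q_behavior[(state, a)])

    def learn(self, state, action, reward, next_state):
        # Standard Q-learning update applied only to the learning copy.
        best_next = max(self.q_learning[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q_learning[(state, action)]
        self.q_learning[(state, action)] += self.alpha * td_error
        self.updates += 1

    def maybe_renew(self, renewal_threshold=5000):
        # Asynchronous renewal: when this agent's own learning progress is
        # sufficient, copy the learned values into the behavior policy.
        if self.updates >= renewal_threshold:
            self.q_behavior = self.q_learning.copy()
            self.updates = 0
```

Because each agent renews only when its own update count crosses the threshold, the renewals are asynchronous across agents, while between renewals all agents act on fixed policies.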