Merging Individually Learned Optimal Results to Accelerate Coordination

By merging agents’ individually learned optimal value functions, agents can learn their optimal policies in a multiagent system. Prior knowledge of the task is used to decompose it into several subtasks, and this decomposition greatly reduces the state and action spaces. The optimal value function of each subtask is learned with the MAXQ-Q [1] algorithm. By defining lower and upper bounds on the value function of the whole task, we propose a novel online multiagent learning algorithm, LU-Q, which accelerates the learning of coordination among multiple agents through task decomposition and action pruning.
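To make the role of the bounds concrete, the following minimal Python sketch illustrates the general idea of pruning actions with lower and upper bounds on value estimates. It is not the paper's LU-Q algorithm; the class and method names (`LUQPruner`, `update`, `admissible_actions`) and the initialisation scheme are assumptions introduced for illustration only.

```python
# Illustrative sketch of interval-based action pruning (hypothetical names;
# the exact LU-Q update rules from the paper are not reproduced here).
from collections import defaultdict


class LUQPruner:
    """Keeps lower/upper bounds on Q-values and prunes dominated actions."""

    def __init__(self, actions, v_min, v_max):
        self.actions = list(actions)
        # Initialise bounds to the loosest values allowed by the task.
        self.lower = defaultdict(lambda: v_min)   # L(s, a)
        self.upper = defaultdict(lambda: v_max)   # U(s, a)

    def update(self, state, action, new_lower, new_upper):
        """Tighten the bounds for (state, action) after a learning step."""
        key = (state, action)
        self.lower[key] = max(self.lower[key], new_lower)
        self.upper[key] = min(self.upper[key], new_upper)

    def admissible_actions(self, state):
        """Drop any action whose upper bound falls below the best lower
        bound in this state: such an action can never be optimal, so the
        learner no longer needs to explore it."""
        best_lower = max(self.lower[(state, a)] for a in self.actions)
        return [a for a in self.actions
                if self.upper[(state, a)] >= best_lower]
```

In this sketch, pruning shrinks the set of actions an agent must consider in each state as the bounds tighten, which is the mechanism by which bound-based pruning can speed up coordination learning.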