A common gradient in multi-agent reinforcement learning

This article shows that seemingly diverse implementations of multi-agent reinforcement learning share the same basic building block in their learning dynamics: a mathematical term that is closely related to the gradient of the expected reward. Specifically, two independent branches of multi-agent learning research can be distinguished by their respective assumptions and premises. The first branch assumes that the value function of the game is known to all players, and uses it to update the learning policy by Gradient Ascent. Notable algorithms in this branch include Infinitesimal Gradient Ascent (IGA) [7], its variation Win or Learn Fast IGA (WoLF) [3], and the Weighted Policy Learner [1]. The second branch is concerned with learning in unknown environments through interaction-based Reinforcement Learning, and comprises those algorithms that have been shown to be formally connected to the replicator equations of Evolutionary Game Theory. Here, the learning agent updates its policy based on a sequence of action-reward pairs that indicate the quality of the actions taken. Notable algorithms include Cross Learning (CL) [4], Regret Minimization (RM) [6], and Frequency Adjusted Q-learning (FAQ) [5].

This article demonstrates the inherent similarities between these diverse families of algorithms by comparing their underlying learning dynamics, derived as the continuous-time limit of their policy updates. These dynamics have already been investigated for algorithms from each family separately [1, 2, 3, 5, 6, 7]; however, they have not yet been discussed in relation to each other, and the origin of their similarity has not been explained satisfactorily. In addition to the formal derivation, directional field plots of the learning dynamics in representative classes of two-player two-action games illustrate the similarities and strengthen the theoretical findings.
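To make the shared building block concrete, consider the standard two-player two-action setting: let A be the payoff matrix of the learning agent, let α and β denote the two players' probabilities of selecting their first action, and let V(α, β) = xᵀAy with x = (α, 1−α) and y = (β, 1−β) be the expected reward. The display below is only a restatement of well-known dynamics under these assumptions, not the article's own derivation; it shows that the gradient ∂V/∂α appears both in the continuous-time limit of IGA (with step-size parameter η) and in the replicator dynamics associated with Cross Learning, the two differing only in a policy-dependent scaling factor.

```latex
% Well-known learning dynamics of a two-player two-action game in the notation
% introduced above (illustrative restatement, not the article's derivation).
\[
  \underbrace{\dot{\alpha} \;=\; \eta\,\frac{\partial V(\alpha,\beta)}{\partial \alpha}}_{\text{IGA}}
  \qquad
  \underbrace{\dot{\alpha} \;=\; \alpha(1-\alpha)\,\frac{\partial V(\alpha,\beta)}{\partial \alpha}}_{\text{replicator dynamics / Cross Learning}}
  \qquad\text{with}\qquad
  \frac{\partial V}{\partial \alpha} \;=\; (A y)_1 - (A y)_2 .
\]
% The replicator form follows directly from the replicator equation:
\[
  \dot{x}_1 \;=\; x_1\bigl[(A y)_1 - x^{\top} A y\bigr]
            \;=\; \alpha(1-\alpha)\bigl[(A y)_1 - (A y)_2\bigr]
            \;=\; \alpha(1-\alpha)\,\frac{\partial V}{\partial \alpha}.
\]
```

Both updates move α in the direction of the same gradient of the expected reward; the replicator form merely rescales it by α(1−α), a common term of the kind the article identifies.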
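Directional field plots of this kind can be produced in a few lines. The following Python sketch is a hypothetical illustration only: the choice of game (Matching Pennies), the grid resolution, and names such as replicator_field are assumptions made here, not taken from the article. It draws the replicator-dynamics direction field over the joint policy space of a two-player two-action game.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative two-player two-action game: Matching Pennies.
# A holds the row player's payoffs, B the column player's (zero-sum here).
A = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])
B = -A

def replicator_field(alpha, beta):
    """Replicator dynamics of the first-action probabilities (alpha, beta)."""
    # Gradients of the expected rewards with respect to the own policy:
    # dV_row/dalpha = (A y)_1 - (A y)_2 and dV_col/dbeta = (x^T B)_1 - (x^T B)_2.
    dV_row = (A[0, 0] - A[1, 0]) * beta + (A[0, 1] - A[1, 1]) * (1.0 - beta)
    dV_col = (B[0, 0] - B[0, 1]) * alpha + (B[1, 0] - B[1, 1]) * (1.0 - alpha)
    # The gradient is scaled by alpha(1 - alpha) and beta(1 - beta), respectively.
    return alpha * (1.0 - alpha) * dV_row, beta * (1.0 - beta) * dV_col

# Evaluate the dynamics on a grid over the joint policy space [0, 1] x [0, 1].
alpha, beta = np.meshgrid(np.linspace(0.05, 0.95, 15), np.linspace(0.05, 0.95, 15))
dalpha, dbeta = replicator_field(alpha, beta)

plt.quiver(alpha, beta, dalpha, dbeta)
plt.xlabel(r"$\alpha$ (row player's probability of action 1)")
plt.ylabel(r"$\beta$ (column player's probability of action 1)")
plt.title("Replicator dynamics in Matching Pennies")
plt.show()
```

Under the same assumptions, swapping the α(1−α) and β(1−β) scaling factors for a constant step size yields the corresponding direction field of IGA.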