Energy-based Surprise Minimization for Multi-Agent Value Factorization

Multi-Agent Reinforcement Learning (MARL) has achieved significant success in training decentralised policies in a centralised manner via value factorization methods. However, handling surprise across spurious states and mitigating approximation bias remain open problems in multi-agent settings. We introduce the Energy-based MIXer (EMIX), an algorithm that minimizes surprise by utilizing the energy across agents. Our contributions are threefold: (1) EMIX introduces a novel surprise-minimization technique for multi-agent partially-observable settings; (2) EMIX presents, to our knowledge, the first practical use of energy functions in MARL, with theoretical guarantees and experimental validation of the energy operator; and (3) EMIX presents a novel technique for addressing overestimation bias across agents in MARL. When evaluated on a range of challenging StarCraft II micromanagement scenarios, EMIX demonstrates consistent state-of-the-art performance for multi-agent surprise minimization. Moreover, our ablation study highlights the necessity of the energy-based scheme and the need to eliminate overestimation bias in MARL. Our implementation of EMIX and videos of our agents are available at this https URL.
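Since the abstract describes the mechanism only at a high level, the following is a minimal, hedged sketch of how an energy-based surprise term and an overestimation-bias correction could be attached to a QMIX-style monotonic mixer. Everything here is an illustrative assumption rather than the authors' published implementation: the names EnergyMixer, surprise_energy, and emix_loss are hypothetical, the log-sum-exp energy is one standard choice of energy operator, and the min over an ensemble of target mixers is a stand-in for whatever bias correction EMIX actually uses.

```python
# A PyTorch sketch of the ideas in the abstract, under stated assumptions:
# a QMIX-style mixer whose TD objective is regularized by an energy-based
# surprise term and whose target takes a minimum over an ensemble of target
# mixers to curb overestimation bias.
import torch
import torch.nn as nn


class EnergyMixer(nn.Module):
    """Monotonic QMIX-style mixer: combines per-agent Qs into Q_tot."""

    def __init__(self, n_agents: int, state_dim: int, hidden: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.hidden = hidden
        # Hypernetworks produce state-conditioned mixing weights; taking
        # their absolute value keeps Q_tot monotonic in each agent's Q,
        # as in QMIX.
        self.w1 = nn.Linear(state_dim, n_agents * hidden)
        self.b1 = nn.Linear(state_dim, hidden)
        self.w2 = nn.Linear(state_dim, hidden)
        self.b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.w1(state)).view(b, self.n_agents, self.hidden)
        b1 = self.b1(state).view(b, 1, self.hidden)
        h = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.w2(state)).view(b, self.hidden, 1)
        return (torch.bmm(h, w2) + self.b2(state).view(b, 1, 1)).view(b)


def surprise_energy(agent_qs: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    # One plausible energy across agents: a soft-minimum of per-agent Qs
    # via negative log-sum-exp. Low energy corresponds to states the whole
    # team values; high energy flags surprising or spurious states.
    return -torch.logsumexp(-beta * agent_qs, dim=-1) / beta


def emix_loss(mixer, target_mixers, agent_qs, target_qs, state, next_state,
              reward, gamma: float = 0.99, lam: float = 0.1) -> torch.Tensor:
    q_tot = mixer(agent_qs, state)
    with torch.no_grad():
        # Min over an ensemble of target mixers to curb overestimation
        # bias, in the spirit of Maxmin/Double Q-learning.
        targets = torch.stack([m(target_qs, next_state) for m in target_mixers])
        y = reward + gamma * targets.min(dim=0).values
    td_loss = ((q_tot - y) ** 2).mean()
    # Surprise-minimization penalty: discourage high-energy states.
    return td_loss + lam * surprise_energy(agent_qs).mean()
```

The min-over-ensemble target mirrors the bias control of Maxmin and Double Q-learning, and the log-sum-exp energy is a common operator in energy-based RL; EMIX's actual operator, objective, and weighting may well differ.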
