Energy-based Surprise Minimization for Multi-Agent Value Factorization

Multi-Agent Reinforcement Learning (MARL) has achieved significant success in training decentralised policies in a centralised manner via value factorization methods. However, handling surprise across spurious states and mitigating approximation bias remain open problems in multi-agent settings. We introduce the Energy-based MIXer (EMIX), an algorithm that minimizes surprise by utilizing the energy across agents. Our contributions are threefold: (1) EMIX introduces a novel surprise-minimization technique for multi-agent partially-observable settings; (2) EMIX presents, to our knowledge, the first practical use of energy functions in MARL, with theoretical guarantees and experimental validation of the energy operator; and (3) EMIX presents a novel technique for addressing overestimation bias across agents in MARL. When evaluated on a range of challenging StarCraft II micromanagement scenarios, EMIX demonstrates consistent state-of-the-art performance for multi-agent surprise minimization. Moreover, our ablation study highlights the necessity of the energy-based scheme and the need to eliminate overestimation bias in MARL. Our implementation of EMIX and videos of our agents are available at this https URL.
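Since the abstract describes the mechanism only at a high level, the following is a minimal, hedged sketch of how an energy-based surprise term and an overestimation-bias correction could be attached to a QMIX-style monotonic mixer. Everything here is an illustrative assumption rather than the authors' published implementation: the names EnergyMixer, surprise_energy, and emix_loss are hypothetical, the log-sum-exp energy is one standard choice of energy operator, and the min over an ensemble of target mixers is a stand-in for whatever bias correction EMIX actually uses.

```python
# A PyTorch sketch of the ideas in the abstract, under stated assumptions:
# a QMIX-style mixer whose TD objective is regularized by an energy-based
# surprise term and whose target takes a minimum over an ensemble of target
# mixers to curb overestimation bias.
import torch
import torch.nn as nn


class EnergyMixer(nn.Module):
    """Monotonic QMIX-style mixer: combines per-agent Qs into Q_tot."""

    def __init__(self, n_agents: int, state_dim: int, hidden: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.hidden = hidden
        # Hypernetworks produce state-conditioned mixing weights; taking
        # their absolute value keeps Q_tot monotonic in each agent's Q,
        # as in QMIX.
        self.w1 = nn.Linear(state_dim, n_agents * hidden)
        self.b1 = nn.Linear(state_dim, hidden)
        self.w2 = nn.Linear(state_dim, hidden)
        self.b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.w1(state)).view(b, self.n_agents, self.hidden)
        b1 = self.b1(state).view(b, 1, self.hidden)
        h = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.w2(state)).view(b, self.hidden, 1)
        return (torch.bmm(h, w2) + self.b2(state).view(b, 1, 1)).view(b)


def surprise_energy(agent_qs: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    # One plausible energy across agents: a soft-minimum of per-agent Qs
    # via negative log-sum-exp. Low energy corresponds to states the whole
    # team values; high energy flags surprising or spurious states.
    return -torch.logsumexp(-beta * agent_qs, dim=-1) / beta


def emix_loss(mixer, target_mixers, agent_qs, target_qs, state, next_state,
              reward, gamma: float = 0.99, lam: float = 0.1) -> torch.Tensor:
    q_tot = mixer(agent_qs, state)
    with torch.no_grad():
        # Min over an ensemble of target mixers to curb overestimation
        # bias, in the spirit of Maxmin/Double Q-learning.
        targets = torch.stack([m(target_qs, next_state) for m in target_mixers])
        y = reward + gamma * targets.min(dim=0).values
    td_loss = ((q_tot - y) ** 2).mean()
    # Surprise-minimization penalty: discourage high-energy states.
    return td_loss + lam * surprise_energy(agent_qs).mean()
```

The min-over-ensemble target mirrors the bias control of Maxmin and Double Q-learning, and the log-sum-exp energy is a common operator in energy-based RL; EMIX's actual operator, objective, and weighting may well differ.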
