DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization

Most reinforcement learning algorithms seek a single optimal strategy that solves a given task. However, it can often be valuable to learn a diverse set of solutions, for instance, to make an agent's interaction with users more engaging, or to improve the robustness of a policy to unexpected perturbations. We propose Diversity-Guided Policy Optimization (DGPO), an on-policy algorithm that discovers multiple strategies for solving a given task. Unlike prior work, it achieves this with a shared policy network trained over a single run. Specifically, we design an intrinsic reward based on an information-theoretic diversity objective. Our final objective alternates between constraints on the diversity of the strategies and constraints on the extrinsic reward. We solve the constrained optimization problem by casting it as a probabilistic inference task and using policy iteration to maximize the derived lower bound. Experimental results show that our method efficiently discovers diverse strategies across a wide variety of reinforcement learning tasks. Compared to baseline methods, DGPO achieves comparable rewards while discovering more diverse strategies, often with better sample efficiency.
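To make the information-theoretic diversity objective concrete, below is a minimal sketch of the standard variational lower bound on the mutual information I(z; s) between a latent strategy z and the states it visits, in the style of a DIAYN discriminator: a learned network q(z | s) tries to infer the active strategy from the state, and the intrinsic reward log q(z | s) - log p(z) is high when strategies visit distinguishable states. This is a hedged illustration of the general technique, not DGPO's actual implementation; all class and function names here are assumptions.

```python
# Illustrative sketch of an information-theoretic diversity reward
# (variational lower bound on I(z; s)); names are hypothetical,
# not DGPO's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Approximates q(z | s): which latent strategy produced this state?"""
    def __init__(self, state_dim: int, num_strategies: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_strategies),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # logits over strategies

def diversity_reward(disc: Discriminator, state: torch.Tensor,
                     z: torch.Tensor, num_strategies: int) -> torch.Tensor:
    """Intrinsic reward r_int = log q(z | s) - log p(z).

    With a uniform prior p(z) = 1/K, this lower-bounds I(z; s) and is
    large when the discriminator can identify the active strategy from
    the state, i.e. when strategies are behaviorally distinct.
    """
    log_q = F.log_softmax(disc(state), dim=-1)
    log_q_z = log_q.gather(-1, z.unsqueeze(-1)).squeeze(-1)
    log_p_z = -torch.log(torch.tensor(float(num_strategies)))
    return log_q_z - log_p_z

if __name__ == "__main__":
    disc = Discriminator(state_dim=4, num_strategies=5)
    s = torch.randn(8, 4)            # batch of states
    z = torch.randint(0, 5, (8,))    # active strategy per trajectory
    r_int = diversity_reward(disc, s, z, num_strategies=5)
    print(r_int.shape)               # torch.Size([8])
```

In the alternating scheme the abstract describes, a reward along these lines would be optimized whenever the extrinsic-reward constraint is satisfied, and the extrinsic reward otherwise; the sketch shows only the diversity term itself.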
