Learning Diverse Risk Preferences in Population-based Self-play

Among the great successes of Reinforcement Learning (RL), self-play algorithms play an essential role in solving competitive games. Current self-play algorithms optimize the agent to maximize the expected win-rate against its current or historical copies, which often traps the agent in a local optimum and leaves its strategy style simple and homogeneous. A possible remedy is to improve policy diversity, which helps the agent break the stalemate and makes it more robust against different opponents. However, enhancing diversity in self-play algorithms is not trivial. In this paper, we introduce diversity from the perspective that agents may hold diverse risk preferences in the face of uncertainty. Specifically, we design a novel reinforcement learning algorithm, Risk-sensitive Proximal Policy Optimization (RPPO), which smoothly interpolates between worst-case and best-case policy learning and thereby enables policy learning with a desired risk preference. By seamlessly integrating RPPO with population-based self-play, agents in the population optimize dynamic risk-sensitive objectives using experience gathered from playing against diverse opponents. Empirical results show that our method achieves comparable or superior performance in competitive games and that diverse modes of behavior emerge. Our code is publicly available at \url{https://github.com/Jackory/RPBT}.
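This excerpt does not spell out how RPPO realizes the interpolation between worst-case and best-case policy learning, so the sketch below is only a minimal illustration of the general idea: a single risk parameter τ in (0, 1) that tilts a critic between pessimistic and optimistic value estimation via an expectile (asymmetric least squares) loss. The function name `expectile_value_loss` and the use of PyTorch are our own assumptions for illustration, not the paper's confirmed implementation.

```python
import torch

def expectile_value_loss(v_pred: torch.Tensor,
                         v_target: torch.Tensor,
                         tau: float) -> torch.Tensor:
    """Asymmetric L2 (expectile) loss between value predictions and targets.

    tau = 0.5 recovers the ordinary mean-squared-error critic. tau < 0.5
    penalizes over-estimation more heavily, pulling the critic toward
    pessimistic (worst-case-leaning) values; tau > 0.5 penalizes
    under-estimation more heavily, pulling it toward optimistic
    (best-case-leaning) values.
    """
    diff = v_target - v_pred                    # positive => value under-estimated
    weight = torch.abs(tau - (diff < 0).float())  # expectile weight |tau - 1{u<0}|
    return (weight * diff.pow(2)).mean()


# Example: one pessimistic critic update (tau = 0.1) on stand-in data.
v_pred = torch.randn(64, requires_grad=True)    # stand-in for critic outputs
v_target = torch.randn(64)                      # stand-in for return targets
loss = expectile_value_loss(v_pred, v_target, tau=0.1)
loss.backward()
```

In a population-based self-play setup, one could then assign each agent its own τ and perturb it over training, PBT-style, so that the risk-sensitive advantages derived from such critics drive the PPO updates of different agents toward diverse risk preferences.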
