Mastering Atari, Go, chess and shogi by planning with a learned model

Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess [5] and Go [23], where a perfect simulator is available. However, in real-world problems, the dynamics governing the environment are often complex and unknown. Here we present the MuZero algorithm, which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. The MuZero algorithm learns an iterable model that produces predictions relevant to planning: the action-selection policy, the value function and the reward. When evaluated on 57 different Atari games [20] (the canonical video game environment for testing artificial intelligence techniques, in which model-based planning approaches have historically struggled [33]), the MuZero algorithm achieved state-of-the-art performance. When evaluated on Go, chess and shogi (canonical environments for high-performance planning), the MuZero algorithm matched, without any knowledge of the game dynamics, the superhuman performance of the AlphaZero algorithm [41] that was supplied with the rules of the game.
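The abstract says only that the learned model is iterable and that it predicts a policy, a value and a reward. The following is a minimal sketch of what "iterable" could mean in code, not the authors' implementation. The names `LearnedModel`, `represent`, `dynamics`, `predict` and `unroll`, the split into three functions, and the toy arithmetic stand-ins are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]  # hypothetical hidden state; not an environment state


@dataclass
class Prediction:
    policy: List[float]  # action-selection probabilities
    value: float         # predicted value
    reward: float        # predicted reward for the step that led here


@dataclass
class LearnedModel:
    # All three callables are illustrative stand-ins for learned functions.
    represent: Callable[[Sequence[float]], State]
    dynamics: Callable[[State, int], Tuple[State, float]]
    predict: Callable[[State], Tuple[List[float], float]]


def unroll(model: LearnedModel, observation: Sequence[float],
           actions: Sequence[int]) -> List[Prediction]:
    """Apply the model step by step along a hypothetical action sequence."""
    state = model.represent(observation)
    policy, value = model.predict(state)
    outputs = [Prediction(policy, value, 0.0)]  # no reward before the first action
    for action in actions:
        # The model is applied to its own previous output: this is the iteration.
        state, reward = model.dynamics(state, action)
        policy, value = model.predict(state)
        outputs.append(Prediction(policy, value, reward))
    return outputs


# Toy stand-ins (assumed, not learned): simple arithmetic so the result is checkable.
toy = LearnedModel(
    represent=lambda obs: tuple(obs),
    dynamics=lambda s, a: (tuple(x + a for x in s), float(a)),
    predict=lambda s: ([0.5, 0.5], sum(s)),
)

outputs = unroll(toy, [1.0, 2.0], [1, 0, 1])
```

A tree search would call something like `dynamics` and `predict` at each node it expands, so it never needs the real rules of the environment, only the model's own predictions.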

Demis Hassabis | Karen Simonyan | David Silver | Thore Graepel | Thomas Hubert | Julian Schrittwieser | Ioannis Antonoglou | Arthur Guez | Laurent Sifre | Timothy Lillicrap | Edward Lockhart | Simon Schmitt

[1] Jonathan Schaeffer, et al. A World Championship Caliber Checkers Program, 1992, Artif. Intell.

[2] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[3] Subbarao Kambhampati, et al. Planning and Scheduling, 1997, The Computer Science and Engineering Handbook.

[4] Doina Precup, et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, 1999, Artif. Intell.

[5] Murray Campbell, et al. Deep Blue, 2002, Artif. Intell.

[6] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[7] Rémi Coulom, et al. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search, 2006, Computers and Games.

[8] Csaba Szepesvári, et al. Bandit Based Monte-Carlo Planning, 2006, ECML.

[9] Rémi Coulom, et al. Whole-History Rating: A Bayesian Rating System for Players of Time-Varying Strength, 2008, Computers and Games.

[10] H. Jaap van den Herik, et al. Single-Player Monte-Carlo Tree Search, 2008, Computers and Games.

[11] Carl E. Rasmussen, et al. PILCO: A Model-Based and Data-Efficient Approach to Policy Search, 2011, ICML.

[12] Christopher D. Rosin, et al. Multi-armed bandits with episode context, 2011, Annals of Mathematics and Artificial Intelligence.

[13] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[14] Sergey Levine, et al. Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics, 2014, NIPS.

[15] Thomas B. Schön, et al. From Pixels to Torques: Policy Learning with Deep Dynamical Models, 2015, ICML.

[16] Yuval Tassa, et al. Learning Continuous Control Policies by Stochastic Value Gradients, 2015, NIPS.

[17] Martin A. Riedmiller, et al. Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images, 2015, NIPS.

[18] Shane Legg, et al. Massively Parallel Methods for Deep Reinforcement Learning, 2015, ArXiv.

[19] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[20] Marc G. Bellemare, et al. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract), 2012, IJCAI.

[21] Jian Sun, et al. Identity Mappings in Deep Residual Networks, 2016, ECCV.

[22] Pieter Abbeel, et al. Value Iteration Networks, 2016, NIPS.

[23] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[24] Tom Schaul, et al. Prioritized Experience Replay, 2015, ICLR.

[25] Tom Schaul, et al. The Predictron: End-To-End Learning and Planning, 2016, ICML.

[26] Kevin Waugh, et al. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker, 2017, Science.

[27] Demis Hassabis, et al. Mastering the game of Go without human knowledge, 2017, Nature.

[28] Satinder Singh, et al. Value Prediction Network, 2017, NIPS.

[29] Tom Schaul, et al. Reinforcement Learning with Unsupervised Auxiliary Tasks, 2016, ICLR.

[30] Shimon Whiteson, et al. TreeQN and ATreeC: Differentiable Tree Planning for Deep Reinforcement Learning, 2017, ICLR 2018.

[31] Tom Schaul, et al. Rainbow: Combining Improvements in Deep Reinforcement Learning, 2017, AAAI.

[32] David Budden, et al. Distributed Prioritized Experience Replay, 2018, ICLR.

[33] Marlos C. Machado, et al. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents (Extended Abstract), 2018, IJCAI.

[34] Shane Legg, et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, 2018, ICML.

[35] Rémi Munos, et al. Observe and Look Further: Achieving Consistent Performance on Atari, 2018, ArXiv.

[36] Jürgen Schmidhuber, et al. Recurrent World Models Facilitate Policy Evolution, 2018, NeurIPS.

[37] Weitang Liu, et al. Surprising Negative Results for Generative Adversarial Tree Search, 2018, arXiv:1806.05780.

[38] Noam Brown, et al. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals, 2018, Science.

[39] Mike Preuss, et al. Planning chemical syntheses with deep neural networks and symbolic AI, 2017, Nature.

[40] Fabio Viola, et al. Learning and Querying Fast Generative Models for Reinforcement Learning, 2018, ArXiv.

[41] Demis Hassabis, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, 2018, Science.

[42] Marc G. Bellemare, et al. DeepMDP: Learning Continuous Latent Space Models for Representation Learning, 2019, ICML.

[43] Matteo Hessel, et al. When to use parametric models in reinforcement learning?, 2019, NeurIPS.

[44] Wojciech M. Czarnecki, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning, 2019, Nature.

[45] Rémi Munos, et al. Recurrent Experience Replay in Distributed Reinforcement Learning, 2018, ICLR.

[46] Ruben Villegas, et al. Learning Latent Dynamics for Planning from Pixels, 2018, ICML.

[47] Sergey Levine, et al. Model-Based Reinforcement Learning for Atari, 2019, ICLR.

[48] Karen Simonyan, et al. Off-Policy Actor-Critic with Shared Experience Replay, 2020, ICML.