Model-Based Reinforcement Learning for Atari

Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction -- substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models, and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games in the low-data regime of 100K interactions between the agent and the environment, which corresponds to two hours of real-time play. In most games, SimPLe outperforms state-of-the-art model-free algorithms, in some games by over an order of magnitude.
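For concreteness, the sketch below outlines the alternating training loop the abstract describes: collect a small amount of real experience, fit the video prediction model, then train the policy inside the learned model. The helper names (collect_interactions, train_world_model, train_policy_ppo) and the per-iteration step counts are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of Simulated Policy Learning (SimPLe), assuming hypothetical
# helper functions. The iteration and step counts are illustrative, chosen so
# the total real-environment budget stays near the 100K interactions above.

def simulated_policy_learning(real_env, policy, world_model,
                              num_iterations=15,
                              real_steps_per_iteration=6400):
    dataset = []
    for _ in range(num_iterations):
        # 1. Gather a small batch of real interactions with the current policy.
        dataset += collect_interactions(real_env, policy,
                                        steps=real_steps_per_iteration)

        # 2. Fit the action-conditional video prediction model (the world
        #    model) on all real data gathered so far.
        train_world_model(world_model, dataset)

        # 3. Improve the policy with model-free RL (e.g. PPO) entirely inside
        #    the learned model, consuming no additional real interactions.
        train_policy_ppo(policy, simulated_env=world_model)

    return policy
```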
