The Effect of Planning Shape on Dyna-style Planning in High-dimensional State Spaces

Dyna is an architecture for reinforcement learning agents that interleaves planning, acting, and learning in an online setting. The architecture aims to make fuller use of limited experience, achieving better performance with fewer environmental interactions. Dyna has been well studied in problems with a tabular representation of states, and has also been extended to some settings with larger state spaces that require function approximation. However, little work has studied Dyna in environments with high-dimensional state spaces such as images. In Dyna, the environment model is typically used to generate one-step transitions from selected start states. We applied one-step Dyna to several games from the Arcade Learning Environment and found that the model-based updates offered surprisingly little benefit, even with a perfect model. However, when the model was used to generate longer trajectories of simulated experience, performance improved dramatically. This effect also held when the model was learned from experience; even though the learned model was flawed, it could still be used to accelerate learning.
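Since the abstract turns on the distinction between one-step model updates and longer model rollouts, a minimal sketch may help make "planning shape" concrete. This is not the paper's implementation: the `model`, `policy`, and `q_update` callables, and the toy chain domain used below, are hypothetical stand-ins for a (perfect or learned) environment model, a rollout policy, and a one-step value update.

```python
import random

def dyna_planning_round(model, policy, q_update, start_states, rollout_length):
    """One round of Dyna-style planning from a sampled start state.

    With rollout_length == 1 this is classic one-step Dyna; larger values
    generate a longer trajectory of simulated experience from the model.
    """
    state = random.choice(start_states)  # plan from a previously visited state
    for _ in range(rollout_length):
        action = policy(state)                            # rollout policy
        reward, next_state, done = model(state, action)   # simulated transition
        q_update(state, action, reward, next_state, done) # model-based update
        if done:
            break
        state = next_state

# Hypothetical usage on a toy chain with terminal states 0 and 5
# (reward 1.0 for reaching state 5), with a tabular Q-learning update:
Q = {}

def toy_model(s, a):
    s2 = s + (1 if a == 1 else -1)
    return (1.0 if s2 == 5 else 0.0), s2, s2 in (0, 5)

def toy_policy(s):
    return random.choice([0, 1])

def toy_q_update(s, a, r, s2, done, alpha=0.1, gamma=0.99):
    target = r if done else r + gamma * max(Q.get((s2, b), 0.0) for b in (0, 1))
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

for _ in range(1000):
    dyna_planning_round(toy_model, toy_policy, toy_q_update,
                        start_states=[1, 2, 3, 4], rollout_length=5)
```

Setting `rollout_length=1` recovers the one-step updates that offered little benefit in the reported experiments; larger values produce the longer simulated trajectories that improved performance.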
