Towards a Simple Approach to Multi-step Model-based Reinforcement Learning

When environmental interaction is expensive, model-based reinforcement learning offers a solution by planning ahead and avoiding costly mistakes. Model-based agents typically learn a single-step transition model. In this paper, we propose a multi-step model that predicts the outcome of an action sequence of variable length. We show that this model is easy to learn and that it can make policy-conditional predictions. We report preliminary results that show a clear advantage for the multi-step model over its one-step counterpart.
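
As a rough, illustrative sketch (not the authors' implementation), a multi-step model of this kind can be realized as a network that encodes the current state, consumes a variable-length action sequence with a recurrent layer, and decodes the resulting hidden state into a prediction of the state reached after executing the whole sequence. The class name, layer sizes, GRU-based encoder, and training routine below are assumptions made purely for illustration.

    # Minimal sketch of a multi-step transition model: given a start state and a
    # variable-length action sequence, predict the state reached after executing
    # the entire sequence. All names and sizes here are illustrative assumptions.
    import torch
    import torch.nn as nn

    class MultiStepModel(nn.Module):
        def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
            super().__init__()
            # Encode the starting state into the recurrent hidden state.
            self.state_encoder = nn.Linear(state_dim, hidden_dim)
            # Consume the action sequence one action at a time.
            self.rnn = nn.GRU(action_dim, hidden_dim, batch_first=True)
            # Decode the final hidden state into a predicted future state.
            self.state_decoder = nn.Linear(hidden_dim, state_dim)

        def forward(self, state: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
            # state:   (batch, state_dim)
            # actions: (batch, seq_len, action_dim); seq_len can vary across calls
            h0 = torch.tanh(self.state_encoder(state)).unsqueeze(0)  # (1, batch, hidden)
            _, h_final = self.rnn(actions, h0)
            return self.state_decoder(h_final.squeeze(0))

    # Training sketch: regress the model's k-step prediction onto the state
    # actually observed after the k actions in a stored trajectory segment.
    def train_step(model, optimizer, state, actions, target_state):
        pred = model(state, actions)
        loss = nn.functional.mse_loss(pred, target_state)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

At decision time, an agent could query such a model with candidate action sequences of different lengths and compare the predicted outcomes directly, rather than chaining many one-step predictions and compounding their errors.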
