Planning with Expectation Models

Distribution models and sample models are two popular choices in model-based reinforcement learning (MBRL). However, learning these models can be intractable, particularly when the state and action spaces are large. Expectation models, on the other hand, are relatively easy to learn owing to their compactness, and they have been widely used for deterministic environments. For stochastic environments, it is not obvious how expectation models can be used for planning, because they only partially characterize the next-state distribution. In this paper, we propose a sound way of using approximate expectation models for MBRL. In particular, we 1) show that planning with an expectation model is equivalent to planning with a distribution model if the state value function is linear in the state features, 2) analyze two common parametrization choices for approximating the expectation: linear and non-linear expectation models, 3) propose a sound model-based policy-evaluation algorithm and present its convergence results, and 4) empirically demonstrate the effectiveness of the proposed planning algorithm.
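As a brief sketch of the equivalence in point 1 (the symbols here, a feature map $\phi$, weight vector $\mathbf{w}$, and transition distribution $p$, are assumed for illustration and may differ from the paper's notation): if the value estimate is linear in the features, $\hat{v}(s) = \mathbf{w}^\top \phi(s)$, then by linearity of expectation

$$\mathbb{E}_{s' \sim p(\cdot \mid s,a)}\big[\hat{v}(s')\big] \;=\; \mathbf{w}^\top \, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\big[\phi(s')\big],$$

so a one-step planning backup needs only the expected next feature vector, which is exactly what an expectation model provides; no further information about the distribution is required, even in stochastic environments.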
