Value-driven Hindsight Modelling

Value estimation is a critical component of the reinforcement learning (RL) paradigm. The question of how to effectively learn value predictors from data is one of the major problems studied by the RL community, and different approaches exploit structure in the problem domain in different ways. Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function. In contrast, model-free methods directly leverage the quantity of interest from the future but must contend with a potentially weak scalar signal (an estimate of the return). In this paper we develop an approach to representation learning in RL that sits between these two extremes: we propose to learn what to model in a way that directly helps value prediction. To this end we determine which features of the future trajectory provide useful information for predicting the associated return. This yields tractable prediction targets that are directly relevant to the task and can thus accelerate learning of the value function. The idea can be understood as reasoning, in hindsight, about which aspects of the future observations could have helped past value prediction. We show that this can help dramatically even in simple policy evaluation settings. We then test our approach at scale in challenging domains, including 57 Atari 2600 games.
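To make the idea concrete, below is a minimal sketch of how a hindsight-modelling setup could be wired together. It is an illustrative assumption written in PyTorch, not the paper's architecture: the class name `HindsightValue`, the network shapes, and the equal loss weighting are all invented for the example. A hindsight value head sees compressed features `phi` of the observed future trajectory, a model learns to predict those features from the current state alone, and the regular value head conditions on the predicted features.

```python
import torch
import torch.nn as nn

class HindsightValue(nn.Module):
    """Sketch of hindsight modelling: learn return-relevant features of the
    future (phi), learn to predict them from the present (phi_hat), and feed
    the prediction into the value estimate."""

    def __init__(self, state_dim, future_dim, phi_dim=8, hidden=64):
        super().__init__()
        # phi: features of the observed future trajectory (available only in hindsight)
        self.phi = nn.Sequential(nn.Linear(future_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, phi_dim))
        # model: predicts phi from the current state (usable at decision time)
        self.model = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, phi_dim))
        # hindsight value head v+(s, phi) and regular value head v(s, phi_hat)
        self.v_plus = nn.Linear(state_dim + phi_dim, 1)
        self.v = nn.Linear(state_dim + phi_dim, 1)

    def losses(self, state, future, ret):
        phi = self.phi(future)        # hindsight features of the future trajectory
        phi_hat = self.model(state)   # model's prediction of those features
        v_plus = self.v_plus(torch.cat([state, phi], dim=-1)).squeeze(-1)
        # detach so the regular value loss does not reshape the learned model
        v = self.v(torch.cat([state, phi_hat.detach()], dim=-1)).squeeze(-1)
        # regressing v+ onto the return shapes phi toward return-relevant features
        value_plus_loss = (v_plus - ret).pow(2).mean()
        # the model is trained to predict phi; phi is detached so this loss
        # does not drag the features toward what is merely easy to predict
        model_loss = (phi_hat - phi.detach()).pow(2).mean()
        value_loss = (v - ret).pow(2).mean()
        return value_plus_loss + model_loss + value_loss

# Example usage with random data: batch of 32 states, flattened future
# summaries, and Monte Carlo returns (all shapes are illustrative).
net = HindsightValue(state_dim=16, future_dim=48)
loss = net.losses(torch.randn(32, 16), torch.randn(32, 48), torch.randn(32))
loss.backward()
```

The stop-gradients are the load-bearing design choice in this sketch: the future features are shaped only by how well they explain the return, and the online value function only ever consumes the model's prediction of them, so nothing unavailable at decision time leaks into the deployed value estimate.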
