Value-driven Hindsight Modelling

Value estimation is a critical component of the reinforcement learning (RL) paradigm. The question of how to effectively learn predictors for value from data is one of the major problems studied by the RL community, and different approaches exploit structure in the problem domain in different ways. Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function. In contrast, model-free methods directly leverage the quantity of interest from the future but have to contend with a potentially weak scalar signal (an estimate of the return). In this paper we develop an approach for representation learning in RL that sits in between these two extremes: we propose to learn what to model in a way that can directly help value prediction. To this end we determine which features of the future trajectory provide useful information to predict the associated return. This provides us with tractable prediction targets that are directly relevant for a task, and can thus accelerate learning of the value function. The idea can be understood as reasoning, in hindsight, about which aspects of the future observations could help past value prediction. We show how this can help dramatically even in simple policy evaluation settings. We then test our approach at scale in challenging domains, including on 57 Atari 2600 games.
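As a rough illustration of the idea in the abstract, the sketch below computes three squared-error losses on a batch of data: one for a value estimate that is allowed to see features of the future trajectory (hindsight), one for a model that predicts those features from the current state alone, and one for a value estimate that uses only the predicted features. The function name `hindsight_losses`, the weight names, the `tanh` feature maps, and the exact loss forms are assumptions made for this example; the abstract does not specify them, and this is not presented as the authors' implementation.

```python
import numpy as np


def hindsight_losses(s, tau_future, G, W_phi, w_plus, W_model, w_v):
    """Forward pass of a toy hindsight-modelling setup (illustrative only).

    s          : (B, d) current-state features
    tau_future : (B, f) summary of the future trajectory (available in hindsight)
    G          : (B,)   observed returns
    W_phi      : (f, k) maps the future summary to k hindsight features (assumed form)
    w_plus     : (d + k,) weights of the hindsight value estimate
    W_model    : (d, k) predicts the hindsight features from the current state
    w_v        : (d + k,) weights of the value estimate usable at decision time
    """
    # Features of the future that the hindsight value may use to explain G.
    phi = np.tanh(tau_future @ W_phi)
    v_plus = np.concatenate([s, phi], axis=1) @ w_plus

    # Model: predict those same features without looking at the future.
    phi_hat = np.tanh(s @ W_model)
    v = np.concatenate([s, phi_hat], axis=1) @ w_v

    return {
        "hindsight_value": float(np.mean((v_plus - G) ** 2)),
        "model": float(np.mean(np.sum((phi_hat - phi) ** 2, axis=1))),
        "value": float(np.mean((v - G) ** 2)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, d, f, k = 32, 4, 6, 2  # batch size and dimensions are arbitrary choices
    losses = hindsight_losses(
        s=rng.normal(size=(B, d)),
        tau_future=rng.normal(size=(B, f)),
        G=rng.normal(size=B),
        W_phi=rng.normal(size=(f, k)),
        w_plus=rng.normal(size=d + k),
        W_model=rng.normal(size=(d, k)),
        w_v=rng.normal(size=d + k),
    )
    print(losses)
```

In a training loop one would minimise these losses with a gradient-based optimiser; how gradients are routed between the three terms is a design choice that this sketch deliberately leaves open.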

Doina Precup | David Silver | Nicolas Heess | Fabio Viola | Lars Buesing | Arthur Guez | Steven Kapturowski | Théophane Weber
