γ-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction

We introduce the $\gamma$-model, a predictive model of environment dynamics with an infinite probabilistic horizon. Replacing standard single-step models with $\gamma$-models leads to generalizations of the procedures that form the foundation of model-based control, including the model rollout and model-based value estimation. The $\gamma$-model, trained with a generative reinterpretation of temporal difference learning, is a natural continuous analogue of the successor representation and a hybrid between model-free and model-based mechanisms. Like a value function, it contains information about the long-term future; like a standard predictive model, it is independent of task reward. We instantiate the $\gamma$-model as both a generative adversarial network and normalizing flow, discuss how its training reflects an inescapable tradeoff between training-time and testing-time compounding errors, and empirically investigate its utility for prediction and control.
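To make the "generative reinterpretation of temporal difference learning" concrete, the sketch below shows one way such an update could look when the $\gamma$-model is instantiated as a conditional GAN: the bootstrap target mixes the observed single-step successor with samples drawn from the model itself, weighted by $\gamma$, mirroring a Bellman-style recursion over state occupancies. This is a minimal illustration, not the authors' implementation; the network sizes, the vanilla GAN objective, and the toy data are hypothetical placeholders, and the paper also considers a normalizing-flow instantiation.

```python
# Minimal sketch (not the authors' code) of a generative TD update for a gamma-model,
# instantiated as a conditional GAN. All dimensions and data below are hypothetical.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NOISE_DIM, GAMMA = 4, 2, 8, 0.95

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

# Generator mu(s_e | s, a): samples a "future" state given noise and a (state, action) pair.
generator = mlp(STATE_DIM + ACTION_DIM + NOISE_DIM, STATE_DIM)
# Discriminator scores (s, a, s_e) triples.
discriminator = mlp(STATE_DIM + ACTION_DIM + STATE_DIM, 1)
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def sample_model(s, a):
    z = torch.randn(s.shape[0], NOISE_DIM)
    return generator(torch.cat([s, a, z], dim=-1))

def td_target(s_next, a_next):
    # Generative TD bootstrap: with probability (1 - gamma) the target future state is
    # the observed next state s'; with probability gamma it is a sample from the current
    # gamma-model conditioned on (s', a').
    with torch.no_grad():
        bootstrap = sample_model(s_next, a_next)
    mask = (torch.rand(s_next.shape[0], 1) < GAMMA).float()
    return mask * bootstrap + (1.0 - mask) * s_next

def update(s, a, s_next, a_next):
    target = td_target(s_next, a_next)          # "real" samples for the discriminator
    fake = sample_model(s, a)                   # samples from the current gamma-model
    real_logit = discriminator(torch.cat([s, a, target], dim=-1))
    fake_logit = discriminator(torch.cat([s, a, fake.detach()], dim=-1))
    d_loss = nn.functional.binary_cross_entropy_with_logits(
        real_logit, torch.ones_like(real_logit)
    ) + nn.functional.binary_cross_entropy_with_logits(
        fake_logit, torch.zeros_like(fake_logit)
    )
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    gen_logit = discriminator(torch.cat([s, a, sample_model(s, a)], dim=-1))
    g_loss = nn.functional.binary_cross_entropy_with_logits(
        gen_logit, torch.ones_like(gen_logit)
    )
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

# Toy usage with random tensors; in practice (s, a, s') come from replay data
# and a' is drawn from the policy whose occupancy is being modeled.
batch = [torch.randn(32, d) for d in (STATE_DIM, ACTION_DIM, STATE_DIM, ACTION_DIM)]
print(update(*batch))
```

Because the target distribution contains the model's own samples, the update trades compounding error at training time (through bootstrapping) against compounding error at test time (through long autoregressive rollouts), which is the tradeoff the abstract refers to.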
