γ-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction

We introduce the $\gamma$-model, a predictive model of environment dynamics with an infinite probabilistic horizon. Replacing standard single-step models with $\gamma$-models leads to generalizations of the procedures that form the foundation of model-based control, including the model rollout and model-based value estimation. The $\gamma$-model, trained with a generative reinterpretation of temporal difference learning, is a natural continuous analogue of the successor representation and a hybrid between model-free and model-based mechanisms. Like a value function, it contains information about the long-term future; like a standard predictive model, it is independent of task reward. We instantiate the $\gamma$-model as both a generative adversarial network and normalizing flow, discuss how its training reflects an inescapable tradeoff between training-time and testing-time compounding errors, and empirically investigate its utility for prediction and control.
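To make the "generative reinterpretation of temporal difference learning" concrete, the sketch below shows one way such an update could look when the $\gamma$-model is instantiated as a conditional GAN: the bootstrap target mixes the observed single-step successor with samples drawn from the model itself, weighted by $\gamma$, mirroring a Bellman-style recursion over state occupancies. This is a minimal illustration, not the authors' implementation; the network sizes, the vanilla GAN objective, and the toy data are hypothetical placeholders, and the paper also considers a normalizing-flow instantiation.

```python
# Minimal sketch (not the authors' code) of a generative TD update for a gamma-model,
# instantiated as a conditional GAN. All dimensions and data below are hypothetical.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NOISE_DIM, GAMMA = 4, 2, 8, 0.95

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

# Generator mu(s_e | s, a): samples a "future" state given noise and a (state, action) pair.
generator = mlp(STATE_DIM + ACTION_DIM + NOISE_DIM, STATE_DIM)
# Discriminator scores (s, a, s_e) triples.
discriminator = mlp(STATE_DIM + ACTION_DIM + STATE_DIM, 1)
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def sample_model(s, a):
    z = torch.randn(s.shape[0], NOISE_DIM)
    return generator(torch.cat([s, a, z], dim=-1))

def td_target(s_next, a_next):
    # Generative TD bootstrap: with probability (1 - gamma) the target future state is
    # the observed next state s'; with probability gamma it is a sample from the current
    # gamma-model conditioned on (s', a').
    with torch.no_grad():
        bootstrap = sample_model(s_next, a_next)
    mask = (torch.rand(s_next.shape[0], 1) < GAMMA).float()
    return mask * bootstrap + (1.0 - mask) * s_next

def update(s, a, s_next, a_next):
    target = td_target(s_next, a_next)          # "real" samples for the discriminator
    fake = sample_model(s, a)                   # samples from the current gamma-model
    real_logit = discriminator(torch.cat([s, a, target], dim=-1))
    fake_logit = discriminator(torch.cat([s, a, fake.detach()], dim=-1))
    d_loss = nn.functional.binary_cross_entropy_with_logits(
        real_logit, torch.ones_like(real_logit)
    ) + nn.functional.binary_cross_entropy_with_logits(
        fake_logit, torch.zeros_like(fake_logit)
    )
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    gen_logit = discriminator(torch.cat([s, a, sample_model(s, a)], dim=-1))
    g_loss = nn.functional.binary_cross_entropy_with_logits(
        gen_logit, torch.ones_like(gen_logit)
    )
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

# Toy usage with random tensors; in practice (s, a, s') come from replay data
# and a' is drawn from the policy whose occupancy is being modeled.
batch = [torch.randn(32, d) for d in (STATE_DIM, ACTION_DIM, STATE_DIM, ACTION_DIM)]
print(update(*batch))
```

Because the target distribution contains the model's own samples, the update trades compounding error at training time (through bootstrapping) against compounding error at test time (through long autoregressive rollouts), which is the tradeoff the abstract refers to.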
