γ-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction

We introduce the $\gamma$-model, a predictive model of environment dynamics with an infinite probabilistic horizon. Replacing standard single-step models with $\gamma$-models leads to generalizations of the procedures that form the foundation of model-based control, including the model rollout and model-based value estimation. The $\gamma$-model, trained with a generative reinterpretation of temporal difference learning, is a natural continuous analogue of the successor representation and a hybrid between model-free and model-based mechanisms. Like a value function, it contains information about the long-term future; like a standard predictive model, it is independent of task reward. We instantiate the $\gamma$-model as both a generative adversarial network and a normalizing flow, discuss how its training reflects an inescapable tradeoff between training-time and testing-time compounding errors, and empirically investigate its utility for prediction and control.
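The "generative reinterpretation of temporal difference learning" mentioned above can be made concrete as a bootstrapped sampling procedure: the $\gamma$-model's training target mixes the single-step dynamics (with probability $1-\gamma$) and the model's own predictions one step later (with probability $\gamma$), so that in expectation it matches the geometrically discounted distribution over future states. The following is a minimal Python sketch of that target construction, not the paper's implementation; `step_model`, `gamma_model`, `policy`, and `sample_gamma_model_target` are hypothetical names standing in for a learned single-step dynamics sampler, the current $\gamma$-model generator, and the rollout policy.

```python
import numpy as np

def sample_gamma_model_target(s, a, step_model, gamma_model, policy, gamma=0.99):
    """Sample one training target for the gamma-model via a generative TD bootstrap.

    With probability (1 - gamma) the target is the single-step successor state;
    with probability gamma we bootstrap: take one simulated step, then sample
    from the current gamma-model at the resulting state-action pair.
    """
    s_next = step_model(s, a)            # s' ~ p(. | s, a): single-step dynamics sample
    if np.random.rand() < 1.0 - gamma:
        return s_next                    # terminate the geometric horizon here
    a_next = policy(s_next)              # a' ~ pi(. | s')
    return gamma_model(s_next, a_next)   # bootstrap: sample from mu_theta(. | s', a')
```

Averaging this target over many draws recovers the discounted occupancy over future states, which is why the $\gamma$-model behaves as a continuous-state analogue of the successor representation while remaining a reward-independent predictive model.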
