Value Prediction Network

This paper proposes a novel deep reinforcement learning (RL) architecture, called Value Prediction Network (VPN), which integrates model-free and model-based RL methods into a single neural network. In contrast to typical model-based RL methods, VPN learns a dynamics model whose abstract states are trained to make option-conditional predictions of future values (discounted sums of rewards) rather than of future observations. Our experimental results show that VPN has several advantages over both model-free and model-based baselines in a stochastic environment where careful planning is required but building an accurate observation-prediction model is difficult. Furthermore, VPN outperforms Deep Q-Network (DQN) on several Atari games even with short-lookahead planning, demonstrating its potential as a new way of learning a good state representation.
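
A minimal sketch of the idea follows, assuming a PyTorch implementation and hypothetical module names (encode, value, outcome, transition); it is not the authors' code or their exact planning procedure, only an illustration of option-conditional value prediction with a short-lookahead backup carried out in abstract-state space instead of observation space.

```python
# Illustrative sketch of a VPN-style model (module names are assumptions, not
# the paper's code): an encoder maps an observation to an abstract state, and
# option-conditional core modules predict the immediate reward, a discount
# factor, and the next abstract state, so planning never predicts observations.
import torch
import torch.nn as nn


class ValuePredictionNetwork(nn.Module):
    def __init__(self, obs_dim, state_dim, num_options):
        super().__init__()
        # Encoding module: observation -> abstract state
        self.encode = nn.Sequential(nn.Linear(obs_dim, state_dim), nn.ReLU())
        # Value module: abstract state -> scalar value estimate
        self.value = nn.Linear(state_dim, 1)
        # Outcome module: (abstract state, option) -> reward and discount logit
        self.outcome = nn.Linear(state_dim + num_options, 2)
        # Transition module: (abstract state, option) -> next abstract state
        self.transition = nn.Linear(state_dim + num_options, state_dim)
        self.num_options = num_options

    def core(self, state, option):
        """One option-conditional step in abstract-state space."""
        onehot = torch.nn.functional.one_hot(option, self.num_options).float()
        x = torch.cat([state, onehot], dim=-1)
        reward, discount_logit = self.outcome(x).unbind(-1)
        discount = torch.sigmoid(discount_logit)  # keep the predicted discount in (0, 1)
        next_state = self.transition(x)
        return reward, discount, next_state

    def q_plan(self, obs, depth=2):
        """Short-lookahead Q-values for every option, via depth-step rollouts."""
        state = self.encode(obs)
        return torch.stack(
            [self._plan(state, o, depth) for o in range(self.num_options)], dim=-1
        )

    def _plan(self, state, option, depth):
        opt = torch.full(state.shape[:-1], option, dtype=torch.long)
        reward, discount, next_state = self.core(state, opt)
        if depth == 1:
            future = self.value(next_state).squeeze(-1)
        else:
            # Expand all options at the next abstract state and back up the best one.
            future = torch.stack(
                [self._plan(next_state, o, depth - 1) for o in range(self.num_options)],
                dim=-1,
            ).max(dim=-1).values
        return reward + discount * future
```

Training would regress the predicted rewards, discounts, and values toward targets from experience (e.g., multi-step Q-learning targets); that loop is omitted here since the abstract does not specify it.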
