Universal Value Function Approximators

Value functions are a core component of reinforcement learning systems. The main idea is to construct a single function approximator V(s; θ) that estimates the long-term reward from any state s, using parameters θ. In this paper we introduce universal value function approximators (UVFAs) V(s, g; θ) that generalise not just over states s but also over goals g. We develop an efficient technique for supervised learning of UVFAs, by factoring observed values into separate embedding vectors for state and goal, and then learning a mapping from s and g to these factored embedding vectors. We show how this technique may be incorporated into a reinforcement learning algorithm that updates the UVFA solely from observed rewards. Finally, we demonstrate that a UVFA can successfully generalise to previously unseen goals.
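
As a rough illustration of the factored form described above, here is a minimal sketch of a two-stream approximator in which V(s, g; θ) is modelled as the dot product of a state embedding φ(s) and a goal embedding ψ(g). This is not the paper's exact architecture or training procedure; the PyTorch framing, layer sizes, and names such as TwoStreamUVFA are assumptions made for illustration.

```python
# Minimal sketch (assumed details, not the paper's implementation):
# V(s, g; theta) is approximated by the inner product of a state
# embedding phi(s) and a goal embedding psi(g).
import torch
import torch.nn as nn

class TwoStreamUVFA(nn.Module):
    def __init__(self, state_dim, goal_dim, embed_dim=16):
        super().__init__()
        # phi: maps a state to an embedding vector
        self.phi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, embed_dim))
        # psi: maps a goal to an embedding vector
        self.psi = nn.Sequential(nn.Linear(goal_dim, 64), nn.ReLU(),
                                 nn.Linear(64, embed_dim))

    def forward(self, s, g):
        # Factored value estimate: phi(s) . psi(g)
        return (self.phi(s) * self.psi(g)).sum(dim=-1)

# Usage: regress the factored estimate toward value targets observed
# for (state, goal) pairs, e.g. returns or bootstrapped TD targets.
model = TwoStreamUVFA(state_dim=8, goal_dim=8)
s = torch.randn(32, 8)       # batch of state features
g = torch.randn(32, 8)       # batch of goal features
targets = torch.randn(32)    # placeholder value targets
loss = nn.functional.mse_loss(model(s, g), targets)
loss.backward()
```

Note that the paper's two-stage supervised technique first obtains target embeddings by low-rank factorisation of a table of observed values and then regresses each stream toward its embeddings, whereas the sketch above simply trains both streams end-to-end against value targets.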
