Policy Evaluation Networks

Many reinforcement learning algorithms use value functions to guide the search for better policies. These methods estimate the value of a single policy while generalizing across many states. The core idea of this paper is to flip this convention and estimate the value of many policies for a single set of states. This approach opens up the possibility of performing direct gradient ascent in policy space without seeing any new data. The main challenge for this approach is finding a way to represent complex policies that facilitates learning and generalization. To address this problem, we introduce a scalable, differentiable fingerprinting mechanism that retains the essential policy information in a concise embedding. Our empirical results demonstrate that combining these three elements (a learned Policy Evaluation Network, policy fingerprints, and gradient ascent) can produce policies that outperform those that generated the training data, in a zero-shot manner.
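
To make the two-stage pipeline concrete, the following is a minimal PyTorch sketch. All names, dimensions, and the choice of random probing states are illustrative assumptions rather than the paper's implementation: a policy is fingerprinted by evaluating it on a fixed set of probing states and flattening its outputs, a Policy Evaluation Network maps that fingerprint to a predicted return, and gradient ascent through the frozen network then updates the policy parameters without collecting any new data.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, for illustration only.
STATE_DIM, ACTION_DIM, NUM_PROBES = 4, 2, 16

# A small stochastic policy network.
policy = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.Tanh(),
                       nn.Linear(32, ACTION_DIM), nn.Softmax(dim=-1))

# Policy Evaluation Network: maps a policy fingerprint to a scalar value estimate.
pen = nn.Sequential(nn.Linear(NUM_PROBES * ACTION_DIM, 64), nn.ReLU(),
                    nn.Linear(64, 1))

# A fixed set of probing states (random here; the paper's choice may differ).
probing_states = torch.randn(NUM_PROBES, STATE_DIM)

def fingerprint(pi):
    """Differentiable fingerprint: the policy's outputs on the probing
    states, flattened into a single embedding vector."""
    return pi(probing_states).reshape(-1)

# Phase 1 (sketched only): regress the network's prediction onto observed
# returns of previously logged policies, e.g.
#   loss = (pen(fingerprint(logged_policy)) - observed_return).pow(2)

# Phase 2: zero-shot policy improvement by gradient ascent through the
# frozen Policy Evaluation Network.
for p in pen.parameters():
    p.requires_grad_(False)
policy_optim = torch.optim.Adam(policy.parameters(), lr=1e-2)
for _ in range(100):
    predicted_value = pen(fingerprint(policy))
    policy_optim.zero_grad()
    (-predicted_value).sum().backward()  # ascend the predicted value
    policy_optim.step()
```

Because the fingerprint is a differentiable function of the policy parameters, gradients of the predicted value flow through the frozen evaluation network back into the policy, which is what allows improvement without further environment interaction.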
