Stein Variational Policy Gradient

Policy gradient methods have been successfully applied to many complex reinforcement learning problems, but they suffer from high variance, slow convergence, and inefficient exploration. In this work, we introduce a maximum-entropy policy optimization framework that explicitly encourages parameter exploration, and we show that this framework reduces to a Bayesian inference problem. We then propose Stein variational policy gradient (SVPG), a novel method that combines existing policy gradient estimators with a repulsive functional to generate a set of diverse but well-behaved policies. SVPG is robust to initialization and easy to parallelize. On continuous control problems, we find that implementing SVPG on top of REINFORCE and advantage actor-critic algorithms improves both average return and data efficiency.
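
At the core of SVPG is a Stein variational gradient descent (SVGD) update applied to a population of policy-parameter "particles": each particle follows a kernel-smoothed policy gradient plus a kernel-derived repulsive term that keeps the policies diverse. The sketch below is a minimal NumPy illustration of that update, not the paper's exact implementation; the RBF kernel with median-distance bandwidth, the flat prior (so the log-prior gradient is dropped), and all function names are illustrative assumptions, and the per-particle policy gradients are presumed to come from an existing estimator such as REINFORCE or an advantage actor-critic.

```python
import numpy as np

def rbf_kernel(particles):
    """RBF kernel matrix and its gradients over a set of particles,
    using the median-distance bandwidth heuristic common in SVGD."""
    n = particles.shape[0]
    diffs = particles[:, None, :] - particles[None, :, :]   # (n, n, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)                  # (n, n)
    h = np.median(sq_dists) / np.log(n + 1) + 1e-8          # bandwidth
    K = np.exp(-sq_dists / h)                               # K[j, i] = k(theta_j, theta_i)
    # grad_K[j, i] = d k(theta_j, theta_i) / d theta_j
    grad_K = -(2.0 / h) * diffs * K[:, :, None]
    return K, grad_K

def svpg_update(particles, policy_grads, alpha=10.0, step_size=1e-3):
    """One SVPG step on n policy-parameter particles.

    particles:    (n, d) array, one row per policy's parameter vector
    policy_grads: (n, d) array of ordinary policy-gradient estimates
                  (e.g. from REINFORCE or A2C), computed elsewhere
    alpha:        temperature; larger alpha weights the repulsive
                  exploration term more heavily relative to return
    """
    n = particles.shape[0]
    K, grad_K = rbf_kernel(particles)
    # Driving term: kernel-weighted average of the scaled return gradients
    # (a flat prior over parameters is assumed, so grad log q0 = 0).
    drive = K @ (policy_grads / alpha)
    # Repulsive term: sum over j of grad_{theta_j} k(theta_j, theta_i),
    # which pushes particles apart and yields diverse policies.
    repulse = grad_K.sum(axis=0)
    return particles + step_size * (drive + repulse) / n
```

In use, one would maintain, say, n = 16 policies, collect rollouts under each, form policy_grads from the returns, and call svpg_update once per iteration. As alpha shrinks, the repulsive term becomes negligible relative to the driving term and each particle approximately follows its base policy gradient method; larger alpha trades return for diversity across the population.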
