Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms

Gradient-based policy optimization algorithms suffer from high gradient variance, usually as a result of using Monte Carlo estimates of the Q-value function in the gradient calculation. By replacing this estimate with a function approximator on state-action space, the gradient variance can be reduced significantly. In this paper we present a method for training a Gaussian process to approximate the action-value function, which can then replace the Monte Carlo estimate in the policy gradient evaluation. We also give an iterative formulation of the algorithm, making it better suited to online learning.
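
To make the idea concrete, the sketch below shows one way the replacement could look in code. It is a minimal illustration, not the paper's algorithm: a REINFORCE-style gradient on a toy two-action problem, where the Monte Carlo returns serve only as training targets for a Gaussian process over state-action features (here via scikit-learn's GaussianProcessRegressor), and the GP's posterior mean stands in for Q(s, a) in the gradient. The environment, the linear-softmax policy, and all hyperparameters are illustrative assumptions, and the paper's iterative formulation is replaced by a simple batch refit each iteration.

```python
# Minimal sketch (not the paper's exact algorithm): a REINFORCE-style
# policy gradient where the raw Monte Carlo return is replaced by a
# Gaussian-process estimate of Q(s, a) fitted on observed returns.
# The toy environment, linear-softmax policy, and all hyperparameters
# below are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
N_ACTIONS, STATE_DIM, GAMMA = 2, 2, 0.95

def softmax_policy(theta, s):
    # theta has shape (N_ACTIONS, STATE_DIM); returns action probabilities.
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def rollout(theta, horizon=20):
    # Toy episode: reward favours action 0 exactly when s[0] > 0.
    s = rng.normal(size=STATE_DIM)
    traj = []
    for _ in range(horizon):
        p = softmax_policy(theta, s)
        a = rng.choice(N_ACTIONS, p=p)
        r = 1.0 if (a == 0) == (s[0] > 0) else -1.0
        traj.append((s, a, r))
        s = rng.normal(size=STATE_DIM)
    return traj

def mc_returns(traj):
    # Discounted returns, computed backwards over the trajectory.
    g, out = 0.0, []
    for _, _, r in reversed(traj):
        g = r + GAMMA * g
        out.append(g)
    return list(reversed(out))

theta = np.zeros((N_ACTIONS, STATE_DIM))
for it in range(50):
    # 1. Collect data; the Monte Carlo returns are the noisy GP targets.
    trajs = [rollout(theta) for _ in range(5)]
    X, y = [], []
    for traj in trajs:
        for (s, a, _), g in zip(traj, mc_returns(traj)):
            X.append(np.concatenate([s, np.eye(N_ACTIONS)[a]]))
            y.append(g)
    # 2. Fit a GP on state-action space; its posterior mean replaces
    #    the raw Monte Carlo return in the gradient below.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.5)
    gp.fit(np.array(X), np.array(y))
    # 3. Policy gradient step with the smoothed Q estimate:
    #    grad ~ average over samples of grad_theta log pi(a|s) * Q_hat(s, a).
    grad = np.zeros_like(theta)
    for traj in trajs:
        for s, a, _ in traj:
            x = np.concatenate([s, np.eye(N_ACTIONS)[a]])[None, :]
            q_hat = gp.predict(x)[0]
            p = softmax_policy(theta, s)
            dlogp = -np.outer(p, s)   # softmax score function:
            dlogp[a] += s             # (1{k=a} - p_k) * s for each row k
            grad += dlogp * q_hat
    theta += 0.01 * grad / sum(len(t) for t in trajs)
```

Refitting the GP in batch each iteration keeps the sketch short; an online variant in the spirit of the paper would instead update a sparse GP posterior incrementally as each new (s, a, return) sample arrives.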
