Bayesian Policy Gradient and Actor-Critic Algorithms

Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following an estimate of the gradient of a performance measure. Most conventional policy gradient methods estimate this gradient with Monte-Carlo techniques and improve the policy by adjusting its parameters in the direction of the estimate. Because Monte-Carlo estimates tend to have high variance, a large number of samples is required to attain accurate gradient estimates, resulting in slow convergence. In this paper, we first propose a Bayesian framework for policy gradient, based on modeling the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient, as well as a measure of the uncertainty in the gradient estimates, namely the gradient covariance, are provided at little extra cost. Since the proposed Bayesian framework takes entire system trajectories as its basic observable unit, it does not require the dynamics within trajectories to have any particular form, and thus can be easily extended to partially observable problems. On the downside, it cannot take advantage of the Markov property when the system is Markovian. To address this issue, we proceed to supplement our Bayesian policy gradient framework with a new actor-critic learning model whose critic is drawn from a Bayesian class of non-parametric critics based on Gaussian process temporal difference (GPTD) learning. Such critics model the action-value function as a Gaussian process, allowing Bayes' rule to be used to compute the posterior distribution over action-value functions, conditioned on the observed data. Appropriate choices of the policy parameterization and of the prior covariance (kernel) between action-values allow us to obtain closed-form expressions for the posterior distribution of the gradient of the expected return with respect to the policy parameters. We perform detailed experimental comparisons of the proposed Bayesian policy gradient and actor-critic algorithms with classic Monte-Carlo based policy gradient methods, as well as with each other, on a number of reinforcement learning problems.
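
The following is a minimal sketch of the Bayesian-quadrature idea that motivates the Bayesian policy gradient framework, applied to a toy one-step problem: the score-weighted return is modeled with a Gaussian process, and its integral against the policy density is estimated in closed form, in the spirit of Bayes-Hermite quadrature and Bayesian Monte Carlo. The reward function, Gaussian policy, squared-exponential kernel, and hyperparameters are illustrative assumptions, not the paper's trajectory-level algorithm or Fisher-kernel choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step problem (assumed for illustration):
# policy pi(a) = N(a; theta, sigma_pi^2), reward peaked at a = 2.
theta = 0.5
sigma_pi = 1.0
reward = lambda a: -(a - 2.0) ** 2

def score(a):
    # d/dtheta log pi(a; theta) for the Gaussian policy
    return (a - theta) / sigma_pi ** 2

# Sample actions from the policy; the integrand is f(a) = R(a) * score(a),
# whose expectation under pi is the policy gradient.
M = 20
A = rng.normal(theta, sigma_pi, size=M)
y = reward(A) * score(A)

# Monte-Carlo estimate of the gradient.
grad_mc = y.mean()

# Bayesian-quadrature estimate: GP prior on f with a squared-exponential
# kernel; the integral of f against the Gaussian policy density then has a
# closed-form posterior mean z^T (K + noise*I)^{-1} y.
ell, sf2, noise = 1.0, 1.0, 1e-4
K = sf2 * np.exp(-0.5 * (A[:, None] - A[None, :]) ** 2 / ell ** 2)
s2 = ell ** 2 + sigma_pi ** 2
z = sf2 * ell / np.sqrt(s2) * np.exp(-0.5 * (A - theta) ** 2 / s2)
grad_bq = z @ np.linalg.solve(K + noise * np.eye(M), y)

print(f"Monte-Carlo gradient estimate:       {grad_mc:+.3f}")
print(f"Bayesian-quadrature gradient estimate: {grad_bq:+.3f}")
```

With these assumed settings the true gradient is 2(2 - theta) = 3.0, so the two estimates can be compared directly; the Bayesian estimate typically needs fewer samples to get close, which is the variance-reduction effect the abstract refers to.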
