Paper information - Bayesian actor-critic algorithms

We present a new actor-critic learning model that uses a Bayesian class of non-parametric critics based on Gaussian process temporal difference learning. Such critics model the state-action value function as a Gaussian process, which allows Bayes' rule to be used to compute the posterior distribution over state-action value functions, conditioned on the observed data. With appropriate choices of the prior covariance (kernel) between state-action values and of the parametrization of the policy, we obtain closed-form expressions for the posterior distribution of the gradient of the average discounted return with respect to the policy parameters. The posterior mean, which serves as our estimate of the policy gradient, is used to update the policy, while the posterior covariance allows us to gauge the reliability of the update.
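The reasoning in the abstract can be illustrated with a small sketch: if the values are given a Gaussian process prior and the gradient is a linear functional of those values, then the gradient's posterior is Gaussian too, with mean and variance available in closed form. The code below is only a toy under stated assumptions and is not the paper's algorithm. It uses ordinary Gaussian process regression on direct noisy value observations in place of Gaussian process temporal difference learning, a squared-exponential kernel over scalar inputs, and a hypothetical weight vector `w` standing in for the policy-dependent terms of the gradient. All function names (`rbf`, `solve`, `gp_posterior`, `linear_functional_posterior`) are made up for this illustration.

```python
import math


def rbf(x, y, length=1.0):
    """Squared-exponential kernel between two scalars (assumed prior covariance)."""
    return math.exp(-0.5 * ((x - y) / length) ** 2)


def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting.

    A is an n x n matrix and B an n x k matrix, both as lists of rows.
    """
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


def gp_posterior(xs, ys, xq, noise=1e-2):
    """Posterior mean and covariance of the GP at query points xq.

    xs, ys: observed inputs and noisy values; noise: observation noise variance.
    """
    n, m = len(xs), len(xq)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    Ks = [[rbf(a, q) for q in xq] for a in xs]  # n x m cross-covariance
    # One solve for both K^{-1} y (first column) and K^{-1} Ks (remaining columns).
    sol = solve(K, [[ys[i]] + Ks[i] for i in range(n)])
    alpha = [row[0] for row in sol]
    V = [row[1:] for row in sol]
    mean = [sum(Ks[i][j] * alpha[i] for i in range(n)) for j in range(m)]
    cov = [[rbf(xq[j], xq[k]) - sum(Ks[i][j] * V[i][k] for i in range(n))
            for k in range(m)] for j in range(m)]
    return mean, cov


def linear_functional_posterior(w, mean, cov):
    """Mean and variance of g = sum_j w[j] * Q(xq[j]) when Q is Gaussian."""
    g_mean = sum(wj * mj for wj, mj in zip(w, mean))
    g_var = sum(w[j] * cov[j][k] * w[k]
                for j in range(len(w)) for k in range(len(w)))
    return g_mean, g_var


if __name__ == "__main__":
    xs, ys = [0.0, 1.0, 2.0], [0.5, 1.0, 0.2]  # made-up observations
    xq, w = [0.5, 1.5], [0.3, -0.7]            # made-up query points and weights
    mean, cov = gp_posterior(xs, ys, xq)
    print(linear_functional_posterior(w, mean, cov))
```

In this toy, `g_mean` plays the role of the gradient estimate and `g_var` the role of its reliability measure. A larger variance would suggest a more cautious policy update.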

Mohammad Ghavamzadeh | Yaakov Engel
