Variational methods for Reinforcement Learning

We consider reinforcement learning as solving a Markov decision process with an unknown transition distribution. Based on interaction with the environment, an estimate of the transition matrix is obtained, from which the optimal decision policy is formed. The classical maximum likelihood point estimate of the transition model does not reflect the uncertainty in the estimate of the transition model, and the resulting policies may consequently lack a sufficient degree of exploration. We consider a Bayesian alternative that maintains a distribution over the transition model, so that the resulting policy takes into account the limited experience of the environment. The resulting algorithm is formally intractable, and we discuss two approximate solution methods: Variational Bayes and Expectation Propagation.
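
To make the Bayesian alternative concrete, the sketch below maintains a Dirichlet posterior over each row of an unknown discrete transition matrix and acts by solving an MDP drawn from that posterior (posterior sampling). This is a minimal illustration of acting under transition uncertainty, not the Variational Bayes or Expectation Propagation procedures discussed here; the state/action sizes, the assumed-known reward matrix, and all function names are hypothetical.

```python
import numpy as np

# Minimal sketch: a Bayesian model of an unknown discrete MDP.
# Each (state, action) row of the transition matrix gets a Dirichlet
# prior, updated from observed (s, a, s') transitions. Acting on a
# model sampled from the posterior lets the policy reflect limited
# experience, unlike a maximum likelihood point estimate.

S, A = 5, 2            # number of states and actions (hypothetical)
gamma = 0.95           # discount factor
alpha0 = 1.0           # symmetric Dirichlet prior pseudo-count

counts = np.full((S, A, S), alpha0)   # Dirichlet parameters per (s, a)
R = np.random.rand(S, A)              # rewards assumed known for this sketch

def update(s, a, s_next):
    """Bayesian update: add one observed transition to the counts."""
    counts[s, a, s_next] += 1.0

def sample_model(rng):
    """Draw one transition matrix from the Dirichlet posterior."""
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a])
    return P

def value_iteration(P, tol=1e-8):
    """Solve the sampled MDP exactly; return the greedy policy."""
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V          # Q[s, a], via batched matmul
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)
        V = V_new

rng = np.random.default_rng(0)
policy = value_iteration(sample_model(rng))   # act, observe, then update()
```

Early on, when the counts are dominated by the prior, sampled models vary widely and the induced policies explore; as experience accumulates, the posterior concentrates and the policy approaches the one computed from the point estimate.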
