Bayesian Policy Learning with Trans-Dimensional MCMC

A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new understanding, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to carry out full Bayesian policy search, without the need for gradients and with a single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulation methods in situations where high reward is a rare event.
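To make the core idea concrete, the following is a minimal sketch of sampling a policy parameter from a distribution proportional to a prior times the expected reward, using plain random-walk Metropolis-Hastings on a toy one-dimensional control problem. It deliberately keeps the dimension fixed and omits the paper's reversible jump (trans-dimensional) moves; the dynamics, reward, and all function names (expected_reward, log_prior, metropolis_hastings) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_reward(theta, horizon=20):
    """Deterministic expected reward of the policy u_t = -theta * x_t on a toy
    1-D linear system, averaged over a fixed grid of initial states.
    (Toy stand-in for a rollout-based reward estimate.)"""
    inits = np.linspace(-2.0, 2.0, 9)
    total = 0.0
    for x in inits:
        r = 0.0
        for _ in range(horizon):
            x = 0.9 * x - theta * x          # closed-loop dynamics
            r += np.exp(-x ** 2)             # positive, bounded per-step reward
        total += r
    return total / len(inits)

def log_prior(theta):
    """Standard Gaussian prior over the scalar policy gain."""
    return -0.5 * theta ** 2

def metropolis_hastings(n_iters=5000, step=0.25):
    """Random-walk Metropolis targeting p(theta) proportional to
    prior(theta) * expected_reward(theta)."""
    theta = 0.0
    log_target = log_prior(theta) + np.log(expected_reward(theta))
    samples = []
    for _ in range(n_iters):
        prop = theta + step * rng.normal()   # symmetric proposal
        log_target_prop = log_prior(prop) + np.log(expected_reward(prop))
        if np.log(rng.uniform()) < log_target_prop - log_target:
            theta, log_target = prop, log_target_prop
        samples.append(theta)
    return np.array(samples)

if __name__ == "__main__":
    chain = metropolis_hastings()
    burn_in = len(chain) // 2
    print("posterior mean policy gain:", chain[burn_in:].mean())
```

Because the chain visits policy parameters in proportion to their reward-weighted posterior mass, high-reward regions are explored without gradient information; the paper's full algorithm additionally uses reversible jump moves so that quantities of varying dimension (e.g. trajectories of different lengths) can be sampled within one chain.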
