Actor-Critic Reinforcement Learning with Energy-Based Policies

We consider reinforcement learning in Markov decision processes with high-dimensional state and action spaces. We parameterize policies using energy-based models (particularly restricted Boltzmann machines) and train them using policy gradient learning. Our approach builds on Sallans and Hinton (2004), who parameterized value functions using energy-based models trained with a non-linear variant of temporal-difference (TD) learning. Unfortunately, non-linear TD is known to diverge in both theory and practice. We introduce the first sound and efficient algorithm for training energy-based policies, based on an actor-critic architecture. Our algorithm is computationally efficient, converges close to a local optimum, and outperforms Sallans and Hinton (2004) in several high-dimensional domains.
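To make the construction concrete, here is a minimal sketch of the kind of energy-based policy the abstract describes: a restricted Boltzmann machine assigns a free energy to each (state, action) pair, the policy is the Boltzmann distribution over actions, and an actor-critic update scales the policy gradient by a TD error. This is our own illustration, not the paper's algorithm; the toy environment, the linear critic, and all names and sizes are assumptions, and the action set is assumed small enough to normalize the policy by enumeration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: 4 binary state bits, 3 one-hot actions, 8 hidden units.
n_state, n_action, n_hidden = 4, 3, 8
W = rng.normal(scale=0.1, size=(n_hidden, n_state + n_action))  # RBM weights
b = np.zeros(n_hidden)                   # hidden biases
c = np.zeros(n_state + n_action)         # visible biases
v_w = np.zeros(n_state)                  # linear critic weights (an assumption)

def action_vec(i):
    a = np.zeros(n_action)
    a[i] = 1.0
    return a

def free_energy(s, a):
    # RBM free energy F(s, a); lower energy means a more preferred action.
    v = np.concatenate([s, a])
    return -c @ v - np.sum(np.logaddexp(0.0, W @ v + b))

def policy(s):
    # Boltzmann policy pi(a|s) proportional to exp(-F(s, a)),
    # normalized by enumerating the (small) discrete action set.
    f = np.array([free_energy(s, action_vec(i)) for i in range(n_action)])
    p = np.exp(-(f - f.min()))
    return p / p.sum()

def grad_log_pi(s, i, pi):
    # grad log pi(a_i|s) = -dF(s, a_i)/dtheta + sum_j pi_j dF(s, a_j)/dtheta
    def neg_dF(a):
        v = np.concatenate([s, a])
        h = 1.0 / (1.0 + np.exp(-(W @ v + b)))   # hidden unit activations
        return np.outer(h, v), h, v              # -dF/dW, -dF/db, -dF/dc
    gs = [neg_dF(action_vec(j)) for j in range(n_action)]
    return tuple(gs[i][k] - sum(p * g[k] for p, g in zip(pi, gs))
                 for k in range(3))

alpha, beta, gamma = 0.05, 0.1, 0.95     # assumed step sizes and discount
s = rng.integers(0, 2, n_state).astype(float)
for step in range(1000):
    pi = policy(s)
    i = rng.choice(n_action, p=pi)
    r = 1.0 if i == int(s.sum()) % n_action else 0.0  # toy reward signal
    s2 = rng.integers(0, 2, n_state).astype(float)    # toy random transition
    delta = r + gamma * (v_w @ s2) - (v_w @ s)        # critic's TD error
    v_w += beta * delta * s                           # critic update (TD(0))
    gW, gb, gc = grad_log_pi(s, i, pi)
    W += alpha * delta * gW                           # actor update: ascend the
    b += alpha * delta * gb                           # policy gradient, weighted
    c += alpha * delta * gc                           # by the TD error
    s = s2
```

For the high-dimensional action spaces the abstract targets, enumerating actions in `policy` is infeasible; there one would instead sample actions from the RBM, for example by Gibbs sampling, as in the Sallans and Hinton line of work.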

[1] David Haussler, et al. Unsupervised learning of distributions on binary vectors using two layer networks, 1991, NIPS.

[2] John N. Tsitsiklis, et al. Analysis of Temporal-Difference Learning with Function Approximation, 1996, NIPS.

[3] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[4] Peter L. Bartlett, et al. Infinite-Horizon Policy-Gradient Estimation, 2001, J. Artif. Intell. Res.

[5] Sham M. Kakade, et al. A Natural Policy Gradient, 2001, NIPS.

[6] Geoffrey E. Hinton, et al. Reinforcement learning for factored Markov decision processes, 2002.

[7] Geoffrey E. Hinton, et al. Exponential Family Harmoniums with an Application to Information Retrieval, 2004, NIPS.

[8] Ronald J. Williams, et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992, Machine Learning.

[9] Geoffrey E. Hinton, et al. Reinforcement Learning with Factored States and Actions, 2004, J. Mach. Learn. Res.

[10] Peter Szabó, et al. Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods, 2005, NIPS.

[11] Fu Jie Huang, et al. A Tutorial on Energy-Based Learning, 2006.

[12] Geoffrey E. Hinton, et al. Restricted Boltzmann machines for collaborative filtering, 2007, ICML.

[13] Gökhan Bakır, et al. Predicting Structured Data, 2008.

[14] Stefan Schaal, et al. Natural Actor-Critic, 2008, Neurocomputing.

[15] Yoshua Bengio, et al. Classification using discriminative restricted Boltzmann machines, 2008, ICML.

[16] Shalabh Bhatnagar, et al. Natural actor-critic algorithms, 2009, Autom.

[17] Geoffrey E. Hinton, et al. Factored conditional restricted Boltzmann machines for modeling motion style, 2009, ICML.

[18] Shalabh Bhatnagar, et al. Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation, 2009, NIPS.

[19] Kenji Doya, et al. Free-Energy Based Reinforcement Learning for Vision-Based Navigation with High-Dimensional Sensory Inputs, 2010, ICONIP.

[20] Geoffrey E. Hinton, et al. Learning to Represent Spatial Transformations with Factored Higher-Order Boltzmann Machines, 2010, Neural Computation.

[21] Junichiro Yoshimoto, et al. Free-energy-based reinforcement learning in a partially observable environment, 2010, ESANN.

[22] Geoffrey E. Hinton, et al. Conditional Restricted Boltzmann Machines for Structured Output Prediction, 2011, UAI.