Actor-Critic Reinforcement Learning with Energy-Based Policies

We consider reinforcement learning in Markov decision processes with high-dimensional state and action spaces. We parametrize policies using energy-based models (particularly restricted Boltzmann machines), and train them using policy gradient learning. Our approach builds upon Sallans and Hinton (2004), who parameterized value functions using energy-based models, trained using a non-linear variant of temporal-difference (TD) learning. Unfortunately, non-linear TD is known to diverge in theory and practice. We introduce the first sound and efficient algorithm for training energy-based policies, based on an actor-critic architecture. Our algorithm is computationally efficient, converges close to a local optimum, and outperforms Sallans and Hinton (2004) in several high-dimensional domains.
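As a rough illustration of what an energy-based policy can look like, the sketch below defines a policy proportional to exp(-F(s, a)), where F is the free energy of a small binary restricted Boltzmann machine over concatenated state and action bits. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function names (`softplus`, `free_energy`, `policy`), the parameter layout (`W`, `b_vis`, `b_hid`), and the exhaustive enumeration of actions are all hypothetical choices made for illustration, and the actor-critic training procedure is not shown.

```python
import math
from itertools import product


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def free_energy(x, W, b_vis, b_hid):
    # Free energy of a binary RBM after summing out the hidden units:
    # F(x) = -b_vis . x - sum_j softplus(b_hid[j] + sum_i W[i][j] * x[i])
    visible_term = sum(b * xi for b, xi in zip(b_vis, x))
    hidden_term = sum(
        softplus(b_hid[j] + sum(W[i][j] * x[i] for i in range(len(x))))
        for j in range(len(b_hid))
    )
    return -visible_term - hidden_term


def policy(state, n_action_bits, W, b_vis, b_hid):
    # pi(a | s) is proportional to exp(-F(s, a)). Actions are enumerated
    # exhaustively (an assumption for illustration); this is only feasible
    # for a handful of action bits.
    actions = list(product((0, 1), repeat=n_action_bits))
    scores = [-free_energy(list(state) + list(a), W, b_vis, b_hid) for a in actions]
    top = max(scores)  # subtract the maximum before exponentiating for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return {a: w / total for a, w in zip(actions, weights)}
```

Calling `policy((1, 0), 2, W, b_vis, b_hid)` with a 4-by-2 weight matrix returns a dictionary mapping each of the four binary action vectors to its probability. In high-dimensional action spaces the normalizer cannot be enumerated this way, so sampling would be needed instead.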

Yee Whye Teh | David Silver | Nicolas Heess

[1] David Haussler, et al. Unsupervised learning of distributions on binary vectors using two layer networks, 1991, NIPS.

[2] John N. Tsitsiklis, et al. Analysis of Temporal-Difference Learning with Function Approximation, 1996, NIPS.

[3] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[4] Peter L. Bartlett, et al. Infinite-Horizon Policy-Gradient Estimation, 2001, J. Artif. Intell. Res.

[5] Sham M. Kakade, et al. A Natural Policy Gradient, 2001, NIPS.

[6] Geoffrey E. Hinton, et al. Reinforcement learning for factored Markov decision processes, 2002.

[7] Geoffrey E. Hinton, et al. Exponential Family Harmoniums with an Application to Information Retrieval, 2004, NIPS.

[8] Ronald J. Williams, et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 2004, Machine Learning.

[9] Geoffrey E. Hinton, et al. Reinforcement Learning with Factored States and Actions, 2004, J. Mach. Learn. Res.

[10] Peter Szabó, et al. Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods, 2005, NIPS.

[11] Fu Jie Huang, et al. A Tutorial on Energy-Based Learning, 2006.

[12] Geoffrey E. Hinton, et al. Restricted Boltzmann machines for collaborative filtering, 2007, ICML '07.

[13] Gökhan Bakır, et al. Predicting Structured Data, 2008.

[14] Stefan Schaal, et al. Natural Actor-Critic, 2003, Neurocomputing.

[15] Yoshua Bengio, et al. Classification using discriminative restricted Boltzmann machines, 2008, ICML '08.

[16] Shalabh Bhatnagar, et al. Natural actor-critic algorithms, 2009, Autom.

[17] Geoffrey E. Hinton, et al. Factored conditional restricted Boltzmann Machines for modeling motion style, 2009, ICML '09.

[18] Shalabh Bhatnagar, et al. Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation, 2009, NIPS.

[19] Kenji Doya, et al. Free-Energy Based Reinforcement Learning for Vision-Based Navigation with High-Dimensional Sensory Inputs, 2010, ICONIP.

[20] Geoffrey E. Hinton, et al. Learning to Represent Spatial Transformations with Factored Higher-Order Boltzmann Machines, 2010, Neural Computation.

[21] Junichiro Yoshimoto, et al. Free-energy-based reinforcement learning in a partially observable environment, 2010, ESANN.

[22] Geoffrey E. Hinton, et al. Conditional Restricted Boltzmann Machines for Structured Output Prediction, 2011, UAI.