论文信息 - Relative Entropy Regularized Policy Iteration

Relative Entropy Regularized Policy Iteration

We present an off-policy actor-critic algorithm for Reinforcement Learning (RL) that combines ideas from gradient-free optimization via stochastic search with learned action-value function. The result is a simple procedure consisting of three steps: i) policy evaluation by estimating a parametric action-value function; ii) policy improvement via the estimation of a local non-parametric policy; and iii) generalization by fitting a parametric policy. Each step can be implemented in different ways, giving rise to several algorithm variants. Our algorithm draws on connections to existing literature on black-box optimization and 'RL as an inference' and it can be seen either as an extension of the Maximum a Posteriori Policy Optimisation algorithm (MPO) [Abdolmaleki et al., 2018a], or as an extension of Trust Region Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) [Abdolmaleki et al., 2017b; Hansen et al., 1997] to a policy iteration scheme. Our comparison on 31 continuous control tasks from parkour suite [Heess et al., 2017], DeepMind control suite [Tassa et al., 2018] and OpenAI Gym [Brockman et al., 2016] with diverse properties, limited amount of compute and a single set of hyperparameters, demonstrate the effectiveness of our method and the state of art results. Videos, summarizing results, can be found at goo.gl/HtvJKR .

[1] Alessandro Lazaric,et al. Analysis of a Classification-based Policy Iteration Algorithm , 2010, ICML.

[2] Sergey Levine,et al. Path integral guided policy search , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[3] Shie Mannor,et al. Regularized Policy Iteration , 2008, NIPS.

[4] Pieter Abbeel,et al. Benchmarking Deep Reinforcement Learning for Continuous Control , 2016, ICML.

[5] Masashi Sugiyama,et al. Guide Actor-Critic for Continuous Control , 2017, ICLR.

[6] Dirk P. Kroese,et al. The Cross Entropy Method: A Unified Approach To Combinatorial Optimization, Monte-carlo Simulation (Information Science and Statistics) , 2004 .

[7] Yasemin Altun,et al. Relative Entropy Policy Search , 2010 .

[8] Marc G. Bellemare,et al. Safe and Efficient Off-Policy Reinforcement Learning , 2016, NIPS.

[9] Bruno Scherrer,et al. Approximate Dynamic Programming Finally Performs Well in the Game of Tetris , 2013, NIPS.

[10] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[11] Sergey Levine,et al. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation , 2018, CoRL.

[12] Sergey Levine,et al. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor , 2018, ICML.

[13] Max Welling,et al. Auto-Encoding Variational Bayes , 2013, ICLR.

[14] Luís Paulo Reis,et al. Model-Based Relative Entropy Stochastic Search , 2016, NIPS.

[15] Martin A. Riedmiller,et al. Learning to Drive a Real Car in 20 Minutes , 2007, 2007 Frontiers in the Convergence of Bioscience and Information Technologies.

[16] Jakub W. Pachocki,et al. Learning dexterous in-hand manipulation , 2018, Int. J. Robotics Res..

[17] Dale Schuurmans,et al. Bridging the Gap Between Value and Policy Based Reinforcement Learning , 2017, NIPS.

[18] Sergey Levine,et al. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[19] Yuval Tassa,et al. Maximum a Posteriori Policy Optimisation , 2018, ICLR.

[20] Jakub W. Pachocki,et al. Emergent Complexity via Multi-Agent Competition , 2017, ICLR.

[21] Gerhard Neumann,et al. Variational Inference for Policy Search in changing situations , 2011, ICML.

[22] Tom Schaul,et al. Fitness Expectation Maximization , 2008, PPSN.

[23] Yuval Tassa,et al. Learning and Transfer of Modulated Locomotor Controllers , 2016, ArXiv.

[24] Yuval Tassa,et al. Emergence of Locomotion Behaviours in Rich Environments , 2017, ArXiv.

[25] Jan Peters,et al. A Survey on Policy Search for Robotics , 2013, Found. Trends Robotics.

[26] Shimon Whiteson,et al. Expected Policy Gradients , 2017, AAAI.

[27] Yishay Mansour,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.