On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

Likelihood ratio policy gradient methods have been among the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using past experience only to estimate the gradient of the expected return U(θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U(θ), and (ii) using only experience collected under the current policy rather than all past experience to improve the estimates. We present a new policy search method that leverages both of these observations, as well as generalized baselines, a new technique that generalizes the baseline techniques commonly used in policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds.
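As a concrete illustration of the connection described above, the following is a minimal sketch of the standard importance sampling derivation, written in LaTeX with notation introduced here for illustration (τ denotes a trajectory, R(τ) its return, and θ0 the parameters under which the m sample trajectories were collected); the paper's own notation and derivation may differ in details.

% Importance sampling estimate of the expected return U(theta), built from
% trajectories tau^(i) sampled under the sampling policy with parameters theta_0.
\[
  \hat{U}(\theta) \;=\; \frac{1}{m} \sum_{i=1}^{m}
    \frac{P(\tau^{(i)};\theta)}{P(\tau^{(i)};\theta_0)}\, R(\tau^{(i)})
\]
% Differentiating with respect to theta and evaluating at theta = theta_0, the
% importance weights equal one and the gradient of each weight reduces to the
% score function, recovering the likelihood ratio (REINFORCE) gradient estimate.
\[
  \nabla_\theta \hat{U}(\theta)\Big|_{\theta=\theta_0}
  \;=\; \frac{1}{m} \sum_{i=1}^{m}
    \nabla_\theta \log P(\tau^{(i)};\theta)\Big|_{\theta=\theta_0}\, R(\tau^{(i)})
\]

Under this view, the likelihood ratio gradient uses the sampled trajectories only to form a local gradient estimate at θ0, whereas the importance sampling estimate Û(θ) is defined for any θ, which is the under-use of past experience that point (i) above refers to.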
