Efficient Sample Reuse in Policy Gradients with Parameter-Based Exploration

The policy gradient approach is a flexible and powerful reinforcement learning method, particularly suited to problems with continuous actions such as robot control. A common challenge is reducing the variance of policy gradient estimates so that policy updates remain reliable. In this letter, we combine the following three ideas to obtain a highly effective policy gradient method: (1) policy gradients with parameter-based exploration, a recently proposed policy search method with low-variance gradient estimates; (2) an importance sampling technique, which allows previously gathered data to be reused in a consistent way; and (3) an optimal baseline, which minimizes the variance of gradient estimates while keeping them unbiased. For the proposed method, we give a theoretical analysis of the variance of the gradient estimates and demonstrate its usefulness through extensive experiments.
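
To make the combination of ideas concrete, the sketch below illustrates one plausible way to assemble an importance-weighted, parameter-based gradient estimate with a variance-reducing baseline for a Gaussian prior over policy parameters. It is a minimal NumPy illustration only: the function names, the Gaussian prior parameterization, and the particular baseline formula (a ratio of weighted second moments) are assumptions made for this sketch, not the paper's exact estimator.

```python
import numpy as np

def log_grad_gaussian(theta, mean, std):
    # Gradient of log N(theta; mean, diag(std^2)) w.r.t. (mean, std), stacked.
    d_mean = (theta - mean) / std**2
    d_std = ((theta - mean)**2 - std**2) / std**3
    return np.concatenate([d_mean, d_std])

def log_gaussian(theta, mean, std):
    # Log-density of a diagonal Gaussian, summed over dimensions.
    return np.sum(-0.5 * np.log(2.0 * np.pi * std**2)
                  - (theta - mean)**2 / (2.0 * std**2))

def iw_pgpe_gradient(thetas, returns, target, behavior):
    """Sketch of an importance-weighted parameter-based gradient estimate.

    thetas   : (N, D) policy parameters sampled from the behavior prior
    returns  : (N,)   return observed for each sampled parameter vector
    target   : (mean, std) of the current (target) Gaussian prior
    behavior : (mean, std) of the Gaussian prior the samples came from
    """
    mean_t, std_t = target
    mean_b, std_b = behavior

    grads = np.array([log_grad_gaussian(th, mean_t, std_t) for th in thetas])
    # Importance weights correct for sampling from the old (behavior) prior.
    weights = np.exp([log_gaussian(th, mean_t, std_t)
                      - log_gaussian(th, mean_b, std_b) for th in thetas])

    # Baseline chosen to reduce the variance of the weighted estimator
    # (assumed form: ratio of weighted second moments).
    sq_norm = np.sum(grads**2, axis=1)
    b = np.sum(weights**2 * returns * sq_norm) / np.sum(weights**2 * sq_norm)

    # Importance-weighted, baseline-corrected gradient estimate.
    return np.mean(weights[:, None] * (returns - b)[:, None] * grads, axis=0)
```

A typical use would draw policy parameters from the behavior prior, evaluate each on the task to obtain returns, and then call `iw_pgpe_gradient` to update the target prior's mean and standard deviation by gradient ascent.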
