Reward-Weighted Regression with Sample Reuse for Direct Policy Search in Reinforcement Learning

Direct policy search is a promising reinforcement learning framework, particularly for controlling continuous, high-dimensional systems. However, policy search often requires a large number of samples to obtain a stable policy-update estimator, which is prohibitive when sampling is costly. In this letter, we extend an expectation-maximization-based policy search method so that previously collected samples can be efficiently reused. The usefulness of the proposed method, reward-weighted regression with sample reuse (R³), is demonstrated through robot learning experiments. (This letter is an extended version of our earlier conference paper: Hachiya, Peters, & Sugiyama, 2009.)
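To make the core idea concrete, below is a minimal sketch (not the authors' code) of one reward-weighted regression update with importance-weighted sample reuse, assuming a linear-Gaussian policy whose mean is fit by weighted least squares. The function name rwr_update, the exponential reward transformation, and the parameter beta are illustrative assumptions; the letter's actual estimator (e.g., its handling of per-episode importance weights) differs in detail.

```python
import numpy as np

def rwr_update(states, actions, rewards, behavior_logp, target_logp_fn, beta=1.0):
    """One illustrative reward-weighted regression (RWR) update with
    importance-weighted sample reuse (hypothetical helper, not the
    letter's exact estimator).

    states        : (N, d) array of state features
    actions       : (N,) array of continuous actions
    rewards       : (N,) array of returns
    behavior_logp : (N,) log-densities log pi_old(a|s) of the policies
                    that originally generated the samples
    target_logp_fn: callable returning log pi_theta(a|s) under the
                    current policy for the given states and actions
    """
    # Importance weights correct for the mismatch between the current
    # policy and the older policies that collected the samples, so
    # previously collected data can be reused instead of discarded.
    iw = np.exp(target_logp_fn(states, actions) - behavior_logp)

    # Reward weights: exp(beta * r) is one common monotone transformation
    # of the return used in EM-based policy search.
    w = iw * np.exp(beta * rewards)

    # M-step: weighted least squares fit of the policy mean,
    # theta = (X^T W X)^{-1} X^T W a.
    XtW = states.T * w            # (d, N); scales column i by w[i]
    theta = np.linalg.solve(XtW @ states, XtW @ actions)
    return theta
```

Iterating this weighted regression while recomputing the importance weights after each policy update gives the EM-style loop the letter builds on; in practice the raw importance weights can have high variance, which motivates variance-reduction refinements not shown in this sketch.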

[1] Masashi Sugiyama, et al. Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error, 2006, J. Mach. Learn. Res.

[2] Sham M. Kakade, et al. A Natural Policy Gradient, 2001, NIPS.

[3] Stefan Schaal, et al. Reinforcement learning by reward-weighted regression for operational space control, 2007, ICML.

[4] Doina Precup, et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.

[5] S. Vijayakumar, et al. Competitive-Cooperative-Concurrent Reinforcement Learning with Importance Sampling, 2004.

[6] Motoaki Kawanabe, et al. Trading Variance Reduction with Unbiasedness: The Regularized Subspace Information Criterion for Robust Model Selection in Kernel Regression, 2004, Neural Computation.

[7] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.

[8] Radford M. Neal. Pattern Recognition and Machine Learning, 2007, Technometrics.

[9] Klaus-Robert Müller, et al. Covariate Shift Adaptation by Importance Weighted Cross Validation, 2007, J. Mach. Learn. Res.

[10] Masashi Sugiyama, et al. Adaptive importance sampling for value function approximation in off-policy reinforcement learning, 2009, Neural Networks.

[11] Christian R. Shelton, et al. Policy Improvement for POMDPs Using Normalized Importance Sampling, 2001, UAI.

[12] Masashi Sugiyama, et al. Input-dependent estimation of generalization error under covariate shift, 2005.

[13] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function, 2000.

[14] Leonid Peshkin, et al. Learning from Scarce Experience, 2002, ICML.

[15] Jeff G. Schneider, et al. Policy Search by Dynamic Programming, 2003, NIPS.

[16] Pawel Wawrzynski, et al. Real-time reinforcement learning by sequential Actor-Critics and experience replay, 2009, Neural Networks.

[17] Kenji Doya. Reinforcement Learning in Continuous Time and Space, 2000, Neural Computation.

[18] Geoffrey E. Hinton, et al. Using Expectation-Maximization for Reinforcement Learning, 1997, Neural Computation.

[19] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion), 1977, Journal of the Royal Statistical Society, Series B.

[20] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[21] Masashi Sugiyama, et al. Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning, 2009, IJCAI.

[22] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[23] Mark W. Spong. The swing up control problem for the Acrobot, 1995.

[24] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992, Machine Learning.

[25] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics), 2006.

[26] Stefan Schaal, et al. Natural Actor-Critic, 2008, Neurocomputing.

[27] Jan Peters, et al. Policy Search for Motor Primitives in Robotics, 2008, NIPS.

[28] Stefan Schaal, et al. Policy Gradient Methods for Robotics, 2006, IEEE/RSJ International Conference on Intelligent Robots and Systems.

[29] Masashi Sugiyama, et al. Efficient Sample Reuse in EM-Based Policy Search, 2009, ECML/PKDD.