Importance sampling policy gradient algorithms in reproducing kernel Hilbert space

Modeling policies in reproducing kernel Hilbert space (RKHS) offers a flexible and powerful family of policy gradient algorithms, called RKHS policy gradient algorithms, designed to optimize over spaces of very high- or infinite-dimensional policies. However, they are known to suffer from large estimation variance. This critical issue arises because the policy update relies on a functional gradient estimate that does not exploit episodes sampled by previous policies. In this paper, we introduce a generalized RKHS policy gradient algorithm that integrates the following ideas: (i) policy modeling in RKHS; (ii) normalized importance sampling, which reduces the estimation variance by reusing previously sampled episodes in a principled way; and (iii) regularization terms, which prevent the updated policy from over-fitting the sampled data. In the experiment section, we analyze the proposed algorithm on benchmark domains. The results show that the proposed algorithm retains the powerful policy modeling of RKHS while achieving better data efficiency.
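To make the normalized importance sampling idea concrete, the following is a minimal sketch of a self-normalized importance-weighted policy gradient estimate that reuses episodes sampled from an older (behaviour) policy. It is an illustration only, not the paper's algorithm: the function name and its array-based interface are hypothetical, and the RKHS functional gradient is stood in for by a generic gradient of the log-policy.

```python
import numpy as np

def normalized_is_policy_gradient(log_probs_new, log_probs_old, returns, grad_log_probs_new):
    """Hypothetical sketch of a normalized importance-sampling policy gradient.

    log_probs_new[i]      : log-probability of episode i under the current policy
    log_probs_old[i]      : log-probability of episode i under the behaviour policy
    returns[i]            : total (discounted) return of episode i
    grad_log_probs_new[i] : gradient of log_probs_new[i] w.r.t. the policy
    """
    # Importance weights w_i = pi_new(tau_i) / pi_old(tau_i), computed in log space.
    weights = np.exp(log_probs_new - log_probs_old)
    # Self-normalization: dividing by the weight sum introduces a small bias
    # but typically reduces the variance of the estimate substantially.
    norm = weights.sum()
    # Weighted REINFORCE-style gradient estimate over the reused episodes.
    grad = (weights[:, None] * returns[:, None] * grad_log_probs_new).sum(axis=0) / norm
    return grad
```

In an on-policy setting all weights equal one and the estimator reduces to the standard Monte Carlo policy gradient; the normalization only matters when old episodes are reused.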
