Policy Search in Reproducing Kernel Hilbert Space

Modeling policies in a reproducing kernel Hilbert space (RKHS) renders policy gradient reinforcement learning algorithms non-parametric. As a result, the policies become very flexible and gain rich representational power without a predefined set of features. However, their performance can be non-covariant under reparameterization of the chosen kernel, or very sensitive to step-size selection. In this paper, we propose a general framework from which we derive a new RKHS policy search technique. The derivation yields both a natural RKHS actor-critic algorithm and an RKHS expectation-maximization (EM) policy search algorithm. Further, we show that kernelization enables learning in partially observable (POMDP) tasks, which are considered daunting for parametric approaches. Via sparsification, we show that a small set of "support vectors" representing the history can be discovered effectively. For evaluation, we use three simulated (PO)MDP reinforcement learning tasks and a simulated PR2 robotic manipulation task. The results demonstrate the effectiveness of the new RKHS policy search framework in comparison to plain RKHS actor-critic, episodic natural actor-critic, plain actor-critic, and PoWER approaches.
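To make the core idea concrete, below is a minimal sketch of a non-parametric RKHS policy with a REINFORCE-style functional gradient update and a crude sparsification step. It is illustrative only: the RBF kernel, Gaussian exploration noise, learning rate, and keep-largest-coefficients pruning rule are all assumptions for this sketch, not the paper's exact algorithm (which uses natural-gradient and EM updates, and a more principled sparsification such as kernel matching pursuit).

```python
import numpy as np

rng = np.random.default_rng(0)

def k(x, y, bw=1.0):
    """RBF kernel (an assumed choice; the RKHS framework is kernel-agnostic)."""
    return np.exp(-0.5 * ((x - y) / bw) ** 2)

class RKHSPolicy:
    """Gaussian policy whose mean lives in an RKHS: h(s) = sum_i a_i k(c_i, s)."""

    def __init__(self, sigma=0.5):
        self.centers = []   # kernel centers ("support vectors")
        self.coefs = []     # RKHS expansion coefficients
        self.sigma = sigma  # exploration noise std-dev

    def mean(self, s):
        return sum(a * k(c, s) for c, a in zip(self.centers, self.coefs))

    def act(self, s):
        return self.mean(s) + self.sigma * rng.standard_normal()

    def functional_gradient_step(self, trajectory, lr=0.1):
        # REINFORCE-style functional gradient: grad_h log pi(a|s) equals
        # ((a - h(s)) / sigma^2) * k(s, .), so each visited state becomes a
        # new kernel center, weighted by the trajectory return G.
        G = sum(r for _, _, r in trajectory)
        new = [(s, lr * G * (a - self.mean(s)) / self.sigma ** 2)
               for s, a, _ in trajectory]
        for s, w in new:
            self.centers.append(s)
            self.coefs.append(w)

    def sparsify(self, max_centers=20):
        # Crude sparsification stand-in: keep the largest-magnitude
        # coefficients, bounding the expansion size between updates.
        if len(self.coefs) <= max_centers:
            return
        idx = np.argsort(np.abs(self.coefs))[-max_centers:]
        self.centers = [self.centers[i] for i in idx]
        self.coefs = [self.coefs[i] for i in idx]
```

Because each gradient step appends new kernel centers, the representation grows with the data rather than with a fixed feature set; the sparsification step is what keeps the expansion tractable over many updates.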
