Policy Search in Reproducing Kernel Hilbert Space

Modeling policies in a reproducing kernel Hilbert space (RKHS) renders policy gradient reinforcement learning algorithms non-parametric. As a result, the policies become very flexible and gain rich representational power without a predefined set of features. However, their performance can be non-covariant under reparameterization of the chosen kernel, or very sensitive to step-size selection. In this paper, we propose a general framework from which we derive a new RKHS policy search technique. The derivation yields both a natural RKHS actor-critic algorithm and an RKHS expectation-maximization (EM) policy search algorithm. Further, we show that kernelization enables learning in partially observable (POMDP) tasks, which are considered daunting for parametric approaches. Via sparsification, we show that a small set of "support vectors" representing the history can be discovered effectively. For evaluation, we use three simulated (PO)MDP reinforcement learning tasks and a simulated PR2 robotic manipulation task. The results demonstrate the effectiveness of the new RKHS policy search framework in comparison to plain RKHS actor-critic, episodic natural actor-critic, plain actor-critic, and PoWER approaches.
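To make the core idea concrete, below is a minimal sketch of a non-parametric RKHS policy with a REINFORCE-style functional gradient update and a crude sparsification step. It is illustrative only: the RBF kernel, Gaussian exploration noise, learning rate, and keep-largest-coefficients pruning rule are all assumptions for this sketch, not the paper's exact algorithm (which uses natural-gradient and EM updates, and a more principled sparsification such as kernel matching pursuit).

```python
import numpy as np

rng = np.random.default_rng(0)

def k(x, y, bw=1.0):
    """RBF kernel (an assumed choice; the RKHS framework is kernel-agnostic)."""
    return np.exp(-0.5 * ((x - y) / bw) ** 2)

class RKHSPolicy:
    """Gaussian policy whose mean lives in an RKHS: h(s) = sum_i a_i k(c_i, s)."""

    def __init__(self, sigma=0.5):
        self.centers = []   # kernel centers ("support vectors")
        self.coefs = []     # RKHS expansion coefficients
        self.sigma = sigma  # exploration noise std-dev

    def mean(self, s):
        return sum(a * k(c, s) for c, a in zip(self.centers, self.coefs))

    def act(self, s):
        return self.mean(s) + self.sigma * rng.standard_normal()

    def functional_gradient_step(self, trajectory, lr=0.1):
        # REINFORCE-style functional gradient: grad_h log pi(a|s) equals
        # ((a - h(s)) / sigma^2) * k(s, .), so each visited state becomes a
        # new kernel center, weighted by the trajectory return G.
        G = sum(r for _, _, r in trajectory)
        new = [(s, lr * G * (a - self.mean(s)) / self.sigma ** 2)
               for s, a, _ in trajectory]
        for s, w in new:
            self.centers.append(s)
            self.coefs.append(w)

    def sparsify(self, max_centers=20):
        # Crude sparsification stand-in: keep the largest-magnitude
        # coefficients, bounding the expansion size between updates.
        if len(self.coefs) <= max_centers:
            return
        idx = np.argsort(np.abs(self.coefs))[-max_centers:]
        self.centers = [self.centers[i] for i in idx]
        self.coefs = [self.coefs[i] for i in idx]
```

Because each gradient step appends new kernel centers, the representation grows with the data rather than with a fixed feature set; the sparsification step is what keeps the expansion tractable over many updates.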
