Online Relative Entropy Policy Search using Reproducing Kernel Hilbert Space Embeddings

Kernel methods have been successfully applied to reinforcement learning problems to address challenges such as high-dimensional and continuous state spaces, value function approximation, and state transition probability modeling. In this paper, we develop an online policy search algorithm based on REPS-RKHS, a recent state-of-the-art algorithm that uses conditional kernel embeddings. Our online algorithm inherits the advantages of REPS-RKHS, including the ability to learn non-parametric control policies for infinite-horizon continuous MDPs with high-dimensional sensory representations. Unlike the original REPS-RKHS algorithm, which is based on batch learning, the proposed algorithm updates the model in an online fashion and is therefore able to capture and respond to rapid changes in the system dynamics. In addition, the online update operation takes constant time (i.e., time independent of the sample size n), which is computationally much more efficient and allows the policy to be revised continuously. Experiments on several domains show that our online algorithm outperforms the original algorithm.
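To make the constant-time claim concrete, the following is a minimal sketch (in Python with NumPy; the paper does not ship code) of one way an online conditional kernel embedding can be updated at a cost independent of the sample size n: a fixed budget of B basis transitions is kept in a ring buffer, so each update touches only one row and one column of a B-by-B Gram matrix. The class name OnlineConditionalEmbedding, the Gaussian kernel, and the budget, bandwidth, and regularization parameters are illustrative assumptions, not details taken from REPS-RKHS.

import numpy as np

class OnlineConditionalEmbedding:
    """Budgeted online estimate of a conditional mean embedding.

    Hypothetical sketch: keeps at most `budget` basis transitions in a
    ring buffer, so per-step cost depends on the budget B, not on the
    total number of observed samples n.
    """

    def __init__(self, dim, budget=50, bandwidth=1.0, reg=1e-3):
        self.B = budget
        self.h = bandwidth                 # Gaussian kernel bandwidth (assumed)
        self.reg = reg                     # ridge regularizer (assumed)
        self.X = np.zeros((budget, dim))   # basis inputs (e.g., states)
        self.Y = np.zeros((budget, dim))   # targets (e.g., successor states)
        self.K = np.zeros((budget, budget))  # Gram matrix over basis inputs
        self.n = 0                           # basis slots filled so far
        self.ptr = 0                         # ring-buffer write position

    def _k(self, A, b):
        # Gaussian kernel between each row of A and a single point b.
        d2 = np.sum((A - b) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.h ** 2))

    def update(self, x, y):
        # Overwrite the oldest basis point and refresh one row/column of
        # the Gram matrix: O(B) kernel evaluations per step, independent of n.
        i = self.ptr
        self.X[i], self.Y[i] = x, y
        m = max(self.n, i + 1)
        k = self._k(self.X[:m], x)
        self.K[i, :m] = k
        self.K[:m, i] = k
        self.ptr = (self.ptr + 1) % self.B
        self.n = min(self.n + 1, self.B)

    def predict(self, x):
        # Embedding-based conditional mean: mu(x) = Y^T (K + reg*I)^{-1} k(X, x).
        # Solving the B-by-B system costs O(B^3), still independent of n.
        m = self.n
        k = self._k(self.X[:m], x)
        alpha = np.linalg.solve(self.K[:m, :m] + self.reg * np.eye(m), k)
        return alpha @ self.Y[:m]

As a usage example, feeding observed transitions (s_t, s_{t+1}) into update() and querying predict(s) yields an estimate of the conditional mean of the successor state under the embedding. Evicting the oldest basis point is the simplest budget policy; a forgetting factor in the spirit of recursive least squares would be a natural alternative when tracking non-stationary dynamics is the priority.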
