Manifold Regularization for Kernelized LSTD

Policy evaluation, i.e., value-function or Q-function approximation, is a key procedure in reinforcement learning (RL). It is a necessary component of policy iteration and can be used for variance reduction in policy gradient methods, so its quality has a significant impact on most RL algorithms. Motivated by manifold regularized learning, we propose a novel kernelized policy evaluation method that exploits the intrinsic geometry of the state space, learned from data, to achieve better sample efficiency and higher accuracy in Q-function approximation. Applying the proposed method within the Least-Squares Policy Iteration (LSPI) framework, we observe superior policy quality on two standard benchmarks compared to widely used parametric basis functions.
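The abstract does not spell out the estimator, but a common way to combine kernelized LSTD with manifold regularization (in the spirit of Belkin et al.'s framework) is to represent the value function as a kernel expansion over the sampled states and to add a graph-Laplacian penalty on the fitted values alongside the usual RKHS penalty. The sketch below is a minimal illustration under those assumptions; the Gaussian kernel, the kNN graph construction, the regularization weights, and the use of a state-value function V rather than a Q-function are choices made here for brevity, not details taken from the paper.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between rows of X and rows of Y."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-d2 / (2.0 * bandwidth**2))

def knn_graph_laplacian(X, k=5):
    """Unnormalized graph Laplacian of a symmetrized kNN graph over the samples."""
    n = X.shape[0]
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]   # skip the point itself
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                  # symmetrize the adjacency
    return np.diag(W.sum(1)) - W

def manifold_kernel_lstd(S, R, S_next, gamma=0.99, lam_rkhs=1e-3, lam_manifold=1e-2):
    """
    One plausible manifold-regularized kernelized LSTD estimator (an assumption,
    not the paper's exact formulation): V(s) = sum_i alpha_i k(s, s_i), fit from
    the LSTD fixed-point equations with an RKHS penalty and a graph-Laplacian
    penalty on the fitted values at the sampled states.
    """
    K = rbf_kernel(S, S)            # k(s_i, s_j)
    K_next = rbf_kernel(S_next, S)  # k(s_i', s_j)
    L = knn_graph_laplacian(S)
    n = S.shape[0]
    A = K @ (K - gamma * K_next) + lam_rkhs * K + lam_manifold * K @ L @ K
    alpha = np.linalg.solve(A + 1e-8 * np.eye(n), K @ R)
    return alpha

# Toy usage on a random batch of transitions (hypothetical data).
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 2))
S_next = S + 0.1 * rng.normal(size=(200, 2))
R = -np.linalg.norm(S, axis=1)
alpha = manifold_kernel_lstd(S, R, S_next)
V = rbf_kernel(S, S) @ alpha        # value estimates at the sampled states
```

Extending such a sketch to Q-functions typically amounts to defining the kernel on state-action pairs and evaluating it at (s', pi(s')) for the next-state term, which is how the estimator would be plugged into LSPI's policy evaluation step.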
