Continuous-time Value Function Approximation in Reproducing Kernel Hilbert Spaces

Motivated by the success of reinforcement learning (RL) on discrete-time tasks such as AlphaGo and Atari games, there has been a recent surge of interest in using RL for the continuous-time control of physical systems (cf. the many challenging tasks in OpenAI Gym and the DeepMind Control Suite). Since time discretization is itself a source of error, it is methodologically preferable to handle the system dynamics directly in continuous time. However, very few techniques exist for continuous-time RL, and those that do lack flexibility in value function approximation. In this paper, we propose a novel framework for model-based continuous-time value function approximation in reproducing kernel Hilbert spaces (RKHSs). The resulting framework is flexible enough to accommodate any kernel-based method, such as Gaussian processes and kernel adaptive filters, and it allows uncertainties and nonstationarity to be handled without prior knowledge of the environment or of which basis functions to employ. We demonstrate the validity of the proposed framework through experiments.
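To make the idea concrete, the sketch below shows one simple instance of kernel-based continuous-time value function approximation: the value function is represented as a kernel expansion in a Gaussian RKHS, and its coefficients are fit by regularized least squares on the continuous-time Bellman (Hamilton-Jacobi-Bellman) residual. The toy dynamics f, reward r, discount rate beta, and kernel width sigma are all illustrative assumptions for a deterministic one-dimensional system, not the paper's actual algorithm or experimental setup.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's algorithm): kernel-based
# value function approximation for the deterministic 1-D system
#   x_dot = f(x) = -x,  reward r(x) = -x^2,  discount rate beta,
# whose value function is known in closed form: V(x) = -x^2 / (beta + 2).
beta, sigma, lam = 1.0, 0.5, 1e-6   # discount rate, kernel width, ridge term
f = lambda x: -x                    # toy dynamics (assumed known: model-based)
r = lambda x: -x ** 2               # toy reward

X = np.linspace(-2.0, 2.0, 50)      # sampled states, used as kernel centers


def kern(x, c):
    """Gaussian kernel k(x, c) and its derivative w.r.t. the first argument."""
    g = np.exp(-(x - c) ** 2 / (2 * sigma ** 2))
    return g, -(x - c) / sigma ** 2 * g


# Represent V(x) = sum_i alpha_i k(x, x_i). In continuous time the Bellman
# (HJB) equation reads  beta * V(x) = r(x) + V'(x) * f(x),  so the residual
#   beta * V(x_j) - f(x_j) * V'(x_j) - r(x_j)
# is linear in alpha, and the coefficients follow from regularized least squares.
K, dK = kern(X[:, None], X[None, :])
A = beta * K - f(X)[:, None] * dK
alpha = np.linalg.solve(A.T @ A + lam * np.eye(len(X)), A.T @ r(X))

# Compare the kernel estimate with the analytic value function.
Xt = np.linspace(-1.5, 1.5, 7)
V_hat = kern(Xt[:, None], X[None, :])[0] @ alpha
V_true = -Xt ** 2 / (beta + 2)
print(np.max(np.abs(V_hat - V_true)))  # should be close to zero
```

Because the HJB residual is linear in the expansion coefficients, the fit reduces to a single linear solve; in the same spirit, the residual could instead drive an online kernel adaptive filter or be modeled with a Gaussian process to quantify uncertainty, which is the kind of flexibility the proposed framework targets.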
