Kernel-Based Least Squares Policy Iteration for Reinforcement Learning

In this paper, we present a kernel-based least squares policy iteration (KLSPI) algorithm for reinforcement learning (RL) in large or continuous state spaces, which can be used to realize adaptive feedback control of uncertain dynamic systems. With KLSPI, near-optimal control policies can be obtained with little a priori knowledge of the dynamic models of the controlled plants. In KLSPI, Mercer kernels are used in the policy evaluation step of a policy iteration process, where a new kernel-based least squares temporal-difference algorithm, KLSTD-Q, is proposed for efficient policy evaluation. To preserve the sparsity and improve the generalization ability of the KLSTD-Q solutions, a kernel sparsification procedure based on approximate linear dependency (ALD) is performed. Compared with previous work on approximate RL methods, KLSPI makes two advances that address the main difficulties of existing approaches. The first is improved convergence and a (near-)optimality guarantee, obtained by using KLSTD-Q for high-precision policy evaluation. The second is automatic feature selection via the ALD-based kernel sparsification. The KLSPI algorithm therefore provides a general RL method with good generalization performance and a convergence guarantee for large-scale Markov decision processes (MDPs). Experimental results on a typical RL benchmark, a stochastic chain problem, demonstrate that KLSPI consistently achieves better learning efficiency and policy quality than the earlier least squares policy iteration (LSPI) algorithm. The KLSPI method was also evaluated on two nonlinear feedback control problems: ship heading control and the swing-up control of a double-link underactuated pendulum (the acrobot). Simulation results show that the proposed method can optimize controller performance using little a priori information about the uncertain dynamics. It is also demonstrated that KLSPI can be applied to online learning control by incorporating an initial controller to ensure acceptable online performance.
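The kernel sparsification step mentioned above follows the approximate linear dependency (ALD) test of Engel et al.: a new sample enters the kernel dictionary only if its feature-space image cannot be approximated, up to a threshold, by a linear combination of the images of the samples already stored. Below is a minimal sketch of that test in Python, assuming a Gaussian kernel; the identifiers (`gaussian_kernel`, `ald_sparsify`, `threshold`) are illustrative, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(x, y, width=1.0):
    # The Gaussian/RBF kernel is an assumed choice; any Mercer kernel works here.
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * width ** 2))

def ald_sparsify(samples, kernel, threshold=1e-3):
    """Select a sparse dictionary with the approximate linear dependency test.

    A sample x is admitted only if
        delta = k(x, x) - k_t^T K^{-1} k_t > threshold,
    i.e. only if phi(x) is not (nearly) a linear combination of the
    feature-space images of the samples already in the dictionary.
    """
    dictionary, K_inv = [], None
    for x in samples:
        if not dictionary:
            dictionary.append(x)
            K_inv = np.array([[1.0 / kernel(x, x)]])
            continue
        k_t = np.array([kernel(z, x) for z in dictionary])
        c = K_inv @ k_t                  # best linear-combination coefficients
        delta = kernel(x, x) - k_t @ c   # ALD residual
        if delta > threshold:
            # Admit x and grow K^{-1} via the block-matrix inverse formula.
            n = len(dictionary)
            new_inv = np.zeros((n + 1, n + 1))
            new_inv[:n, :n] = K_inv + np.outer(c, c) / delta
            new_inv[:n, n] = new_inv[n, :n] = -c / delta
            new_inv[n, n] = 1.0 / delta
            K_inv = new_inv
            dictionary.append(x)
    return dictionary
```

The threshold trades sparsity against approximation precision: larger values yield smaller dictionaries and cheaper policy evaluation.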
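For policy evaluation, KLSTD-Q approximates the state-action value function as a kernel expansion over the dictionary, Q(s, a) ≈ Σᵢ αᵢ k((s, a), zᵢ), and solves a least-squares fixed-point equation for the coefficients; LSPI-style greedy improvement then closes the policy-iteration loop. The sketch below is one plausible rendering of that loop built on `ald_sparsify` above, not the authors' exact implementation; the ridge term and all names are our assumptions.

```python
import numpy as np

def klstd_q(transitions, dictionary, kernel, policy, gamma=0.95, ridge=1e-6):
    """Kernel least-squares TD-Q policy evaluation (illustrative form).

    transitions: iterable of (s, a, r, s_next) samples.
    dictionary:  state-action pairs z_i chosen by the ALD test.
    Solves A alpha = b so that Q(s, a) ~= sum_i alpha_i * k((s, a), z_i)
    satisfies the projected Bellman fixed point for the given policy.
    """
    n = len(dictionary)
    A = ridge * np.eye(n)  # small regularizer (our assumption) keeps the solve stable
    b = np.zeros(n)
    for s, a, r, s_next in transitions:
        k_t = np.array([kernel((s, a), z) for z in dictionary])
        k_next = np.array([kernel((s_next, policy(s_next)), z) for z in dictionary])
        A += np.outer(k_t, k_t - gamma * k_next)  # LSTD-style accumulation
        b += k_t * r
    return np.linalg.solve(A, b)

def kernel_lspi(transitions, dictionary, kernel, actions, n_iters=20, gamma=0.95):
    """Policy iteration: evaluate with klstd_q, improve greedily, repeat."""
    alpha = np.zeros(len(dictionary))

    def greedy_policy(s):
        # Greedy improvement over the current kernel Q-function.
        return max(actions, key=lambda a: sum(
            c * kernel((s, a), z) for c, z in zip(alpha, dictionary)))

    for _ in range(n_iters):
        alpha = klstd_q(transitions, dictionary, kernel, greedy_policy, gamma)
    return greedy_policy
```

A fixed iteration count stands in for a convergence test here; in practice one would stop when successive coefficient vectors, or the policies they induce, stop changing.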
