Reordering Sparsification of Kernel Machines in Approximate Policy Iteration

Approximate policy iteration (API) methods, including least-squares policy iteration (LSPI) and its kernelized version (KLSPI), have received increasing attention due to their good convergence and generalization abilities in solving difficult reinforcement learning problems. However, the sparsification of feature vectors, especially kernel-based features, greatly influences the performance of API methods. In this paper, a novel reordering sparsification method is proposed for kernel machines in API. The method adopts a greedy strategy that, at each step, adds to the kernel dictionary the sample with the maximal squared approximation error, so that the samples are effectively reordered and the quality of the kernel sparsification is improved. Experimental results on the learning control of an inverted pendulum verify that, at the same sparsity threshold, the proposed algorithm yields a smaller kernel dictionary than the previous sequential sparsification algorithm, and that the performance of the control policies learned by KLSPI is also improved.
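To make the greedy selection rule concrete, the sketch below gives one plausible reading of reordering sparsification in Python: at every step the squared approximation error of each remaining sample against the span of the current dictionary (an ALD-style criterion, in the spirit of kernel RLS) is computed, and the sample with the maximal error is added until no error exceeds a threshold. This is a minimal illustration, not the paper's actual implementation; the kernel, the threshold `mu`, and all function names are assumptions.

```python
import numpy as np

def rbf_kernel(x, y, width=1.0):
    """Gaussian (RBF) kernel; the width value is an arbitrary choice."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * width ** 2))

def reordering_sparsification(samples, kernel=rbf_kernel, mu=0.1):
    """Greedy reordering sparsification (hypothetical sketch).

    At each step, the sample whose feature vector has the largest
    squared approximation error with respect to the span of the
    current dictionary is added, until no remaining sample's error
    exceeds the threshold mu.
    """
    n = len(samples)
    # Precompute the full Gram matrix over all candidate samples.
    K = np.array([[kernel(xi, xj) for xj in samples] for xi in samples])

    dictionary = []                  # indices of selected dictionary samples
    remaining = list(range(n))

    while remaining:
        errors = []
        for i in remaining:
            if not dictionary:
                # Empty dictionary: the error is the self-similarity k(x_i, x_i).
                delta = K[i, i]
            else:
                k_vec = K[np.ix_(dictionary, [i])].ravel()
                K_dd = K[np.ix_(dictionary, dictionary)]
                # Squared error of the best least-squares approximation of
                # phi(x_i) in the span of the dictionary features; a tiny
                # ridge keeps the solve numerically stable.
                coeffs = np.linalg.solve(
                    K_dd + 1e-9 * np.eye(len(dictionary)), k_vec)
                delta = K[i, i] - k_vec @ coeffs
            errors.append(delta)
        best = int(np.argmax(errors))
        if errors[best] <= mu:
            break                    # every remaining sample is well approximated
        dictionary.append(remaining.pop(best))

    return [samples[i] for i in dictionary]

# Example: sparsify state samples such as (angle, angular velocity) pairs
# from an inverted-pendulum task; the data here is synthetic.
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(200, 2))
dict_samples = reordering_sparsification(states, rbf_kernel, mu=0.3)
print(f"dictionary size: {len(dict_samples)} of {len(states)} samples")
```

A sequential sparsification scheme would instead scan the samples in their given order and test each against the dictionary once; the reordering above trades extra computation (one error evaluation per remaining sample per step) for the smaller dictionary reported in the paper.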
