Sparse Proximal Reinforcement Learning via Nested Optimization

We consider the tasks of feature selection and policy evaluation based on linear value function approximation in reinforcement learning problems. High-dimensional feature vectors and a limited number of samples can easily cause over-fitting and expensive computation. To address this problem, $\ell_1$-regularized methods obtain sparse solutions and thus improve generalization performance. We propose an efficient $\ell_1$-regularized recursive least squares-based online algorithm with $O(n^2)$ complexity per time step, termed $\ell_1$-RC. With the help of a nested optimization decomposition, $\ell_1$-RC solves a series of standard optimization problems and avoids directly minimizing the $\ell_1$-regularized mean squared projected Bellman error. Within $\ell_1$-RC, we propose RC with iterative refinement to minimize the operator error, and an alternating direction method of multipliers (ADMM) with a proximal operator to minimize the fixed-point error. The convergence of $\ell_1$-RC is established via the ordinary differential equation method, and several extensions are also given. In the empirical study, several state-of-the-art $\ell_1$-regularized methods are chosen as baselines, and $\ell_1$-RC is tested on both policy evaluation and learning control benchmarks. The empirical results show the effectiveness and advantages of $\ell_1$-RC.
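To illustrate the proximal step mentioned above, the following is a minimal NumPy sketch of a generic ADMM iteration for an $\ell_1$-regularized least-squares subproblem of the form $\min_\theta \tfrac{1}{2}\|A\theta - b\|^2 + \lambda\|\theta\|_1$. The matrix $A$, vector $b$, and all function and parameter names are illustrative assumptions, not the paper's exact fixed-point-error formulation; the $\ell_1$ proximal operator itself reduces to element-wise soft-thresholding.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_l1_least_squares(A, b, lam, rho=1.0, n_iters=200):
    """Generic ADMM for  min_theta 0.5*||A theta - b||^2 + lam*||theta||_1.

    Illustrative sketch only: l1-RC applies an ADMM step with the l1 proximal
    operator to its own fixed-point-error subproblem, not to this exact objective.
    """
    n = A.shape[1]
    theta = np.zeros(n)
    z = np.zeros(n)      # auxiliary variable carrying the l1 term
    u = np.zeros(n)      # scaled dual variable
    AtA = A.T @ A + rho * np.eye(n)   # quadratic system reused at every step
    Atb = A.T @ b
    for _ in range(n_iters):
        theta = np.linalg.solve(AtA, Atb + rho * (z - u))   # quadratic update
        z = soft_threshold(theta + u, lam / rho)            # proximal update
        u = u + theta - z                                    # dual update
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 20))
    true_theta = np.zeros(20)
    true_theta[:3] = [1.0, -2.0, 0.5]
    b = A @ true_theta + 0.01 * rng.standard_normal(50)
    print(admm_l1_least_squares(A, b, lam=0.5))
```

Returning the auxiliary variable z (rather than theta) is a common choice in such sketches because the soft-thresholding step makes it exactly sparse, which is the property the $\ell_1$ regularization is meant to deliver.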
