Sparse Proximal Reinforcement Learning via Nested Optimization

We consider the tasks of feature selection and policy evaluation based on linear value function approximation in reinforcement learning problems. High-dimensional feature vectors and a limited number of samples can easily cause over-fitting and expensive computation. To address this problem, $\ell_1$-regularized methods obtain sparse solutions and thus improve generalization performance. We propose an efficient $\ell_1$-regularized recursive least squares-based online algorithm with $O(n^2)$ complexity per time step, termed $\ell_1$-RC. With the help of a nested optimization decomposition, $\ell_1$-RC solves a series of standard optimization problems and avoids directly minimizing the $\ell_1$-regularized mean squared projected Bellman error. Within $\ell_1$-RC, we propose RC with iterative refinement to minimize the operator error, and an alternating direction method of multipliers (ADMM) with a proximal operator to minimize the fixed-point error. The convergence of $\ell_1$-RC is established via the ordinary differential equation method, and several extensions are also given. In the empirical study, several state-of-the-art $\ell_1$-regularized methods are chosen as baselines, and $\ell_1$-RC is tested on both policy evaluation and learning control benchmarks. The empirical results show the effectiveness and advantages of $\ell_1$-RC.
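To illustrate the proximal step mentioned above, the following is a minimal NumPy sketch of a generic ADMM iteration for an $\ell_1$-regularized least-squares subproblem of the form $\min_\theta \tfrac{1}{2}\|A\theta - b\|^2 + \lambda\|\theta\|_1$. The matrix $A$, vector $b$, and all function and parameter names are illustrative assumptions, not the paper's exact fixed-point-error formulation; the $\ell_1$ proximal operator itself reduces to element-wise soft-thresholding.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_l1_least_squares(A, b, lam, rho=1.0, n_iters=200):
    """Generic ADMM for  min_theta 0.5*||A theta - b||^2 + lam*||theta||_1.

    Illustrative sketch only: l1-RC applies an ADMM step with the l1 proximal
    operator to its own fixed-point-error subproblem, not to this exact objective.
    """
    n = A.shape[1]
    theta = np.zeros(n)
    z = np.zeros(n)      # auxiliary variable carrying the l1 term
    u = np.zeros(n)      # scaled dual variable
    AtA = A.T @ A + rho * np.eye(n)   # quadratic system reused at every step
    Atb = A.T @ b
    for _ in range(n_iters):
        theta = np.linalg.solve(AtA, Atb + rho * (z - u))   # quadratic update
        z = soft_threshold(theta + u, lam / rho)            # proximal update
        u = u + theta - z                                    # dual update
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 20))
    true_theta = np.zeros(20)
    true_theta[:3] = [1.0, -2.0, 0.5]
    b = A @ true_theta + 0.01 * rng.standard_normal(50)
    print(admm_l1_least_squares(A, b, lam=0.5))
```

Returning the auxiliary variable z (rather than theta) is a common choice in such sketches because the soft-thresholding step makes it exactly sparse, which is the property the $\ell_1$ regularization is meant to deliver.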
