PID Accelerated Value Iteration Algorithm

Value Iteration (VI), a fundamental procedure in dynamic programming and reinforcement learning for solving Markov Decision Processes (MDPs), can converge slowly when the discount factor is close to one. We propose modifications to VI in order to potentially accelerate its convergence. The key insight is that the evolution of the value function approximations (V_k) for k ≥ 0 produced by VI can be viewed as a dynamical system. This opens up the possibility of using techniques from control theory to modify, and potentially accelerate, these dynamics. We present such modifications based on simple controllers: PD (Proportional-Derivative), PI (Proportional-Integral), and PID. We derive the error dynamics of these variants of VI and show, provably for certain classes of MDPs and empirically for more general ones, that the convergence rate can be significantly improved. We also propose a gain adaptation mechanism to automatically select the controller gains, and empirically demonstrate its effectiveness.
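To make the idea concrete, below is a minimal sketch of a PID-style variant of VI for tabular policy evaluation. The Bellman residual T V_k − V_k plays the role of the controller's error signal: the proportional term takes a step along this residual, the integral term accumulates a (leaky) running sum of past residuals, and the derivative term uses the difference V_k − V_{k−1}. The function name, the specific update form, and the default gain values (kappa_p, kappa_i, kappa_d, and the integrator parameters beta, alpha) are illustrative assumptions, not the exact parameterization used in the paper.

```python
import numpy as np

def pid_value_iteration(P, r, gamma, kappa_p=1.0, kappa_i=0.0, kappa_d=0.0,
                        beta=0.95, alpha=0.05, num_iters=1000, tol=1e-8):
    """Sketch of PID-style accelerated VI for tabular policy evaluation.

    P: (n, n) transition matrix of the evaluated policy.
    r: (n,) expected immediate rewards.
    gamma: discount factor in [0, 1).
    The gains kappa_p, kappa_i, kappa_d and the integrator parameters
    (beta, alpha) are illustrative; the paper's exact update may differ.
    With kappa_p=1 and kappa_i=kappa_d=0 this reduces to standard VI.
    """
    n = r.shape[0]
    V = np.zeros(n)        # current estimate V_k
    V_prev = np.zeros(n)   # previous estimate V_{k-1} (for the derivative term)
    z = np.zeros(n)        # integrator state (accumulated residuals)

    for _ in range(num_iters):
        TV = r + gamma * P @ V        # Bellman operator applied to V_k
        residual = TV - V             # error signal e_k = T V_k - V_k

        # Leaky accumulation of past residuals (integral term).
        z = beta * z + alpha * residual

        V_next = (V
                  + kappa_p * residual          # proportional term
                  + kappa_i * z                 # integral term
                  + kappa_d * (V - V_prev))     # derivative term

        if np.max(np.abs(residual)) < tol:
            return V_next
        V_prev, V = V, V_next
    return V
```

Setting kappa_p=1 and kappa_i=kappa_d=0 recovers conventional VI, which gives a simple sanity check; nonzero integral and derivative gains (chosen by hand or by a gain adaptation scheme) are where any acceleration would come from.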
