Fundamental Design Principles for Reinforcement Learning Algorithms

Along with the sharp increase in the visibility of the field, new reinforcement learning algorithms are being proposed at an unprecedented rate. While the surge in activity creates excitement and opportunity, there is a gap in understanding of two basic principles that these algorithms need to satisfy for any successful application: one has to do with guarantees of convergence, and the other concerns the rate of convergence. The vast majority of reinforcement learning algorithms belong to the class of learning algorithms known as stochastic approximation (SA). The objective here is to review the foundations of reinforcement learning algorithm design based on recent and ancient results from SA. In particular, it was established in [Borkar and Meyn, 2000] that both stability and convergence of these algorithms follow from the stability of two associated ODEs. Moreover, if the linearized ODE passes a simple eigenvalue test, then an optimal rate of convergence is guaranteed. This chapter surveys these concepts, along with the new class of Zap reinforcement learning algorithms introduced by the authors, which achieve convergence almost universally while also guaranteeing the optimal rate of convergence.
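
To make the two principles concrete, here is a minimal Python sketch (not code from the chapter; the dimension, noise model, and step-size exponent rho are illustrative assumptions) contrasting a vanilla stochastic approximation recursion with a Zap-style matrix-gain recursion on a linear root-finding problem: solve A_bar theta* + b_bar = 0 from noisy observations of (A_bar, b_bar).

```python
# Minimal sketch: vanilla SA vs. a Zap-style (matrix-gain) SA recursion.
# Goal: find theta* with A_bar @ theta* + b_bar = 0, given only noisy samples.
import numpy as np

rng = np.random.default_rng(0)
d = 3
A_bar = -np.eye(d) + 0.3 * rng.standard_normal((d, d))   # assumed Hurwitz mean-field matrix
b_bar = rng.standard_normal(d)
theta_star = np.linalg.solve(A_bar, -b_bar)

def observe():
    """Return a noisy sample (A_n, b_n) with means (A_bar, b_bar)."""
    return (A_bar + 0.5 * rng.standard_normal((d, d)),
            b_bar + 0.5 * rng.standard_normal(d))

def vanilla_sa(n_iter=20000):
    # theta_{n+1} = theta_n + (1/n) * (A_n theta_n + b_n)
    # Associated ODE: d/dt theta = A_bar theta + b_bar.
    theta = np.zeros(d)
    for n in range(1, n_iter + 1):
        A_n, b_n = observe()
        theta = theta + (1.0 / n) * (A_n @ theta + b_n)
    return theta

def zap_sa(n_iter=20000, rho=0.85):
    # Two time scales: A_hat tracks A_bar on the faster step size n^{-rho},
    # and theta is updated with the Newton-Raphson-like gain -A_hat^{-1}.
    theta = np.zeros(d)
    A_hat = -np.eye(d)            # initial gain matrix; assumed to stay invertible
    for n in range(1, n_iter + 1):
        A_n, b_n = observe()
        A_hat = A_hat + (1.0 / n**rho) * (A_n - A_hat)
        theta = theta - (1.0 / n) * np.linalg.solve(A_hat, A_n @ theta + b_n)
    return theta

if __name__ == "__main__":
    for name, est in [("vanilla SA", vanilla_sa()), ("Zap SA", zap_sa())]:
        print(f"{name}: error = {np.linalg.norm(est - theta_star):.4f}")
```

Roughly speaking, the vanilla recursion with step size 1/n attains the optimal O(1/n) mean-square error only when every eigenvalue lambda of A_bar satisfies Re(lambda) < -1/2 (an instance of the eigenvalue test above), whereas the Zap gain drives the linearized dynamics toward -I, so the test is satisfied automatically.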

[1] Dimitri P. Bertsekas, et al. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[2] Amir Dembo, et al. Large Deviations Techniques and Applications, 1998.

[3] Vivek S. Borkar, et al. Learning Algorithms for Markov Decision Processes with Average Cost, 2001, SIAM J. Control. Optim.

[4] H. Robbins. A Stochastic Approximation Method, 1951.

[5] Le Song, et al. SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation, 2017, ICML.

[6] S. Meyn, et al. Computable exponential convergence rates for stochastically ordered Markov processes, 1996.

[7] J. H. Venter. An extension of the Robbins-Monro procedure, 1967.

[8] J. Tsitsiklis, et al. Convergence rate of linear two-time-scale stochastic approximation, 2004, arXiv:math/0405287.

[9] Sean P. Meyn, et al. Zap Q-Learning for Optimal Stopping, 2020, American Control Conference (ACC).

[10] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[11] J. Blum. Multidimensional Stochastic Approximation Methods, 1954.

[12] S. Meyn. Large deviation asymptotics and control variates for simulating large functions, 2006, arXiv:math/0603328.

[13] Csaba Szepesvári, et al. The Asymptotic Convergence-Rate of Q-learning, 1997, NIPS.

[14] S. Meyn, et al. Spectral theory and limit theorems for geometrically ergodic Markov processes, 2002, arXiv:math/0209200.

[15] P. Glynn, et al. Hoeffding's inequality for uniformly ergodic Markov chains, 2002.

[16] Sean P. Meyn, et al. Most likely paths to error when estimating the mean of a reflected random walk, 2009, Perform. Evaluation.

[17] Ana Busic, et al. Explicit Mean-Square Error Bounds for Monte-Carlo and Linear Stochastic Approximation, 2020, AISTATS.

[18] A. Shwartz, et al. Stochastic approximations for finite-state Markov chains, 1990.

[19] Richard L. Tweedie, et al. Markov Chains and Stochastic Stability, 1993, Communications and Control Engineering Series.

[20] Harold J. Kushner, et al. Stochastic Approximation Algorithms and Applications, 1997, Applications of Mathematics.

[21] V. Borkar, et al. A Concentration Bound for Stochastic Approximation via Alekseev's Formula, 2015, Stochastic Systems.

[22] John N. Tsitsiklis, et al. Average cost temporal-difference learning, 1997, Proceedings of the 36th IEEE Conference on Decision and Control.

[23] Csaba Szepesvári, et al. Linear Stochastic Approximation: How Far Does Constant Step-Size and Iterate Averaging Go?, 2018, AISTATS.

[24] Vijay R. Konda, et al. On Actor-Critic Algorithms, 2003, SIAM J. Control. Optim.

[25] Mahesan Niranjan, et al. On-line Q-learning using connectionist systems, 1994.

[26] Ken R. Duffy, et al. Large deviation asymptotics for busy periods, 2014.

[27] Yuval Tassa, et al. Continuous control with deep reinforcement learning, 2015, ICLR.

[28] Ana Busic, et al. On Matrix Momentum Stochastic Approximation and Applications to Q-learning, 2019, 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[29] David Choi, et al. A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning, 2001, Discret. Event Dyn. Syst.

[30] Richard S. Sutton, et al. Learning to predict by the methods of temporal differences, 1988, Machine Learning.

[31] R. Srikant, et al. Finite-Time Error Bounds For Linear Stochastic Approximation and TD Learning, 2019, COLT.

[32] D. Ruppert, et al. Efficient Estimations from a Slowly Convergent Robbins-Monro Process, 1988.

[33] Sean P. Meyn, et al. TD-learning with exploration, 2011, IEEE Conference on Decision and Control and European Control Conference.

[34] Boris Polyak, et al. Acceleration of stochastic approximation by averaging, 1992.

[35] Peter Dayan, et al. Technical Note: Q-Learning, 2004, Machine Learning.

[36] Benjamin Van Roy, et al. A Tutorial on Thompson Sampling, 2017, Found. Trends Mach. Learn.

[37] S. Meyn, et al. Computable Bounds for Geometric Convergence Rates of Markov Chains, 1994.

[38] C. Watkins. Learning from delayed rewards, 1989.

[39] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[40] Sean P. Meyn, et al. Zap Q-Learning - A User's Guide, 2019, Fifth Indian Control Conference (ICC).

[41] John N. Tsitsiklis, et al. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives, 1999, IEEE Trans. Autom. Control.

[42] John N. Tsitsiklis, et al. Asynchronous Stochastic Approximation and Q-Learning, 1994, Machine Learning.

[43] Sean P. Meyn, et al. Oja's algorithm for graph clustering, Markov spectral decomposition, and risk sensitive control, 2012, Autom.

[44] D. Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods, 2012, arXiv:1212.2015.

[45] D. Bertsekas, et al. Q-learning algorithms for optimal stopping based on least squares, 2007, European Control Conference (ECC).

[46] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[47] M. Metivier, et al. Applications of a Kushner and Clark lemma to general classes of stochastic algorithms, 1984, IEEE Trans. Inf. Theory.

[48] J. Kiefer, et al. Stochastic Estimation of the Maximum of a Regression Function, 1952.

[49] Ana Busic, et al. Zap Q-Learning With Nonlinear Function Approximation, 2019, NeurIPS.

[50] Dimitri P. Bertsekas, et al. Q-learning and policy iteration algorithms for stochastic shortest path problems, 2012, Annals of Operations Research.

[51] Magnus Egerstedt, et al. Performance regulation and tracking via lookahead simulation: Preliminary results and validation, 2017, IEEE 56th Annual Conference on Decision and Control (CDC).

[52] Alex Graves, et al. Asynchronous Methods for Deep Reinforcement Learning, 2016, ICML.

[53] D. Ruppert. A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure, 1985.

[54] Csaba Szepesvári, et al. Algorithms for Reinforcement Learning, 2010, Synthesis Lectures on Artificial Intelligence and Machine Learning.

[55] Sean P. Meyn, et al. The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning, 2000, SIAM J. Control. Optim.

[56] Martin J. Wainwright, et al. Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning, 2019, arXiv:1905.06265.

[57] K. Chung. On a Stochastic Approximation Method, 1954.

[58] Yishay Mansour, et al. Learning Rates for Q-learning, 2004, J. Mach. Learn. Res.

[59] Shie Mannor, et al. Concentration Bounds for Two Timescale Stochastic Approximation with Applications to Reinforcement Learning, 2017, arXiv.

[60] Sean P. Meyn. Control Techniques for Complex Networks: Workload, 2007.

[61] Sean P. Meyn, et al. Q-learning and Pontryagin's Minimum Principle, 2009, Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with the 2009 28th Chinese Control Conference.

[62] Eric Moulines, et al. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning, 2011, NIPS.

[63] Sean P. Meyn, et al. Zap Q-Learning, 2017, NIPS.

[64] John N. Tsitsiklis, et al. Analysis of temporal-difference learning with function approximation, 1996, NIPS.

[65] Pierre Priouret, et al. Adaptive Algorithms and Stochastic Approximations, 1990, Applications of Mathematics.