A Reduction from Reinforcement Learning to No-Regret Online Learning

We present a reduction from reinforcement learning (RL) to no-regret online learning based on the saddle-point formulation of RL, by which any online algorithm with sublinear regret can generate policies with provable performance guarantees. This new perspective decouples the RL problem into two parts: regret minimization and function approximation. The first part admits a standard online-learning analysis, and the second part can be quantified independently of the learning algorithm. The proposed reduction can therefore be used as a tool to systematically design new RL algorithms. We demonstrate this idea by devising a simple RL algorithm based on mirror descent and a generative-model oracle. For any $\gamma$-discounted tabular RL problem, with probability at least $1-\delta$, it learns an $\epsilon$-optimal policy using at most $\tilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\log(\frac{1}{\delta})}{(1-\gamma)^4\epsilon^2}\right)$ samples. Furthermore, the algorithm admits a direct extension to linearly parameterized function approximators for large-scale applications, with computation and sample complexities independent of $|\mathcal{S}|$ and $|\mathcal{A}|$, though at the cost of a potential approximation bias.
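To make the starting point concrete, the following is the standard Lagrangian of the linear-programming formulation of a $\gamma$-discounted MDP (notation assumed here: initial distribution $p_0$, transition kernel $P$, reward $r$). This is the generic form of saddle-point problem that such a reduction builds on; the exact constraint sets and normalization used in the paper may differ.

$$\min_{v \in \mathbb{R}^{|\mathcal{S}|}} \; \max_{\mu \ge 0} \; L(v,\mu) \;=\; (1-\gamma)\, p_0^\top v \;+\; \sum_{s,a} \mu(s,a)\Big(r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, v(s') - v(s)\Big).$$

At the saddle point, $v$ recovers the optimal value function, the (normalized) $\mu$ recovers the discounted state-action occupancy of an optimal policy, and a policy can be read off as $\pi(a\mid s) \propto \mu(s,a)$.

As a further illustration, below is a minimal, hypothetical NumPy sketch of the kind of algorithm the abstract alludes to: the $v$-player runs projected (Euclidean) mirror descent and the $\mu$-player runs entropic mirror descent (exponentiated gradient) on the simplex, with stochastic gradients obtained from a generative-model oracle. All function names, step sizes, and the per-iteration sampling scheme are illustrative choices, not the paper's; in particular, the $\mu$-gradient here queries the oracle once per state-action pair, which keeps the code simple but is not sample-optimal.

```python
# Illustrative sketch (not the paper's algorithm verbatim): saddle-point mirror
# descent for a tabular gamma-discounted MDP with a generative-model oracle.
# Names such as run_mirror_descent and sample_next_state are hypothetical.
import numpy as np

def run_mirror_descent(P, r, p0, gamma, iters=5000, seed=0):
    """P: (S, A, S) transition tensor, r: (S, A) rewards in [0, 1], p0: (S,) initial dist."""
    rng = np.random.default_rng(seed)
    S, A = r.shape
    v_max = 1.0 / (1.0 - gamma)                      # value functions live in [0, v_max]
    v = np.zeros(S)                                   # v-player: projected (Euclidean) mirror descent
    mu = np.full((S, A), 1.0 / (S * A))               # mu-player: entropic mirror descent on the simplex
    mu_sum = np.zeros((S, A))
    eta_v = eta_mu = 1.0 / np.sqrt(iters)             # crude step sizes, for illustration only

    def sample_next_state(s, a):
        # Generative-model oracle: one draw of s' ~ P(. | s, a).
        return rng.choice(S, p=P[s, a])

    for _ in range(iters):
        # Stochastic gradient for the v-player (one oracle call):
        # unbiased for (1-gamma) p0 + sum_{s,a} mu(s,a) (gamma P(.|s,a) - e_s) when mu is on the simplex.
        s0 = rng.choice(S, p=p0)
        s, a = divmod(rng.choice(S * A, p=mu.ravel()), A)
        s_next = sample_next_state(s, a)
        g_v = np.zeros(S)
        g_v[s0] += 1.0 - gamma
        g_v[s_next] += gamma
        g_v[s] -= 1.0
        v = np.clip(v - eta_v * v_max * g_v, 0.0, v_max)          # descent step, then projection

        # Stochastic gradient for the mu-player (|S||A| oracle calls; simple, not sample-optimal):
        # one-sample estimate of the Bellman residual r(s,a) + gamma E[v(s')] - v(s) per coordinate.
        s_primes = np.array([[sample_next_state(si, ai) for ai in range(A)] for si in range(S)])
        delta = r + gamma * v[s_primes] - v[:, None]
        mu = mu * np.exp(eta_mu * delta)                          # exponentiated-gradient ascent
        mu /= mu.sum()
        mu_sum += mu

    mu_bar = mu_sum / iters
    policy = mu_bar / np.maximum(mu_bar.sum(axis=1, keepdims=True), 1e-12)  # pi(a|s) from averaged mu
    return policy
```

Given tabular P, r, and p0, calling run_mirror_descent(P, r, p0, gamma=0.9) returns a stochastic policy; averaging the iterates before extracting the policy mirrors the usual online-to-batch conversion behind regret-based guarantees.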
