Polyak-Ruppert Averaged Q-Learning is Statistically Efficient

We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a γ-discounted MDP. We establish asymptotic normality for the averaged iterate Q̄T. Furthermore, we show that Q̄T is in fact a regular asymptotically linear (RAL) estimator of the optimal Q-value function Q∗ with the most efficient influence function, which implies that the averaged Q-learning iterate has the smallest asymptotic variance among all RAL estimators. In addition, we present a non-asymptotic analysis of the ℓ∞ error E‖Q̄T − Q∗‖∞, showing that it matches the instance-dependent lower bound as well as the optimal minimax complexity lower bound. As a byproduct, we find that the Bellman noise has sub-Gaussian coordinates with variance O((1−γ)^{−1}) instead of the prevailing O((1−γ)^{−2}) under the standard bounded-reward assumption. The sub-Gaussian result has the potential to improve the sample complexity of many RL algorithms. In short, our theoretical analysis shows that averaged Q-learning is statistically efficient.
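To make the setting concrete, the following is a minimal sketch of synchronous Q-learning with Polyak-Ruppert averaging on a tabular γ-discounted MDP with a generative model. The toy MDP, the polynomially decaying step size, and all hyperparameters are illustrative assumptions chosen for this sketch, not the paper's exact setup.

```python
# Sketch: synchronous Q-learning with Polyak-Ruppert averaging on a toy
# gamma-discounted MDP. The MDP, step-size schedule, and horizon T below
# are assumptions made for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# A small random MDP: S states, A actions, transition kernel P, bounded rewards r.
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
r = rng.uniform(0.0, 1.0, size=(S, A))       # rewards in [0, 1]

T = 10_000                                   # number of synchronous iterations
Q = np.zeros((S, A))                         # current Q-learning iterate Q_t
Q_bar = np.zeros((S, A))                     # Polyak-Ruppert average Q̄_t

for t in range(1, T + 1):
    # Generative model: one independent next-state sample for every (s, a) pair.
    s_next = np.array([[rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)])

    # Empirical Bellman optimality operator applied to the current iterate.
    target = r + gamma * Q[s_next].max(axis=-1)

    # Q-learning update with a rescaled linear step size (an assumed choice).
    eta = 1.0 / (1.0 + (1.0 - gamma) * t)
    Q = (1.0 - eta) * Q + eta * target

    # Polyak-Ruppert averaging: running average of the iterates.
    Q_bar += (Q - Q_bar) / t

# Compare against Q* computed by value iteration on the known model.
Q_star = np.zeros((S, A))
for _ in range(2_000):
    Q_star = r + gamma * np.einsum("sap,p->sa", P, Q_star.max(axis=1))

print("l-inf error of last iterate Q_T :", np.abs(Q - Q_star).max())
print("l-inf error of averaged    Q̄_T :", np.abs(Q_bar - Q_star).max())
```

In this sketch, the averaged iterate Q̄_T is simply the running mean of the Q-learning iterates, which is the quantity whose asymptotic normality and ℓ∞ error the paper analyzes.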
