Value function estimation in Markov reward processes: Instance-dependent 𝓁∞-bounds for policy evaluation

Markov reward processes (MRPs) are used to model stochastic phenomena arising in operations research, control engineering, robotics, artificial intelligence, as well as communication and transportation networks. In many of these cases, such as in the policy evaluation problem encountered in reinforcement learning, the goal is to estimate the long-term value function of such a process without access to the underlying population transition and reward functions. Working with samples generated under the synchronous model, we study the problem of estimating the value function of an infinite-horizon, discounted MRP in the $\ell_\infty$-norm. We analyze both the standard plug-in approach to this problem and a more robust variant, and establish non-asymptotic bounds that depend on the (unknown) problem instance, as well as data-dependent bounds that can be evaluated based on the observed data. We show that these approaches are minimax-optimal up to constant factors over natural sub-classes of MRPs. Our analysis makes use of a leave-one-out decoupling argument tailored to the policy evaluation problem, one which may be of independent interest.
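
To make the "plug-in" approach referenced above concrete, the sketch below shows a minimal version of the idea under the synchronous observation model: form the empirical transition matrix and reward vector from the samples, then solve the empirical Bellman equation. This is an illustrative sketch only; the function and variable names (e.g. `plugin_value_estimate`) are not taken from the paper, and the paper's robust variant and instance-dependent guarantees are not reproduced here.

```python
import numpy as np

def plugin_value_estimate(next_states, rewards, gamma):
    """Plug-in estimate of the value function of a discounted MRP.

    next_states: int array of shape (n, S). Under the synchronous model,
        column s holds n i.i.d. next-state samples drawn from the unknown
        transition distribution P(. | s).
    rewards: array of shape (S,), the (empirical) reward at each state.
    gamma: discount factor in [0, 1).
    """
    n, S = next_states.shape

    # Empirical transition matrix: row s is the empirical distribution
    # of the n next-state samples observed from state s.
    P_hat = np.zeros((S, S))
    for s in range(S):
        states, counts = np.unique(next_states[:, s], return_counts=True)
        P_hat[s, states] = counts / n

    # Plug-in value estimate: solve the empirical Bellman equation
    #   V_hat = r + gamma * P_hat @ V_hat,
    # i.e. (I - gamma * P_hat) V_hat = r, which is invertible for gamma < 1.
    return np.linalg.solve(np.eye(S) - gamma * P_hat, rewards)
```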
