Average reward reinforcement learning with unknown mixing times

We derive and analyze learning algorithms for policy evaluation, apprenticeship learning, and policy gradient under the average-reward criterion. Existing algorithms explicitly require an upper bound on the mixing time. In contrast, building on ideas from Markov chain theory, we derive sampling algorithms that do not require such an upper bound. For these algorithms, we provide theoretical bounds on their sample complexity and running time.
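To make the average-reward criterion concrete, the following is a minimal sketch (not the paper's algorithm) of policy evaluation for a fixed policy: simulate the policy-induced Markov chain for a long horizon and average the observed rewards, which for an ergodic chain converges to the gain. The names `average_reward_policy_evaluation`, `P`, `r`, and `pi` are illustrative, and the fixed horizon `n_steps` stands in for the mixing-time-dependent analysis the paper is concerned with.

```python
import numpy as np

def average_reward_policy_evaluation(P, r, pi, n_steps=100_000, s0=0, seed=0):
    """Monte Carlo estimate of the long-run average reward of a fixed policy.

    P[s, a, s']: transition probabilities, r[s, a]: rewards, pi[s, a]: policy.
    Runs one long trajectory and averages rewards; for an ergodic chain this
    converges to the gain rho = sum_s mu(s) sum_a pi(a|s) r(s, a), where mu is
    the stationary distribution of the policy-induced chain.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = P.shape
    s, total = s0, 0.0
    for _ in range(n_steps):
        a = rng.choice(n_actions, p=pi[s])   # sample action from the policy
        total += r[s, a]
        s = rng.choice(n_states, p=P[s, a])  # sample next state
    return total / n_steps

# Usage on a tiny two-state, two-action MDP (hypothetical numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])
print(average_reward_policy_evaluation(P, r, pi))
```

How large `n_steps` must be for a given accuracy depends on how quickly the chain mixes, which is exactly the quantity the paper avoids having to bound in advance.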
