Unknown mixing times in apprenticeship and reinforcement learning

We derive and analyze learning algorithms for apprenticeship learning, policy evaluation, and policy gradient under the average-reward criterion. Existing algorithms explicitly require an upper bound on the mixing time of the underlying Markov chain. In contrast, we build on ideas from Markov chain theory and derive sampling algorithms that do not require such an upper bound. For these algorithms, we provide theoretical bounds on their sample complexity and running time.
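The abstract does not spell out which ideas from Markov chain theory are used, but a classical example of a sampling procedure with exactly this property is Propp and Wilson's coupling from the past (CFTP): it returns an exact draw from the stationary distribution of an ergodic finite chain, and it halts in time governed by the chain's actual coalescence behavior rather than by any a priori mixing-time bound. The sketch below is a minimal illustration of that idea, not the paper's algorithm; the function name `cftp_sample`, the per-time-step seeding scheme, and the example chain `P` are our own illustrative assumptions.

```python
# Minimal coupling-from-the-past sketch for a finite ergodic Markov chain.
# Returns an exact sample from the stationary distribution without needing
# any upper bound on the mixing time. Illustrative only, not from the paper.
import numpy as np

def cftp_sample(P, seed=0):
    """Exact stationary sample of the chain with transition matrix P."""
    n = P.shape[0]
    cdf = np.cumsum(P, axis=1)      # row-wise CDFs for inverse-transform steps
    cdf[:, -1] = 1.0                # guard against floating-point rounding
    T = 1
    while True:
        # Compose the random update maps f_{-T}, ..., f_{-1}. The map for
        # time -t is regenerated from a fixed per-time seed, so randomness
        # already used is reused exactly when we extend further into the past.
        state = np.arange(n)        # one trajectory per starting state
        for t in range(T, 0, -1):
            rng = np.random.default_rng((seed, t))  # fixed randomness for time -t
            u = rng.random(n)                       # one uniform per state
            state = np.array([np.searchsorted(cdf[s], u[s]) for s in state])
        if np.all(state == state[0]):   # all trajectories have coalesced:
            return int(state[0])        # exact draw from the stationary law
        T *= 2                          # otherwise, restart further in the past

# Usage: for this two-state chain the stationary distribution is (2/3, 1/3),
# and the empirical frequencies of the draws converge to it.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
draws = [cftp_sample(P, seed=k) for k in range(10000)]
```

Note that the running time of such a procedure adapts to the chain's actual (unknown) coalescence time, which is the kind of guarantee the abstract refers to: correctness never depends on knowing a mixing-time bound in advance.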
