Finite-Memory Near-Optimal Learning for Markov Decision Processes with Long-Run Average Reward

We consider learning policies online in Markov decision processes with the long-run average reward (a.k.a. mean payoff). To ensure that the learned policies are implementable, we focus on policies with finite memory. First, we show that near optimality can be achieved almost surely, using an unintuitive gadget we call forgetfulness. Second, we extend the approach to a setting with only partial knowledge of the system topology, introducing two optimality measures and providing near-optimal algorithms for these cases as well.
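For intuition, the long-run average reward (mean payoff) of a run is the limiting average of the rewards collected along it, lim inf_{T→∞} (1/T) Σ_{t=1..T} r_t. The Python sketch below simulates a finite-memory policy on a toy MDP and estimates its mean payoff empirically. It is a minimal illustration of the objective and the policy class only: the example MDP, the two-mode memory update, and all names are assumptions made for this sketch, not the paper's forgetfulness construction.

    import random

    # A toy MDP, purely illustrative; not taken from the paper.
    # transitions[state][action] = list of (next_state, probability, reward)
    transitions = {
        "s0": {"a": [("s0", 0.5, 1.0), ("s1", 0.5, 0.0)],
               "b": [("s1", 1.0, 0.0)]},
        "s1": {"a": [("s0", 1.0, 0.5)],
               "b": [("s1", 1.0, 0.2)]},
    }

    def step(state, action):
        # Sample a successor according to the transition probabilities.
        successors = transitions[state][action]
        r, acc = random.random(), 0.0
        for nxt, p, rew in successors:
            acc += p
            if r <= acc:
                return nxt, rew
        return successors[-1][0], successors[-1][2]

    # A finite-memory policy: the chosen action depends on the current
    # state and a bounded memory value, updated after every step.
    def policy(state, memory):
        return "a" if memory == 0 else "b"

    def update_memory(memory, state):
        return (memory + 1) % 2  # two memory modes, chosen arbitrarily here

    def mean_payoff_estimate(horizon=100_000):
        state, memory, total = "s0", 0, 0.0
        for _ in range(horizon):
            action = policy(state, memory)
            state, reward = step(state, action)
            memory = update_memory(memory, state)
            total += reward
        return total / horizon  # empirical long-run average reward

    print(mean_payoff_estimate())

Running the simulation for a long horizon gives a Monte Carlo estimate of the mean payoff achieved by this particular finite-memory policy; the paper's concern is how to learn such a policy online so that this quantity is near-optimal almost surely.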
