Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints

We formalize the problem of maximizing the mean-payoff value with high probability while satisfying a parity objective in a Markov decision process (MDP) whose probabilistic transition function and reward function are unknown. Assuming that the support of the transition function and a lower bound on the minimal transition probability are known in advance, we show that in MDPs consisting of a single end component, two combinations of guarantees on the parity and mean-payoff objectives can be achieved, depending on how much memory one is willing to use. (i) For all $\epsilon > 0$ and $\gamma > 0$, we can construct an online-learning finite-memory strategy that almost surely satisfies the parity objective and achieves an $\epsilon$-optimal mean payoff with probability at least $1 - \gamma$. (ii) Alternatively, for all $\epsilon > 0$ and $\gamma > 0$, there exists an online-learning infinite-memory strategy that satisfies the parity objective surely and achieves an $\epsilon$-optimal mean payoff with probability at least $1 - \gamma$. We extend the above results to MDPs consisting of more than one end component in a natural way. Finally, we show that the aforementioned guarantees are tight, i.e., there are MDPs for which stronger combinations of the guarantees cannot be ensured.
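To make the setting concrete, the following is a minimal toy sketch of the online-learning ingredient: an agent walks through an MDP whose transition *support* is known but whose exact probabilities are not, accumulates empirical transition counts, and tracks the empirical mean payoff (average reward per step). The MDP, its probabilities, and all names here are hypothetical illustrations, not from the paper, and the sketch deliberately omits the parity objective and the strategy-improvement machinery.

```python
import random

# Hypothetical 2-state, 1-action MDP. The support of TRUE_P is assumed known
# to the learner; the numeric probabilities below are hidden from it.
TRUE_P = {  # (state, action) -> {successor: probability}
    ("s0", "a"): {"s0": 0.7, "s1": 0.3},
    ("s1", "a"): {"s0": 0.4, "s1": 0.6},
}
REWARD = {"s0": 0.0, "s1": 1.0}  # reward received upon entering a state


def step(state, action, rng):
    """Sample a successor from the (hidden) true transition function."""
    succ = TRUE_P[(state, action)]
    return rng.choices(list(succ), weights=list(succ.values()))[0]


def learn_transitions(num_steps, seed=0):
    """Run one long trajectory, estimating transition probabilities from
    empirical counts and recording the empirical mean payoff."""
    rng = random.Random(seed)
    counts = {k: {s: 0 for s in v} for k, v in TRUE_P.items()}
    total_reward = 0.0
    state = "s0"
    for _ in range(num_steps):
        nxt = step(state, "a", rng)
        counts[(state, "a")][nxt] += 1
        total_reward += REWARD[nxt]
        state = nxt
    estimates = {
        k: {s: c / max(1, sum(v.values())) for s, c in v.items()}
        for k, v in counts.items()
    }
    mean_payoff = total_reward / num_steps  # empirical average reward per step
    return estimates, mean_payoff
```

In the paper's setting, such empirical estimates (together with the known lower bound on the minimal transition probability) would feed into a strategy that balances exploration against the parity and mean-payoff guarantees; this sketch only shows the estimation step.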
