Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints

We formalize the problem of maximizing the mean-payoff value with high probability while satisfying a parity objective in a Markov decision process (MDP) whose probabilistic transition function and reward function are unknown. Assuming that the support of the transition function and a lower bound on the minimal transition probability are known in advance, we show that in MDPs consisting of a single end component, two combinations of guarantees on the parity and mean-payoff objectives can be achieved, depending on how much memory one is willing to use. (i) For all $\epsilon > 0$ and $\gamma > 0$, we can construct an online-learning finite-memory strategy that almost surely satisfies the parity objective and achieves an $\epsilon$-optimal mean payoff with probability at least $1 - \gamma$. (ii) Alternatively, for all $\epsilon > 0$ and $\gamma > 0$, there exists an online-learning infinite-memory strategy that satisfies the parity objective surely and achieves an $\epsilon$-optimal mean payoff with probability at least $1 - \gamma$. We extend the above results to MDPs consisting of more than one end component in a natural way. Finally, we show that the aforementioned guarantees are tight, i.e., there are MDPs for which stronger combinations of the guarantees cannot be ensured.
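To make the setting concrete, the following is a minimal toy sketch of the online-learning ingredient: an agent walks through an MDP whose transition *support* is known but whose exact probabilities are not, accumulates empirical transition counts, and tracks the empirical mean payoff (average reward per step). The MDP, its probabilities, and all names here are hypothetical illustrations, not from the paper, and the sketch deliberately omits the parity objective and the strategy-improvement machinery.

```python
import random

# Hypothetical 2-state, 1-action MDP. The support of TRUE_P is assumed known
# to the learner; the numeric probabilities below are hidden from it.
TRUE_P = {  # (state, action) -> {successor: probability}
    ("s0", "a"): {"s0": 0.7, "s1": 0.3},
    ("s1", "a"): {"s0": 0.4, "s1": 0.6},
}
REWARD = {"s0": 0.0, "s1": 1.0}  # reward received upon entering a state


def step(state, action, rng):
    """Sample a successor from the (hidden) true transition function."""
    succ = TRUE_P[(state, action)]
    return rng.choices(list(succ), weights=list(succ.values()))[0]


def learn_transitions(num_steps, seed=0):
    """Run one long trajectory, estimating transition probabilities from
    empirical counts and recording the empirical mean payoff."""
    rng = random.Random(seed)
    counts = {k: {s: 0 for s in v} for k, v in TRUE_P.items()}
    total_reward = 0.0
    state = "s0"
    for _ in range(num_steps):
        nxt = step(state, "a", rng)
        counts[(state, "a")][nxt] += 1
        total_reward += REWARD[nxt]
        state = nxt
    estimates = {
        k: {s: c / max(1, sum(v.values())) for s, c in v.items()}
        for k, v in counts.items()
    }
    mean_payoff = total_reward / num_steps  # empirical average reward per step
    return estimates, mean_payoff
```

In the paper's setting, such empirical estimates (together with the known lower bound on the minimal transition probability) would feed into a strategy that balances exploration against the parity and mean-payoff guarantees; this sketch only shows the estimation step.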
