Omega-Regular Objectives in Model-Free Reinforcement Learning

We provide the first solution for model-free reinforcement learning of \(\omega \)-regular objectives for Markov decision processes (MDPs). We present a constructive reduction from the almost-sure satisfaction of \(\omega \)-regular objectives to an almost-sure reachability problem, and extend this technique to learning how to control an unknown model so that the chance of satisfying the objective is maximized. We compile \(\omega \)-regular properties into limit-deterministic Büchi automata instead of the traditional Rabin automata; this choice sidesteps difficulties that have marred previous proposals. Our approach allows us to apply model-free, off-the-shelf reinforcement learning algorithms to compute optimal strategies from observations of the MDP. We present an experimental evaluation of our technique on benchmark learning problems.
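The core idea in the abstract can be illustrated with a minimal sketch: on the product of an MDP with a limit-deterministic Büchi automaton, each accepting transition is redirected, with probability \(1-\zeta \), to a rewarding sink, so that maximizing the probability of reaching the sink approximates maximizing the probability of satisfying the objective, and any off-the-shelf learner (here, tabular Q-learning) applies. The three-state toy product, the parameter `ZETA`, and all hyperparameters below are illustrative assumptions, not the paper's construction or benchmarks.

```python
import random

# Toy product MDP (a hypothetical example, not one of the paper's benchmarks):
#   state 0: initial; state 1: "good" region; state 2: an absorbing trap.
#   Action 'b' moves to state 1, and entering state 1 is an accepting
#   transition of the product; action 'a' falls into the trap, from which
#   no transition is accepting. This models the objective
#   "visit state 1 infinitely often".

ZETA = 0.9                  # continuation probability on accepting transitions
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.2
ACTIONS = ('a', 'b')

def step(state, action):
    """Return (next_state, accepting?) for the toy product MDP."""
    if state == 2:          # trap: absorbing, never accepting
        return 2, False
    if action == 'b':       # entering state 1 is an accepting transition
        return 1, True
    return 2, False

def q_learn(episodes=2000, horizon=50, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1, 2) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            if rng.random() < EPSILON:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: Q[(s, x)])
            nxt, accepting = step(s, a)
            # The reduction: on an accepting transition, with probability
            # 1 - ZETA jump to the rewarding sink (terminal reward 1).
            if accepting and rng.random() > ZETA:
                Q[(s, a)] += ALPHA * (1.0 - Q[(s, a)])
                break
            target = GAMMA * max(Q[(nxt, x)] for x in ACTIONS)
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s = nxt
    return Q

Q = q_learn()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in (0, 1)}
```

Under these assumptions the learned strategy chooses `'b'` in states 0 and 1, i.e. it keeps taking accepting transitions, which is exactly the behavior that satisfies the Büchi objective almost surely.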

[1]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[2]  Moshe Y. Vardi Automatic verification of probabilistic concurrent finite state programs , 1985, 26th Annual Symposium on Foundations of Computer Science (sfcs 1985).

[3]  Calin Belta,et al.  Temporal Logic Motion Planning and Control With Probabilistic Satisfaction Guarantees , 2012, IEEE Transactions on Robotics.

[5]  Wolfgang Thomas, et al.  Automata on Infinite Objects , 1991, Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics.

[6]  Martin L. Puterman,et al.  Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[7]  Ufuk Topcu,et al.  Probably Approximately Correct MDP Learning and Control With Temporal Logic Constraints , 2014, Robotics: Science and Systems.

[8]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[9]  Wojciech Zaremba,et al.  OpenAI Gym , 2016, ArXiv.

[10]  Jan Kretínský,et al.  Limit-Deterministic Büchi Automata for Linear Temporal Logic , 2016, CAV.

[11]  Christel Baier,et al.  Principles of model checking , 2008 .

[12]  Jan Kretínský,et al.  MoChiBA: Probabilistic LTL Model Checking Using Limit-Deterministic Büchi Automata , 2016, ATVA.

[13]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[14]  Olivier Carton,et al.  Computing the Rabin Index of a Parity Automaton , 1999, RAIRO Theor. Informatics Appl..

[15]  Zohar Manna,et al.  Formal verification of probabilistic systems , 1997 .

[16]  Toshimitsu Ushio,et al.  Learning an Optimal Control Policy for a Markov Decision Process Under Linear Temporal Logic Specifications , 2015, 2015 IEEE Symposium Series on Computational Intelligence.

[17]  Lihong Li,et al.  PAC model-free reinforcement learning , 2006, ICML.

[18]  Tom Eccles,et al.  An investigation of model-free planning , 2019, ICML.

[19]  Lijun Zhang,et al.  Lazy Probabilistic Model Checking without Determinisation , 2013, CONCUR.

[21]  Zohar Manna,et al.  The Temporal Logic of Reactive and Concurrent Systems , 1991, Springer New York.

[22]  Jean-Eric Pin,et al.  Infinite words - automata, semigroups, logic and games , 2004, Pure and applied mathematics series.

[24]  Jan Kretínský,et al.  The Hanoi Omega-Automata Format , 2015, CAV.

[25]  Krishnendu Chatterjee,et al.  Automata with Generalized Rabin Pairs for Probabilistic Model Checking and LTL Synthesis , 2013, CAV.

[26]  Martin A. Riedmiller Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method , 2005, ECML.

[27]  Peter Vrancx,et al.  Reinforcement Learning: State-of-the-Art , 2012 .

[28]  A. Shwartz,et al.  Handbook of Markov decision processes : methods and applications , 2002 .

[29]  Calin Belta,et al.  Reinforcement learning with temporal logic rewards , 2016, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[30]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[31]  S. Shankar Sastry,et al.  A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications , 2014, 53rd IEEE Conference on Decision and Control.

[32]  Daniel Kroening,et al.  Logically-Correct Reinforcement Learning , 2018, ArXiv.

[33]  Daniel Kroening,et al.  Certified Reinforcement Learning with Logic Guidance , 2019, Artif. Intell..

[35]  Krishnendu Chatterjee,et al.  Verification of Markov Decision Processes Using Learning Algorithms , 2014, ATVA.

[36]  Robert K. Brayton, et al.  The Rabin Index and Chain Automata, with Applications to Automata and Games , 1995, CAV.

[37]  Eugene A. Feinberg,et al.  Handbook of Markov Decision Processes , 2002 .

[38]  Mihalis Yannakakis,et al.  The complexity of probabilistic verification , 1995, JACM.

[39]  Amir Pnueli,et al.  Verification of multiprocess probabilistic protocols , 2005, Distributed Computing.

[40]  Marta Z. Kwiatkowska,et al.  PRISM 4.0: Verification of Probabilistic Real-Time Systems , 2011, CAV.