Logically-Correct Reinforcement Learning

We propose a novel Reinforcement Learning (RL) algorithm to synthesize policies for a Markov Decision Process (MDP), such that a linear time property is satisfied. We convert the property into a Limit Deterministic Buchi Automaton (LDBA), then construct a product MDP between the automaton and the original MDP. A reward function is then assigned to the states of the product automaton, according to accepting conditions of the LDBA. With this reward function, RL synthesizes a policy that satisfies the property: as such, the policy synthesis procedure is "constrained" by the given specification. Additionally, we show that the RL procedure sets up an online value iteration method to calculate the maximum probability of satisfying the given property, at any given state of the MDP - a convergence proof for the procedure is provided. Finally, the performance of the algorithm is evaluated via a set of numerical examples. We observe an improvement of one order of magnitude in the number of iterations required for the synthesis compared to existing approaches.

[1]  Nir Piterman From Nondeterministic Büchi and Streett Automata to Deterministic Parity Automata , 2007, Log. Methods Comput. Sci..

[2]  Peter Dayan,et al.  Q-learning , 1992, Machine Learning.

[3]  Ufuk Topcu,et al.  Probably Approximately Correct MDP Learning and Control With Temporal Logic Constraints , 2014, Robotics: Science and Systems.

[4]  E. Feinberg,et al.  An Inequality for Variances of the Discounted Rewards , 2009, Journal of Applied Probability.

[5]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[6]  Rajeev Alur,et al.  Deterministic generators and games for LTL fragments , 2001, Proceedings 16th Annual IEEE Symposium on Logic in Computer Science.

[7]  Doina Precup,et al.  Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning , 1999, Artif. Intell..

[8]  Marco Wiering,et al.  Reinforcement Learning and Markov Decision Processes , 2012, Reinforcement Learning.

[9]  Orna Kupferman,et al.  Model Checking of Safety Properties , 1999, Formal Methods Syst. Des..

[10]  Andrea Lockerd Thomaz,et al.  Teachable robots: Understanding human teaching behavior to build more effective robot learners , 2008, Artif. Intell..

[11]  Wei Ren,et al.  Game theory control solution for sensor coverage problem in unknown environment , 2014, 53rd IEEE Conference on Decision and Control.

[12]  Joost-Pieter Katoen,et al.  Quantitative model-checking of controlled discrete-time Markov processes , 2014, Inf. Comput..

[13]  Jan Kretínský,et al.  MoChiBA: Probabilistic LTL Model Checking Using Limit-Deterministic Büchi Automata , 2016, ATVA.

[14]  Calin Belta,et al.  Reinforcement learning with temporal logic rewards , 2016, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[15]  Jan Kretínský,et al.  Limit-Deterministic Büchi Automata for Linear Temporal Logic , 2016, CAV.

[16]  Calin Belta,et al.  Optimal path planning for surveillance with temporal-logic constraints* , 2011, Int. J. Robotics Res..

[17]  Marta Z. Kwiatkowska,et al.  PRISM 4.0: Verification of Probabilistic Real-Time Systems , 2011, CAV.

[18]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[19]  Ufuk Topcu,et al.  Robust control of uncertain Markov Decision Processes with temporal logic specifications , 2012, 2012 IEEE 51st IEEE Conference on Decision and Control (CDC).

[20]  Christel Baier,et al.  Principles of model checking , 2008 .

[21]  Pieter Abbeel,et al.  An Application of Reinforcement Learning to Aerobatic Helicopter Flight , 2006, NIPS.

[22]  John N. Tsitsiklis,et al.  Neuro-dynamic programming: an overview , 1995, Proceedings of 1995 34th IEEE Conference on Decision and Control.

[23]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[24]  Mohammadhosein Hasanbeig,et al.  On Synchronous Binary Log-Linear Learning and Second Order Q-learning , 2017 .

[25]  S. Safra,et al.  On the complexity of omega -automata , 1988, [Proceedings 1988] 29th Annual Symposium on Foundations of Computer Science.

[26]  Calin Belta,et al.  Optimal control of MDPs with temporal logic constraints , 2013, 52nd IEEE Conference on Decision and Control.

[27]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[28]  S. Shankar Sastry,et al.  A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications , 2014, 53rd IEEE Conference on Decision and Control.

[29]  Sebastian Junges,et al.  Safety-Constrained Reinforcement Learning for MDPs , 2015, TACAS.

[30]  Ufuk Topcu,et al.  Safe Reinforcement Learning via Shielding , 2017, AAAI.

[31]  Krishnendu Chatterjee,et al.  Value Iteration for Long-Run Average Reward in Markov Decision Processes , 2017, CAV.

[32]  R. Durrett Essentials of Stochastic Processes , 1999 .

[33]  Michael Kearns,et al.  Near-Optimal Reinforcement Learning in Polynomial Time , 2002, Machine Learning.

[34]  Xu Chu Ding,et al.  Strategic planning under uncertainties via constrained Markov Decision Processes , 2013, 2013 IEEE International Conference on Robotics and Automation.

[35]  Martin L. Puterman,et al.  Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[36]  Ufuk Topcu,et al.  Correct-by-synthesis reinforcement learning with temporal logic constraints , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[37]  Marta Z. Kwiatkowska,et al.  Pareto Curves for Probabilistic Model Checking , 2012, ATVA.

[38]  Calin Belta,et al.  Motion planning and control from temporal logic specifications with probabilistic satisfaction guarantees , 2010, 2010 IEEE International Conference on Robotics and Automation.

[39]  Mohammadhosein Hasanbeig,et al.  Multi-agent Learning in Coverage Control Games , 2016 .

[40]  Amir Pnueli,et al.  The temporal logic of programs , 1977, 18th Annual Symposium on Foundations of Computer Science (sfcs 1977).

[41]  Patrick Doherty,et al.  Model-Based Reinforcement Learning in Continuous Environments Using Real-Time Constrained Optimization , 2015, AAAI.