Deep Reinforcement Learning with Temporal Logics

The combination of data-driven learning methods with formal reasoning has seen a surge of interest, as each area has the potential to bolster the other. For instance, formal methods promise to expand the use of state-of-the-art learning approaches in the direction of certification and sample efficiency. In this work, we propose a deep Reinforcement Learning (RL) method for policy synthesis in unknown environments with continuous state and action spaces, under requirements expressed in Linear Temporal Logic (LTL). We show that this combination extends the applicability of deep RL to complex temporal and memory-dependent policy synthesis goals. We express an LTL specification as a Limit Deterministic Büchi Automaton (LDBA) and synchronise it on-the-fly with the agent/environment. In practice, the LDBA monitors the environment and acts as a modular reward machine for the agent. Accordingly, a modular Deep Deterministic Policy Gradient (DDPG) architecture is proposed to generate a low-level control policy that maximises the probability of satisfying the given LTL formula. We evaluate our framework on a cart-pole example and a Mars rover experiment, where we achieve near-perfect success rates, while baselines based on standard RL are shown to fail in practice.
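
To make the on-the-fly synchronisation concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes a toy one-dimensional environment, a hand-written deterministic automaton standing in for an LDBA that encodes "visit region a, then region b", and a reward of 1 on first entering an accepting state. The names `Automaton`, `label_of`, `product_step` and `rollout` are hypothetical and introduced only for illustration; no DDPG training is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Automaton:
    """Hypothetical deterministic stand-in for an LDBA."""
    initial: int
    accepting: frozenset
    transitions: dict  # maps (state, label) -> next state

    def step(self, q, label):
        # Unlisted (state, label) pairs leave the automaton state unchanged.
        return self.transitions.get((q, label), q)


def label_of(x):
    """Assumed labelling function: maps a 1-D position to an atomic proposition."""
    if x <= -1.0:
        return "a"
    if x >= 1.0:
        return "b"
    return "none"


# Automaton for "visit a, then visit b": 0 --a--> 1 --b--> 2 (accepting).
AUTOMATON = Automaton(
    initial=0,
    accepting=frozenset({2}),
    transitions={(0, "a"): 1, (1, "b"): 2},
)


def product_step(x, q, action, automaton=AUTOMATON):
    """One synchronised step of the toy environment and the automaton."""
    x_next = x + action  # toy deterministic dynamics (an assumption)
    q_next = automaton.step(q, label_of(x_next))
    # The automaton, not the environment, generates the reward signal.
    entered = q_next in automaton.accepting and q not in automaton.accepting
    reward = 1.0 if entered else 0.0
    return x_next, q_next, reward


def rollout(actions, x0=0.0):
    """Apply a fixed action sequence; return the final automaton state and total reward."""
    x, q, total = x0, AUTOMATON.initial, 0.0
    for a in actions:
        x, q, r = product_step(x, q, a)
        total += r
    return q, total
```

In this sketch the pair `(x, q)` plays the role of the product state: the automaton component `q` records which part of the task has already been completed, which is what makes the objective memory-dependent. Reaching region b before region a earns no reward, because the automaton is still in its initial state. In a modular architecture, `q` could additionally be used to select which policy module acts.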
