Modular Deep Reinforcement Learning with Temporal Logic Specifications

We propose an actor-critic, model-free, online Reinforcement Learning (RL) framework for continuous-state, continuous-action Markov Decision Processes (MDPs) in which the reward is highly sparse but encodes a high-level temporal structure. We represent this temporal structure as a finite-state machine and construct, on the fly, the synchronised product of the MDP and the finite-state machine. Within this product, the temporal structure guides the RL agent, and a modular Deep Deterministic Policy Gradient (DDPG) architecture generates a low-level control policy. We evaluate the framework in a Mars rover experiment and report the success rate of the synthesised policy.
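
To illustrate the on-the-fly synchronised product and the modular policy structure described above, the following is a minimal Python sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the toy two-dimensional dynamics `env_step`, the labelling function `label_of`, the three-state automaton, the reward of 1.0 on first reaching an accepting automaton state, and the fixed hand-written functions that stand in for the per-state DDPG actor networks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

State = Tuple[float, float]   # hypothetical continuous MDP state (2-D position)
Action = Tuple[float, float]  # hypothetical continuous action (2-D displacement)


@dataclass(frozen=True)
class Automaton:
    """Finite-state machine encoding the temporal structure of the task."""
    initial: int
    accepting: FrozenSet[int]
    transitions: Dict[Tuple[int, str], int]

    def step(self, q: int, label: str) -> int:
        # Assumption: stay in the current state when no transition is defined.
        return self.transitions.get((q, label), q)


def label_of(s: State) -> str:
    """Hypothetical labelling function mapping MDP states to atomic labels."""
    x, _ = s
    if x >= 2.0:
        return "goal"
    if x >= 1.0:
        return "checkpoint"
    return "none"


def env_step(s: State, a: Action) -> State:
    """Toy deterministic dynamics, standing in for the unknown MDP."""
    return (s[0] + a[0], s[1] + a[1])


def product_step(aut: Automaton, s: State, q: int, a: Action):
    """One step of the product: the MDP and the automaton move in lockstep.

    The product state (s, q) is computed only when visited, so the full
    product is never built in advance.
    """
    s_next = env_step(s, a)
    q_next = aut.step(q, label_of(s_next))
    # Assumed sparse reward: paid once, on first entering an accepting state.
    reward = 1.0 if (q_next in aut.accepting and q not in aut.accepting) else 0.0
    return s_next, q_next, reward


def rollout(aut: Automaton,
            modules: Dict[int, Callable[[State], Action]],
            s: State,
            max_steps: int = 20):
    """Run the modular policy: the automaton state selects the active module."""
    q = aut.initial
    trace: List[int] = [q]
    total = 0.0
    for _ in range(max_steps):
        if q in aut.accepting:
            break
        action = modules[q](s)  # one module (actor) per automaton state
        s, q, r = product_step(aut, s, q, action)
        total += r
        trace.append(q)
    return s, q, total, trace


# Task "visit the checkpoint, then the goal" as a three-state automaton.
automaton = Automaton(
    initial=0,
    accepting=frozenset({2}),
    transitions={(0, "checkpoint"): 1, (1, "goal"): 2},
)

# Placeholder modules; in a DDPG setting each would be a trained actor network.
modules = {
    0: lambda s: (0.5, 0.0),
    1: lambda s: (0.25, 0.0),
}

final_state, final_q, total_reward, q_trace = rollout(automaton, modules, (0.0, 0.0))
```

In this sketch the automaton state acts as the switch between modules, so each placeholder actor only has to solve the sub-task associated with its own automaton state, and the sparse reward is attached to automaton progress rather than to raw MDP states.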
