Foundations for Restraining Bolts: Reinforcement Learning with LTLf/LDLf Restraining Specifications

In this work we investigate the concept of the “restraining bolt”, as envisioned in science fiction. Specifically, we introduce a novel problem in AI. We have two distinct sets of features extracted from the world: one observed by the agent and one observed by the authority imposing restraining specifications (the “restraining bolt”). The two sets are apparently unrelated, since they are of interest to independent parties; however, they both account for (aspects of) the same world. We consider the case in which the agent is a reinforcement learning agent over the first set of features, while the restraining bolt is specified logically, using linear-time logic on finite traces (LTLf/LDLf), over the second set of features. We show formally, and illustrate with examples, that under general circumstances the agent can learn while shaping its goals to conform, as much as possible, to the restraining bolt specifications.
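
To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of how a restraining bolt can be combined with ordinary tabular Q-learning: the LTLf/LDLf specification is assumed to have been compiled into a DFA over the bolt's own fluents, the agent's state is augmented with the DFA state, and a bonus reward is granted whenever the DFA reaches an accepting state. The environment interface (`reset`, `step`, `actions`) and the reward values are hypothetical placeholders.

```python
# Sketch only: restraining-bolt reward on top of tabular Q-learning.
# The DFA below stands in for a compiled LTLf/LDLf formula; the env API is assumed.

import random
from collections import defaultdict

class RestrainingBolt:
    """DFA over the bolt's fluents, standing in for a compiled LTLf/LDLf formula."""
    def __init__(self, transitions, initial, accepting, reward):
        self.transitions = transitions   # dict: (dfa_state, bolt_fluents) -> dfa_state
        self.initial = initial
        self.accepting = accepting
        self.reward = reward             # bonus granted when an accepting state is reached

    def step(self, dfa_state, bolt_fluents):
        nxt = self.transitions.get((dfa_state, bolt_fluents), dfa_state)
        bonus = self.reward if nxt in self.accepting else 0.0
        return nxt, bonus

def q_learning_with_bolt(env, bolt, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning over the product of agent features and bolt DFA states."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()                  # agent's own features (hypothetical API)
        q = bolt.initial
        done = False
        while not done:
            state = (s, q)
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(state, a_)])
            # The env also exposes the bolt's fluents, which the agent need not interpret.
            s2, r, done, bolt_fluents = env.step(a)
            q2, bonus = bolt.step(q, bolt_fluents)
            best_next = 0.0 if done else max(Q[((s2, q2), a_)] for a_ in env.actions)
            target = r + bonus + gamma * best_next
            Q[(state, a)] += alpha * (target - Q[(state, a)])
            s, q = s2, q2
    return Q
```

The key design point illustrated here is that augmenting the agent's state with the automaton state makes the otherwise non-Markovian bolt reward Markovian over the product state space, so standard RL algorithms apply unchanged.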
