A formal methods approach to interpretable reinforcement learning for robotic planning

A formal methods approach to reinforcement learning generates rewards from a formal language and guarantees safety.

Growing interest in reinforcement learning approaches to robotic planning and control raises concerns about the predictability and safety of robot behaviors realized solely through learned control policies. In addition, formally defining reward functions for complex tasks is challenging, and faulty rewards are prone to exploitation by the learning agent. Here, we propose a formal methods approach to reinforcement learning that (i) provides a formal specification language that integrates high-level, rich task specifications with a priori, domain-specific knowledge; (ii) makes the reward generation process easily interpretable; (iii) guides the policy generation process according to the specification; and (iv) guarantees the satisfaction of the (critical) safety component of the specification. The main ingredients of our computational framework are a predicate temporal logic specifically tailored for robotic tasks and an automaton-guided, safe reinforcement learning algorithm based on control barrier functions. Although the proposed framework is quite general, we motivate it and illustrate it experimentally with a robotic cooking task, in which two manipulators worked together to make hot dogs.
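The two ingredients named above, an automaton that turns a temporal specification into rewards and a control barrier function that filters unsafe actions, can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: the one-dimensional integrator dynamics, the regions A and B, the bounds `X_MIN` and `X_MAX`, the gain `GAMMA`, and all function names are hypothetical. The task is "reach A, then reach B", and the safety filter enforces the discrete-time barrier condition h(x_next) >= (1 - GAMMA) * h(x) for both bounds.

```python
# Toy sketch (all names and constants are illustrative assumptions).
DT = 0.1                    # time step of the integrator x_next = x + u * DT
GAMMA = 0.5                 # barrier decay rate, 0 < GAMMA <= 1
X_MIN, X_MAX = -3.0, 3.0    # safe set: X_MIN <= x <= X_MAX


def automaton_step(q: int, x: float) -> int:
    """Task automaton for 'reach A (x >= 2), then reach B (x <= -2)'.

    States: 0 = nothing done, 1 = A visited, 2 = accepting.
    """
    if q == 0 and x >= 2.0:
        return 1
    if q == 1 and x <= -2.0:
        return 2
    return q


def automaton_reward(q: int, q_next: int) -> float:
    """Reward 1 whenever the automaton advances, else 0."""
    return 1.0 if q_next > q else 0.0


def cbf_filter(x: float, u: float) -> float:
    """Return the action closest to u that satisfies both barrier conditions.

    With h_up(x) = X_MAX - x and h_lo(x) = x - X_MIN, the conditions
    h(x + u*DT) >= (1 - GAMMA) * h(x) reduce to an interval for u.
    """
    u_hi = GAMMA * (X_MAX - x) / DT
    u_lo = -GAMMA * (x - X_MIN) / DT
    return max(u_lo, min(u, u_hi))


def task_policy(q: int, x: float) -> float:
    """Hand-written stand-in for a learned policy: head for the next region."""
    if q == 0:
        return 5.0
    if q == 1:
        return -5.0
    return 0.0


def run_episode(policy, steps: int, x0: float = 0.0):
    """Roll out a policy through the safety filter and the task automaton."""
    x, q, total = x0, 0, 0.0
    xs = [x]
    for _ in range(steps):
        u = cbf_filter(x, policy(q, x))   # safety layer overrides unsafe actions
        x = x + u * DT
        q_next = automaton_step(q, x)
        total += automaton_reward(q, q_next)
        q = q_next
        xs.append(x)
    return q, total, xs


if __name__ == "__main__":
    q, ret, xs = run_episode(task_policy, 50)
    print("final automaton state:", q, "return:", ret)
    # A deliberately unsafe policy is still kept inside the safe set.
    _, _, xs_unsafe = run_episode(lambda q, x: 50.0, 100)
    print("max position under unsafe policy:", max(xs_unsafe))
```

In this sketch the reward is readable directly from the automaton, and safety holds regardless of what the policy proposes, because the filter is applied to every action before it reaches the system.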
