Model-based Reinforcement Learning from Signal Temporal Logic Specifications

Techniques based on Reinforcement Learning (RL) are increasingly used to design control policies for robotic systems. RL fundamentally relies on state-based reward functions to encode the desired behavior of the robot, and poorly designed reward functions are prone to exploitation by the learning agent, leading to behavior that is undesirable at best and critically dangerous at worst. At the same time, designing good reward functions for complex tasks is itself a challenging problem. In this paper, we propose expressing desired high-level robot behavior in a formal specification language, Signal Temporal Logic (STL), as an alternative to reward/cost functions. We use STL specifications in conjunction with model-based learning to design model predictive controllers that optimize the satisfaction of the STL specification over a finite time horizon. The proposed algorithm is empirically evaluated on simulated robotic systems, including a pick-and-place robotic arm and adaptive cruise control for autonomous vehicles.
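At the core of this approach is the standard quantitative (robustness) semantics of STL: a specification φ maps a trajectory ξ to a real number whose sign indicates whether φ is satisfied and whose magnitude measures the margin of satisfaction or violation. For a predicate μ of the form f(s) > 0, the usual definitions are

\rho(\mu, \xi, t) = f(\xi(t)), \qquad
\rho(\mathbf{F}_{[a,b]}\,\varphi, \xi, t) = \max_{t' \in [t+a,\, t+b]} \rho(\varphi, \xi, t'), \qquad
\rho(\mathbf{G}_{[a,b]}\,\varphi, \xi, t) = \min_{t' \in [t+a,\, t+b]} \rho(\varphi, \xi, t'),

with negation flipping the sign and conjunction taking the pointwise minimum. The following is a minimal Python sketch of the kind of planner the abstract describes: a cross-entropy-method (CEM) model predictive controller that rolls a learned dynamics model forward and selects the action sequence with the highest robustness for a simple "eventually reach the goal region" specification. The model, specification, and hyperparameters here are illustrative assumptions, not the authors' implementation.

import numpy as np

HORIZON = 10        # planning horizon H (time steps)
ACTION_DIM = 2
N_SAMPLES = 500     # candidate action sequences per CEM iteration
N_ELITES = 50       # top-scoring sequences used to refit the distribution
N_ITERS = 5         # CEM refinement iterations

def rollout(dynamics_model, state, actions):
    """Unroll the learned model s_{t+1} = f_hat(s_t, a_t) over an action sequence."""
    traj = [state]
    for a in actions:
        state = dynamics_model(state, a)
        traj.append(state)
    return np.array(traj)

def robustness(traj, goal, radius=0.1):
    """Robustness of 'eventually reach the goal region':
    rho(F_[0,H] (||s - goal|| < radius)) = max_t (radius - ||s_t - goal||)."""
    dists = np.linalg.norm(traj - goal, axis=-1)
    return np.max(radius - dists)

def cem_mpc(dynamics_model, state, goal):
    """Cross-entropy method over open-loop action sequences, scored by
    the STL robustness of the trajectory predicted by the learned model."""
    mean = np.zeros((HORIZON, ACTION_DIM))
    std = np.ones((HORIZON, ACTION_DIM))
    for _ in range(N_ITERS):
        samples = mean + std * np.random.randn(N_SAMPLES, HORIZON, ACTION_DIM)
        scores = np.array([robustness(rollout(dynamics_model, state, acts), goal)
                           for acts in samples])
        elites = samples[np.argsort(scores)[-N_ELITES:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6  # floor avoids collapse
    return mean[0]  # receding horizon: execute only the first action

# Example with a stand-in "learned" model (a single integrator):
f_hat = lambda s, a: s + 0.1 * a
first_action = cem_mpc(f_hat, state=np.zeros(2), goal=np.array([1.0, 1.0]))

In a receding-horizon loop, only the first action of the optimized sequence is executed, the observed transition is added to the dynamics model's training data, and planning repeats from the new state.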
