Safety Augmented Value Estimation From Demonstrations (SAVED): Safe Deep Model-Based RL for Sparse Cost Robotic Tasks

Reinforcement learning (RL) for robotics is challenging due to the difficulty of hand-engineering a dense cost function, which can lead to unintended behavior, and due to dynamical uncertainty, which makes exploration and constraint satisfaction difficult. We address these issues with a new model-based reinforcement learning algorithm, Safety Augmented Value Estimation from Demonstrations (SAVED), which uses supervision that only identifies task completion, together with a modest set of suboptimal demonstrations, to constrain exploration and learn efficiently while handling complex constraints. We compare SAVED with 3 state-of-the-art model-based and model-free RL algorithms on 6 standard simulation benchmarks involving navigation and manipulation, and on a physical knot-tying task on the da Vinci surgical robot. Results suggest that SAVED outperforms prior methods in terms of success rate, constraint satisfaction, and sample efficiency, making it feasible to safely learn a control policy directly on a real robot in less than an hour. For tasks on the physical robot, baselines succeed less than 5% of the time, while SAVED achieves a success rate of over 75% in the first 50 training iterations. Code and supplementary material are available at https://tinyurl.com/saved-rl.
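
Since the abstract only summarizes the algorithm, the sketch below illustrates the core idea as we read it: sampling-based model-predictive control over a learned dynamics model, with a sparse task-completion cost, a value function estimated from suboptimal demonstrations, and a constraint that planned trajectories terminate in a "safe set" of states from which the task has previously been completed. This is a minimal illustrative sketch, not the authors' implementation; every name and stub here (EnsembleDynamics, sparse_cost, value_estimate, in_safe_set, plan_action) is a hypothetical placeholder.

```python
import numpy as np


class EnsembleDynamics:
    """Stub standing in for a learned probabilistic ensemble dynamics model."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def predict(self, state, action):
        # Placeholder transition; a real model would sample a learned ensemble member.
        return state + 0.1 * action + 0.01 * self.rng.standard_normal(state.shape)


def sparse_cost(state, goal, tol=0.1):
    """Sparse cost that only identifies task completion: 1 until within tol of goal."""
    return float(np.linalg.norm(state - goal) > tol)


def value_estimate(state, goal):
    # Placeholder for a value function fit to demonstration trajectories;
    # a plain distance heuristic is used so the sketch runs end to end.
    return float(np.linalg.norm(state - goal))


def in_safe_set(state, safe_states, radius=0.5):
    """Membership test against states drawn from demos and prior successes."""
    return bool(np.min(np.linalg.norm(safe_states - state, axis=1)) < radius)


def plan_action(state, goal, dynamics, safe_states,
                horizon=10, n_samples=128, action_dim=2):
    """One step of sampling-based MPC: among sampled action sequences whose
    predicted terminal state lands in the safe set, return the first action
    of the sequence minimizing summed sparse cost plus the terminal value."""
    rng = np.random.default_rng()
    best_cost, best_action = np.inf, np.zeros(action_dim)
    for _ in range(n_samples):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, cost = state.copy(), 0.0
        for a in actions:
            s = dynamics.predict(s, a)
            cost += sparse_cost(s, goal)
        if not in_safe_set(s, safe_states):
            continue  # reject plans that stray from previously successful regions
        cost += value_estimate(s, goal)  # value bootstraps cost beyond the horizon
        if cost < best_cost:
            best_cost, best_action = cost, actions[0]
    return best_action  # stays zero if no sampled plan satisfied the constraint


if __name__ == "__main__":
    dynamics = EnsembleDynamics()
    demo_states = np.random.default_rng(1).uniform(0.5, 1.5, size=(20, 2))
    action = plan_action(np.zeros(2), np.ones(2), dynamics, demo_states)
    print("first planned action:", action)
```

Rejecting sampled plans whose terminal state leaves the safe set is, on our reading of the abstract, how SAVED constrains exploration to regions where task completion is plausible; the actual method additionally enforces probabilistic state-space constraints and refits the dynamics, value, and safe-set models between training iterations.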
