Safe Driving via Expert Guided Policy Optimization

When learning common skills like driving, beginners usually have experienced people or domain experts standing by to ensure the safety of the learning process. We formulate such a learning scheme as Expert-in-the-loop Reinforcement Learning (ERL), where a guardian is introduced to safeguard the exploration of the learning agent. While allowing sufficient exploration in the uncertain environment, the guardian intervenes in dangerous situations and demonstrates the correct actions to avoid potential accidents. ERL thus provides two sources of training data: the agent's own exploration and the expert's partial demonstrations. Following this new setting, we develop a novel Expert Guided Policy Optimization (EGPO) method. EGPO integrates the guardian into the reinforcement learning loop; the guardian is composed of an expert policy that generates demonstrations and a switch function that decides when to intervene. In particular, a constrained optimization technique is used to rule out the trivial solution in which the agent deliberately behaves dangerously to deceive the expert into taking over all the time. An offline RL technique is further used to learn from the partial demonstrations generated by the expert. Safe-driving experiments show that our method achieves superior training- and test-time safety, outperforms baselines by a large margin in sample efficiency, and preserves the generalization capacity to unseen environments at test time.
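As a rough illustration of the guardian-in-the-loop interaction described above, the sketch below wraps a single environment step with a switch function and an expert takeover. The callables `agent_policy`, `expert_policy`, and `takeover_fn`, the replay `buffer`, and the fixed `intervention_penalty` coefficient (standing in for the constrained-optimization machinery) are illustrative assumptions, not the paper's implementation.

```python
def guarded_step(env, state, agent_policy, expert_policy, takeover_fn,
                 buffer, intervention_penalty):
    """One environment interaction with an expert guardian in the loop.

    `takeover_fn(state, action) -> bool` plays the role of the switch
    function: it returns True when the agent's proposed action is judged
    unsafe and the expert should take over.
    """
    agent_action = agent_policy(state)
    if takeover_fn(state, agent_action):
        # Guardian intervenes: the expert's action is executed and stored as a
        # partial demonstration, later consumed by the offline-RL-style update.
        action, intervened = expert_policy(state), True
    else:
        action, intervened = agent_action, False

    next_state, reward, done, info = env.step(action)

    # Constrained optimization in spirit: interventions incur a cost so the
    # agent cannot trivially misbehave to make the expert drive for it.
    # `intervention_penalty` acts like a Lagrange multiplier on the
    # intervention-frequency constraint.
    shaped_reward = reward - intervention_penalty * float(intervened)

    buffer.append((state, action, shaped_reward, next_state, done, intervened))
    return next_state, done
```

Note that the executed expert action, rather than the agent's rejected proposal, is what gets stored: in EGPO the transitions collected during takeover serve as the partial demonstrations that the offline RL component learns from.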
