Interpretable Multi Time-scale Constraints in Model-free Deep Reinforcement Learning for Autonomous Driving

In many real-world applications, reinforcement learning agents have to optimize multiple objectives while following certain rules or satisfying a list of constraints. Classical methods based on reward shaping, i.e., a weighted combination of different objectives in the reward signal, or on Lagrangian methods, which include the constraints in the loss function, give no guarantee that the agent satisfies the constraints at all points in time, and they lack interpretability. When a discrete policy is extracted from an action-value function, safe actions can be ensured by restricting the action space at maximization, but this can lead to sub-optimal solutions among the feasible alternatives. In this work, we propose Multi Time-scale Constrained DQN, a novel algorithm that restricts the action space directly in the Q-update in order to learn the optimal Q-function for the constrained MDP and the corresponding safe policy. In addition to single-step constraints, which refer only to the next action, we introduce a formulation for approximate multi-step constraints under the current target policy, based on truncated value functions, to enhance interpretability. We compare our algorithm to reward shaping and Lagrangian methods in the application of high-level decision making in autonomous driving, considering constraints for safety, keeping right, and comfort. We train our agent in the open-source simulator SUMO and on the real HighD data set.
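The central idea, restricting the action space inside the Q-update rather than only at action selection, can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the abstract describes a DQN variant, whereas the sketch below uses a plain Q-table, and the names `constrained_q_update`, `safe_greedy_action`, and `safe` (a hypothetical mapping from each state to its list of permitted actions) are assumptions made for illustration. It covers only single-step constraints; the multi-step formulation based on truncated value functions is not shown.

```python
def constrained_q_update(Q, s, a, r, s_next, safe, alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning step whose bootstrap target maximizes
    only over the actions marked safe in the next state.

    Q    : list of lists, Q[state][action]
    safe : list of lists, safe[state] = indices of permitted actions (assumed given)
    """
    if done:
        target = r
    else:
        # Assumption: if no action is marked safe, fall back to all actions.
        allowed = safe[s_next] or range(len(Q[s_next]))
        # The max runs over the restricted action set, not the full one.
        target = r + gamma * max(Q[s_next][b] for b in allowed)
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]


def safe_greedy_action(Q, s, safe):
    """Greedy action among the permitted actions of state s."""
    allowed = safe[s] or range(len(Q[s]))
    return max(allowed, key=lambda b: Q[s][b])


# Usage: action 1 has the highest value in state 1 but is not permitted there,
# so both the update target and the greedy policy ignore it.
Q = [[0.0, 0.0, 0.0], [1.0, 5.0, 2.0]]
safe = [[0, 1, 2], [0, 2]]
new_value = constrained_q_update(Q, s=0, a=0, r=1.0, s_next=1, safe=safe,
                                 alpha=0.5, gamma=0.9)
chosen = safe_greedy_action(Q, 1, safe)
```

Because the unsafe action is excluded from the target as well, the learned values reflect returns achievable under the constrained policy, which is the distinction the abstract draws against masking actions only at maximization time.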