Deep Constrained Q-learning

In many real-world applications, reinforcement learning agents have to optimize multiple objectives while following certain rules or satisfying a list of constraints. Classical methods based on reward shaping, i.e., a weighted combination of the different objectives in the reward signal, or Lagrangian methods, which include the constraints in the loss function, offer no guarantee that the agent satisfies the constraints at all points in time and can lead to undesired behavior. When a discrete policy is extracted from an action-value function, safe actions can be ensured by restricting the action space at maximization, but this can yield sub-optimal solutions among the feasible alternatives. In this work, we propose Constrained Q-learning, a novel off-policy reinforcement learning framework that restricts the action space directly in the Q-update to learn the optimal Q-function for the induced constrained MDP and the corresponding safe policy. In addition to single-step constraints referring only to the next action, we introduce a formulation for approximate multi-step constraints under the current target policy based on truncated value functions. We analyze the advantages of Constrained Q-learning in the tabular case and compare Constrained DQN to reward shaping and Lagrangian methods in the application of high-level decision making in autonomous driving, considering constraints for safety, keeping right, and comfort. We train our agent in the open-source simulator SUMO and on the real highD data set.
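To make the core mechanism concrete, below is a minimal tabular sketch of a constrained Q-update in which the maximization in the target is taken only over actions satisfying a single-step constraint. The is_safe predicate, the (state, action)-keyed Q-table, and the hyperparameters are illustrative assumptions for this sketch, not the paper's implementation.

    from collections import defaultdict

    def constrained_q_update(Q, s, a, r, s_next, actions, is_safe,
                             alpha=0.1, gamma=0.99):
        """One step of a constrained Q-update: the max in the target is
        restricted to the safe subset of the action space, so the learned
        Q-function corresponds to the induced constrained MDP."""
        # Hypothetical predicate encoding the single-step constraints.
        safe_actions = [a2 for a2 in actions if is_safe(s_next, a2)]
        # This sketch assumes at least one safe action exists in every
        # state; handling infeasible states is out of scope here.
        target = r + gamma * max(Q[(s_next, a2)] for a2 in safe_actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    # Hypothetical usage: a Q-table mapping (state, action) pairs to values.
    Q = defaultdict(float)

Multi-step constraints under the current target policy would replace is_safe with an estimate from a truncated value function, as described above; that extension is omitted from this sketch.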
