Balancing Constraints and Rewards with Meta-Gradient D4PG

Deploying Reinforcement Learning (RL) agents to solve real-world applications often requires satisfying complex system constraints. Often the constraint thresholds are incorrectly set due to the complex nature of a system or the inability to verify the thresholds offline (e.g, no simulator or reasonable offline evaluation procedure exists). This results in solutions where a task cannot be solved without violating the constraints. However, in many real-world cases, constraint violations are undesirable yet they are not catastrophic, motivating the need for soft-constrained RL approaches. We present two soft-constrained RL approaches that utilize meta-gradients to find a good trade-off between expected return and minimizing constraint violations. We demonstrate the effectiveness of these approaches by showing that they consistently outperform the baselines across four different Mujoco domains.

[1]  Satinder Singh,et al.  On Learning Intrinsic Rewards for Policy Gradient Methods , 2018, NeurIPS.

[2]  Alejandro Ribeiro,et al.  Constrained Reinforcement Learning Has Zero Duality Gap , 2019, NeurIPS.

[3]  Demis Hassabis,et al.  Mastering the game of Go without human knowledge , 2017, Nature.

[4]  Raia Hadsell,et al.  Value constrained model-free continuous control , 2019, ArXiv.

[5]  Gabriel Dulac-Arnold,et al.  Challenges of Real-World Reinforcement Learning , 2019, ArXiv.

[6]  Shie Mannor,et al.  Exploration-Exploitation in Constrained MDPs , 2020, ArXiv.

[7]  E. Altman Constrained Markov Decision Processes , 1999 .

[8]  Matthew E. Taylor,et al.  Metatrace Actor-Critic: Online Step-Size Tuning by Meta-gradient Descent for Reinforcement Learning Control , 2018, IJCAI.

[9]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[10]  Shie Mannor,et al.  Reward Constrained Policy Optimization , 2018, ICLR.

[11]  Yuval Tassa,et al.  Relative Entropy Regularized Policy Iteration , 2018, ArXiv.

[12]  Paolo Frasconi,et al.  Forward and Reverse Gradient-Based Hyperparameter Optimization , 2017, ICML.

[13]  Kalyanmoy Deb,et al.  A Review on Bilevel Optimization: From Classical to Evolutionary Approaches and Applications , 2017, IEEE Transactions on Evolutionary Computation.

[14]  Bruno Castro da Silva,et al.  On Ensuring that Intelligent Machines Are Well-Behaved , 2017, ArXiv.

[15]  Hongxia Jin,et al.  Reward Constrained Interactive Recommendation with Natural Language Feedback , 2020, ArXiv.

[16]  Joelle Pineau,et al.  Benchmarking Batch Deep Reinforcement Learning Algorithms , 2019, ArXiv.

[17]  Dario Amodei,et al.  Benchmarking Safe Exploration in Deep Reinforcement Learning , 2019 .

[18]  David Silver,et al.  Meta-Gradient Reinforcement Learning , 2018, NeurIPS.

[19]  Pieter Abbeel,et al.  Constrained Policy Optimization , 2017, ICML.

[20]  Matthew W. Hoffman,et al.  Distributed Distributional Deterministic Policy Gradients , 2018, ICLR.

[21]  Shie Mannor,et al.  A Deep Hierarchical Approach to Lifelong Learning in Minecraft , 2016, AAAI.

[22]  Marc G. Bellemare,et al.  A Distributional Perspective on Reinforcement Learning , 2017, ICML.

[23]  Richard L. Lewis,et al.  Discovery of Useful Questions as Auxiliary Tasks , 2019, NeurIPS.

[24]  Yuval Tassa,et al.  DeepMind Control Suite , 2018, ArXiv.

[25]  Joelle Pineau,et al.  Constrained Markov Decision Processes via Backward Value Functions , 2020, ICML.

[26]  Yuval Tassa,et al.  Continuous control with deep reinforcement learning , 2015, ICLR.

[27]  Junhyuk Oh,et al.  Self-Tuning Deep Reinforcement Learning , 2020, ArXiv.

[28]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[29]  Nir Levine,et al.  An empirical investigation of the challenges of real-world reinforcement learning , 2020, ArXiv.