Balancing Constraints and Rewards with Meta-Gradient D4PG

Deploying Reinforcement Learning (RL) agents in real-world applications often requires satisfying complex system constraints. The constraint thresholds are often set incorrectly, either because of the complex nature of the system or because the thresholds cannot be verified offline (e.g., no simulator or reasonable offline evaluation procedure exists). This results in solutions where a task cannot be solved without violating the constraints. However, in many real-world cases, constraint violations are undesirable yet not catastrophic, motivating the need for soft-constrained RL approaches. We present two soft-constrained RL approaches that utilize meta-gradients to find a good trade-off between maximizing expected return and minimizing constraint violations. We demonstrate the effectiveness of these approaches by showing that they consistently outperform the baselines across four different MuJoCo domains.

Junhyuk Oh | Nir Levine | Timothy A. Mann | Zhongwen Xu | Tom Zahavy | Daniel J. Mankowitz | Dan A. Calian

[1] Satinder Singh, et al. On Learning Intrinsic Rewards for Policy Gradient Methods, 2018, NeurIPS.

[2] Alejandro Ribeiro, et al. Constrained Reinforcement Learning Has Zero Duality Gap, 2019, NeurIPS.

[3] Demis Hassabis, et al. Mastering the game of Go without human knowledge, 2017, Nature.

[4] Raia Hadsell, et al. Value constrained model-free continuous control, 2019, ArXiv.

[5] Gabriel Dulac-Arnold, et al. Challenges of Real-World Reinforcement Learning, 2019, ArXiv.

[6] Shie Mannor, et al. Exploration-Exploitation in Constrained MDPs, 2020, ArXiv.

[7] E. Altman. Constrained Markov Decision Processes, 1999.

[8] Matthew E. Taylor, et al. Metatrace Actor-Critic: Online Step-Size Tuning by Meta-gradient Descent for Reinforcement Learning Control, 2018, IJCAI.

[9] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[10] Shie Mannor, et al. Reward Constrained Policy Optimization, 2018, ICLR.

[11] Yuval Tassa, et al. Relative Entropy Regularized Policy Iteration, 2018, ArXiv.

[12] Paolo Frasconi, et al. Forward and Reverse Gradient-Based Hyperparameter Optimization, 2017, ICML.

[13] Kalyanmoy Deb, et al. A Review on Bilevel Optimization: From Classical to Evolutionary Approaches and Applications, 2017, IEEE Transactions on Evolutionary Computation.

[14] Bruno Castro da Silva, et al. On Ensuring that Intelligent Machines Are Well-Behaved, 2017, ArXiv.

[15] Hongxia Jin, et al. Reward Constrained Interactive Recommendation with Natural Language Feedback, 2020, ArXiv.

[16] Joelle Pineau, et al. Benchmarking Batch Deep Reinforcement Learning Algorithms, 2019, ArXiv.

[17] Dario Amodei, et al. Benchmarking Safe Exploration in Deep Reinforcement Learning, 2019.

[18] David Silver, et al. Meta-Gradient Reinforcement Learning, 2018, NeurIPS.

[19] Pieter Abbeel, et al. Constrained Policy Optimization, 2017, ICML.

[20] Matthew W. Hoffman, et al. Distributed Distributional Deterministic Policy Gradients, 2018, ICLR.

[21] Shie Mannor, et al. A Deep Hierarchical Approach to Lifelong Learning in Minecraft, 2016, AAAI.

[22] Marc G. Bellemare, et al. A Distributional Perspective on Reinforcement Learning, 2017, ICML.

[23] Richard L. Lewis, et al. Discovery of Useful Questions as Auxiliary Tasks, 2019, NeurIPS.

[24] Yuval Tassa, et al. DeepMind Control Suite, 2018, ArXiv.

[25] Joelle Pineau, et al. Constrained Markov Decision Processes via Backward Value Functions, 2020, ICML.

[26] Yuval Tassa, et al. Continuous control with deep reinforcement learning, 2015, ICLR.

[27] Junhyuk Oh, et al. Self-Tuning Deep Reinforcement Learning, 2020, ArXiv.

[28] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[29] Nir Levine, et al. An empirical investigation of the challenges of real-world reinforcement learning, 2020, ArXiv.