A Constrained Multi-Objective Reinforcement Learning Framework

Many real-world problems, especially in robotics, require that reinforcement learning (RL) agents learn policies that not only maximize an environment reward but also satisfy constraints. We propose a high-level framework for solving such problems that treats the environment reward and the constraint costs as separate objectives, and learns what preference over objectives the policy should optimize in order to meet the constraints. We call this framework Learning Preferences and Policies in Parallel (LP3). By making different choices for how to learn the preference and how to optimize the policy given that preference, we can recover existing approaches (e.g., Lagrangian relaxation) and derive novel ones with better performance. One such algorithm learns a set of constraint-satisfying policies, which is useful when the exact constraint is not known a priori.
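To make the "preferences and policies in parallel" idea concrete, below is a minimal sketch of its Lagrangian-relaxation instantiation, which the abstract notes the framework recovers as a special case: the preference is a single Lagrange multiplier lam, updated by gradient ascent on the constraint violation, while the policy would maximize the scalarized return r - lam * c. All names (cost_limit, lambda_lr) and the toy cost model are illustrative assumptions, not the paper's actual LP3 algorithm.

```python
# Sketch: Lagrangian relaxation as one instantiation of learning a
# preference (here, a single multiplier lam) in parallel with the policy.
# The toy cost model below stands in for real policy optimization.

def scalarize(reward, cost, lam):
    """Fold reward and cost into one objective under preference lam."""
    return reward - lam * cost

def update_preference(lam, avg_cost, cost_limit, lambda_lr=0.05):
    """Dual gradient ascent: raise lam while the constraint is violated
    (avg_cost > cost_limit), lower it otherwise; lam stays >= 0."""
    return max(0.0, lam + lambda_lr * (avg_cost - cost_limit))

lam, cost_limit = 0.0, 0.25
for _ in range(2000):
    # Placeholder rollout statistics: assume the policy's average cost
    # shrinks as the cost objective is weighted more heavily.
    avg_reward, avg_cost = 1.0, 1.0 / (1.0 + lam)
    objective = scalarize(avg_reward, avg_cost, lam)  # what the policy would maximize
    lam = update_preference(lam, avg_cost, cost_limit)

print(f"lam = {lam:.2f}, avg_cost = {1.0 / (1.0 + lam):.3f} (limit {cost_limit})")
```

In LP3 terms, the lam update is the "learn the preference" step and the scalarized objective is what the policy-optimization step would maximize; richer choices for either step yield the other algorithms the abstract alludes to, such as learning a set of constraint-satisfying policies rather than a single multiplier.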
