On Multi-objective Policy Optimization as a Tool for Reinforcement Learning

Many advances that have improved the robustness and efficiency of deep reinforcement learning (RL) algorithms can, in one way or another, be understood as introducing additional objectives or constraints into the policy optimization step. These include ideas as far-ranging as exploration bonuses, entropy regularization, and regularization toward teachers or data priors when learning from experts or in offline RL. Often, the task reward and the auxiliary objectives conflict, and it is therefore natural to treat these examples as instances of multi-objective (MO) optimization problems. We study the principles underlying multi-objective RL (MORL) and introduce a new algorithm, Distillation of a Mixture of Experts (DiME), which is intuitive and scale-invariant under some conditions. We highlight its strengths on standard MO benchmark problems and consider case studies in which we recast offline RL and learning from experts as MO problems. This leads to a natural algorithmic formulation that sheds light on the connections between existing approaches. For offline RL, we use the MO perspective to derive a simple algorithm that optimizes the standard RL objective plus a behavioral cloning term. It outperforms the state of the art on two established offline RL benchmarks.
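
To make the offline RL case concrete, the sketch below shows one way a scalarized objective combining a standard RL term (maximize the critic's value of the policy's actions) with a behavioral cloning term (maximize the log-likelihood of dataset actions) could look. This is a minimal toy illustration under our own assumptions, not the paper's implementation: the unit-variance Gaussian policy, the stand-in critic `toy_q`, and the trade-off weight `bc_weight` are all hypothetical.

```python
# Minimal sketch of a scalarized offline-RL objective:
#   loss = -( E[Q(s, pi(s))] + bc_weight * E[log pi(a|s)] )
# All names here (GaussianPolicy, toy_q, bc_weight) are illustrative only.

import math
import random

class GaussianPolicy:
    """Unit-variance Gaussian policy over a 1-D action, with mean = w * state."""
    def __init__(self, w=0.0):
        self.w = w

    def mean(self, s):
        return self.w * s

    def log_prob(self, s, a):
        # log N(a | mean(s), 1)
        return -0.5 * (a - self.mean(s)) ** 2 - 0.5 * math.log(2 * math.pi)

def toy_q(s, a):
    """Stand-in critic: prefers actions close to 2 * state."""
    return -(a - 2.0 * s) ** 2

def combined_loss(policy, batch, bc_weight=0.5):
    """Negated sum of the RL term and the weighted behavioral cloning term."""
    rl_term = sum(toy_q(s, policy.mean(s)) for s, _ in batch) / len(batch)
    bc_term = sum(policy.log_prob(s, a) for s, a in batch) / len(batch)
    return -(rl_term + bc_weight * bc_term)

# Toy offline dataset of (state, action) pairs from a behavior policy a ~ 1.5 * s.
random.seed(0)
dataset = [(s, 1.5 * s + 0.1 * random.gauss(0, 1)) for s in [0.5, 1.0, 1.5, 2.0]]

policy = GaussianPolicy(w=1.0)
print("combined loss:", combined_loss(policy, dataset))
```

In this toy form, `bc_weight` governs the trade-off between the two conflicting objectives: a larger value keeps the policy closer to the dataset's behavior, while the critic term pushes it toward actions with higher estimated value.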
