A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation

We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, designed to enable few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model that produces optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After an initial learning phase, our agent can execute the optimal policy under any given preference, or automatically infer an underlying preference from very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.
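To make the abstract's key idea concrete, the following is a minimal sketch of a preference-conditioned Bellman optimality operator of the kind described above; the precise operator in the paper may differ, and the notation here (state $s$, action $a$, vector-valued reward $\mathbf{r}$, linear preference vector $\boldsymbol{\omega}$, vector-valued Q-function $\mathbf{Q}$) is our assumption for illustration:

$$(\mathcal{T}\mathbf{Q})(s, a, \boldsymbol{\omega}) = \mathbf{r}(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[\mathbf{Q}(s', a^{*}, \boldsymbol{\omega})\big], \qquad a^{*} = \arg\max_{a'}\, \boldsymbol{\omega}^{\top}\mathbf{Q}(s', a', \boldsymbol{\omega}).$$

Because $\mathbf{Q}$ takes the preference $\boldsymbol{\omega}$ as an input and is scalarized only inside the $\arg\max$, a single parametric model can represent optimal value functions across the whole space of linear preferences; acting greedily with respect to $\boldsymbol{\omega}^{\top}\mathbf{Q}$ then recovers the optimal policy for any given $\boldsymbol{\omega}$.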
