Multi-objective Reinforcement Learning for the Expected Utility of the Return

Real-world decision problems often have multiple, possibly conflicting, objectives. In multi-objective reinforcement learning, the effects of actions in terms of these objectives must be learned by interacting with an environment. Typically, multi-objective reinforcement learning algorithms optimise the utility of the expected value of the returns. This implies the underlying assumption that it is indeed the expected value of the returns (i.e., the average return over many runs) that matters to the user. However, this is not always the case. For example, in a medical treatment setting, only the return of a single run matters to the patient. This return is expressed in terms of multiple objectives, such as maximising the probability of a full recovery and minimising the severity of side-effects. The utility of such a vector-valued return is often a non-linear combination of the returns in the individual objectives. In such cases, we should thus optimise the expected value of the utility of the returns, rather than the utility of the expected value of the returns. In this paper, we propose a novel method to do so, based on policy gradients, and show empirically that our method is key to learning good policies with respect to the expected value of the utility of the returns.
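To make the distinction concrete (in notation of our own choosing, not taken from the paper): let $\mathbf{R} = \sum_t \mathbf{r}_t$ be the vector-valued return of a single episode under policy $\pi$, and let $u$ be the (possibly non-linear) utility function. The two criteria differ only in where the expectation over episodes is taken:

$$
u\big(\mathbb{E}_\pi[\mathbf{R}]\big) \;\; \text{(utility of the expected returns)}
\qquad \text{vs.} \qquad
\mathbb{E}_\pi\big[u(\mathbf{R})\big] \;\; \text{(expected utility of the returns)}.
$$

For non-linear $u$ the two generally differ (by Jensen's inequality, for instance, they cannot coincide for all return distributions when $u$ is strictly concave), so a policy that is optimal under one criterion need not be optimal under the other.

As a minimal sketch of how a policy-gradient method can target the second criterion: the standard score-function (REINFORCE) estimator still applies when the episode return is passed through $u$ before weighting the gradient, since $\nabla_\theta \mathbb{E}[u(\mathbf{R})] = \mathbb{E}\big[u(\mathbf{R}) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]$. The `utility` and `grad_log_prob` functions below are hypothetical placeholders, and the sketch applies this generic estimator rather than reproducing the paper's exact algorithm.

```python
import numpy as np

def utility(ret_vec):
    # Hypothetical non-linear utility over a two-objective return
    # (recovery, side_effects): recovery only counts when the severity
    # of side-effects stays below a tolerance threshold.
    recovery, side_effects = ret_vec
    return recovery if side_effects < 1.0 else 0.0

def reinforce_utility_grad(trajectory, grad_log_prob):
    """One Monte-Carlo sample of grad E[u(R)].

    trajectory    -- list of (state, action, reward_vector) tuples for
                     one episode
    grad_log_prob -- callable (state, action) -> gradient of
                     log pi(a|s) w.r.t. the policy parameters
    """
    # Vector-valued return of this single episode.
    R = np.sum([r for (_, _, r) in trajectory], axis=0)
    # Score function summed over the whole trajectory.
    score = sum(grad_log_prob(s, a) for (s, a, _) in trajectory)
    # u is applied to the episode return *before* averaging across
    # episodes, so this estimates grad E[u(R)], not grad u(E[R]).
    return utility(R) * score
```

The point the sketch illustrates is that $u$ must be applied per episode, before averaging across episodes; applying it to the average return instead would recover the first criterion.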
