Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies

Many real-world problems involve the optimization of multiple, possibly conflicting objectives. Multi-objective reinforcement learning (MORL) is a generalization of standard reinforcement learning in which the scalar reward signal is extended to multiple feedback signals, in essence one for each objective. MORL is the process of learning policies that optimize multiple criteria simultaneously. In this paper, we present a novel temporal difference learning algorithm that integrates the Pareto dominance relation into a reinforcement learning approach. This multi-policy algorithm learns a set of Pareto dominating policies in a single run. We call the algorithm Pareto Q-learning; it is applicable in episodic environments with deterministic as well as stochastic transition functions. A crucial aspect of Pareto Q-learning is its updating mechanism, which bootstraps sets of Q-vectors. One of our main contributions in this paper is a mechanism that separates the expected immediate reward vector from the set of expected future discounted reward vectors. This decomposition allows us to update the sets and to exploit the learned policies consistently throughout the state space. To balance exploration and exploitation during learning, we also propose three set evaluation mechanisms. These mechanisms evaluate the sets of vectors so that standard action selection strategies, such as ε-greedy, can be applied. More precisely, they use multi-objective evaluation principles such as the hypervolume measure, the cardinality indicator and the Pareto dominance relation to select the most promising actions. We experimentally validate the algorithm on multiple environments with two and three objectives and demonstrate that Pareto Q-learning outperforms current state-of-the-art MORL algorithms with respect to the hypervolume of the obtained policies. We note that (1) Pareto Q-learning is able to learn the entire Pareto front under the usual assumption that each state-action pair is sufficiently sampled, while (2) not being biased by the shape of the Pareto front. Furthermore, (3) the set evaluation mechanisms provide indicative measures for local action selection, and (4) the learned policies can be retrieved throughout the state and action space.
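To make the decomposition concrete, the sketch below implements the two ingredients the abstract describes: the Pareto dominance relation with its non-dominated (ND) filter, and a per state-action structure that keeps the averaged immediate reward vector R̄(s, a) separate from the set of non-dominated future discounted reward vectors. This is a minimal illustration under our own assumptions, not the authors' reference implementation; the names (ParetoQ, q_set, update), the incremental averaging, and the discount value are illustrative choices.

```python
# Minimal sketch of Pareto Q-learning's set-based update (assumed names/API).
from collections import defaultdict
from itertools import chain

def dominates(u, v):
    """True if reward vector u Pareto dominates v (>= everywhere, > somewhere)."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def nondominated(vectors):
    """Keep only the Pareto non-dominated vectors of a set (the ND operator)."""
    vs = [tuple(v) for v in set(map(tuple, vectors))]
    return {v for v in vs if not any(dominates(u, v) for u in vs if u != v)}

class ParetoQ:
    """Per (s, a): a running average of the immediate reward vector, kept
    separate from the set of non-dominated future discounted reward vectors."""
    def __init__(self, n_objectives, gamma=0.95):
        self.n = n_objectives
        self.gamma = gamma
        self.r_avg = defaultdict(lambda: [0.0] * n_objectives)  # R̄(s, a)
        self.counts = defaultdict(int)
        self.nd_future = defaultdict(set)                       # ND(s, a)

    def q_set(self, s, a):
        """Q_set(s, a) = R̄(s, a) ⊕ γ·ND(s, a): add the averaged immediate
        reward vector to every discounted future vector (vector sum ⊕)."""
        r = self.r_avg[(s, a)]
        # Empty future set (e.g. terminal successor) collapses to {R̄(s, a)}.
        future = self.nd_future[(s, a)] or {tuple([0.0] * self.n)}
        return {tuple(ri + self.gamma * fi for ri, fi in zip(r, f)) for f in future}

    def update(self, s, a, reward, s_next, actions_next):
        # Incrementally average the observed immediate reward vector.
        self.counts[(s, a)] += 1
        c = self.counts[(s, a)]
        self.r_avg[(s, a)] = [ri + (x - ri) / c
                              for ri, x in zip(self.r_avg[(s, a)], reward)]
        # Bootstrap on the non-dominated union of the successor's Q-sets.
        successor = chain.from_iterable(self.q_set(s_next, a2) for a2 in actions_next)
        self.nd_future[(s, a)] = nondominated(successor)
```

Because the immediate reward average and the future sets are stored separately, a change in either can be propagated without discarding the other, which is what allows the learned policies to be followed consistently throughout the state space.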

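As a companion, here is a hedged sketch of one of the three set evaluation mechanisms: scoring each action's Q-set by its hypervolume with respect to a reference point, then feeding those scalar scores to standard ε-greedy action selection. The two-objective hypervolume routine, the ref_point argument, and the select_action signature are our assumptions, not the paper's API; the cardinality and Pareto dominance mechanisms would slot in by swapping the scoring function.

```python
# Sketch of hypervolume-based set evaluation for action selection (2 objectives).
import random

def hypervolume_2d(vectors, ref):
    """Area dominated by a set of 2-D maximization vectors above ref."""
    pts = sorted({(max(x, ref[0]), max(y, ref[1])) for x, y in vectors},
                 key=lambda p: (-p[0], -p[1]))   # sweep by decreasing x
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                           # point adds a new horizontal slab
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def select_action(pq, s, actions, ref_point, epsilon=0.1):
    """ε-greedy over hypervolume-evaluated Q-sets; pq is the ParetoQ sketch above.
    Usage: a = select_action(pq, s, actions=[0, 1], ref_point=(-1.0, -1.0))"""
    if random.random() < epsilon:
        return random.choice(actions)
    scores = {a: hypervolume_2d(pq.q_set(s, a), ref_point) for a in actions}
    return max(scores, key=scores.get)
```

The hypervolume is a natural choice here because it is Pareto-compliant and reduces each set of Q-vectors to a single scalar, so the usual scalar action selection machinery applies unchanged.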