A temporal difference method for multi-objective reinforcement learning

Abstract This work describes MPQ-learning, an algorithm that approximates the set of all deterministic non-dominated policies in multi-objective Markov decision problems, where rewards are vectors and each component stands for an objective to maximize. MPQ-learning directly generalizes the ideas of Q-learning to the multi-objective case. It can be applied to non-convex Pareto frontiers and finds both supported and unsupported solutions. We present the results of applying MPQ-learning to several benchmark problems. The algorithm solves these problems successfully, showing the feasibility of this approach. We also compare MPQ-learning to a standard linearization procedure that computes only supported solutions, and show that in some cases MPQ-learning can be as effective as the scalarization method.
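To make the generalization concrete, the sketch below shows the two ingredients such a multi-objective extension of Q-learning rests on: a Pareto non-dominance filter over vector-valued estimates, and a backup that stores a set of non-dominated reward vectors per state-action pair instead of a single scalar Q-value. All names here (pareto_filter, mo_backup, q_sets) and the deterministic full-backup form are illustrative assumptions, not the paper's actual MPQ-learning update, which works incrementally with a learning rate and additional bookkeeping.

```python
import numpy as np

def pareto_filter(vectors):
    """Keep only the non-dominated vectors (maximization).

    A vector u is dominated if some other vector v satisfies
    v >= u componentwise with at least one strict inequality.
    """
    nd = []
    for i, u in enumerate(vectors):
        dominated = any(
            np.all(v >= u) and np.any(v > u)
            for j, v in enumerate(vectors)
            if j != i
        )
        if not dominated:
            nd.append(u)
    return nd

def mo_backup(q_sets, state, action, reward, next_state, gamma=0.95):
    """Hypothetical multi-objective backup, assuming deterministic transitions.

    q_sets maps state -> action -> list of vector-valued estimates.
    The backup combines the immediate vector reward with every discounted
    non-dominated estimate reachable from the successor state, then prunes
    the resulting candidate set back to its Pareto-optimal members.
    """
    successors = [v for a in q_sets[next_state] for v in q_sets[next_state][a]]
    nd_next = pareto_filter(successors) or [np.zeros_like(np.asarray(reward))]
    candidates = [np.asarray(reward) + gamma * v for v in nd_next]
    q_sets[state][action] = pareto_filter(candidates)

if __name__ == "__main__":
    # Tiny two-state, two-objective illustration (values are made up).
    q_sets = {
        "s0": {"a0": [], "a1": []},
        "s1": {"a0": [np.array([1.0, 0.0])], "a1": [np.array([0.0, 1.0])]},
    }
    mo_backup(q_sets, "s0", "a0", np.array([0.1, 0.1]), "s1")
    print(q_sets["s0"]["a0"])  # both trade-off vectors survive the filter
```

The key contrast with the linearization baseline mentioned above: a scalarized learner collapses the reward to a weighted sum and so can only recover solutions on the convex hull of the Pareto front (the supported ones), whereas keeping whole sets of non-dominated vectors, as sketched here, also preserves unsupported solutions on non-convex frontiers.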
