Tree-based Fitted Q-iteration for Multi-Objective Markov Decision Problems

This paper is about solving multi-objective control problems using a model-free, batch-mode reinforcement-learning approach. Although many real-world applications involve several conflicting objectives, the reinforcement-learning (RL) literature has focused mainly on single-objective control problems. As a consequence, in the presence of multiple objectives, the usual approach is to solve many single-objective control problems (resulting from different combinations of the original objectives), each with standard RL techniques. The algorithm proposed in this paper is an extension of Fitted Q-iteration (FQI) that learns, in a single training process, the control policies for all linear combinations of preferences (weights) assigned to the objectives. The key idea of multi-objective FQI (MOFQI) is to extend the continuous approximation of the action-value function, which single-objective FQI performs over the state-action space, to the weight space as well. The approach is demonstrated on a real-world application of particular interest for multi-objective RL algorithms: the optimal operation of a multi-purpose water reservoir.
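The following sketch illustrates the key idea under stated assumptions: a batch of transitions with vector-valued rewards, a finite action set, and a tree-based regressor (here scikit-learn's ExtraTreesRegressor) fitted over the joint state-action-weight space. The weight-sampling scheme, hyper-parameters, and function names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal MOFQI sketch, assuming discrete actions and a batch (s, a, r, s')
# with k-dimensional reward vectors. Not the authors' implementation.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor  # tree-based approximator, as in single-objective FQI


def mofqi(states, actions, rewards, next_states, action_set,
          n_weights=20, gamma=0.99, n_iterations=50):
    """Learn an approximation of Q(s, a, w) over the state-action-weight space.

    states:      (N, ds) visited states
    actions:     (N,)    applied discrete actions
    rewards:     (N, k)  k-dimensional reward vectors
    next_states: (N, ds) successor states
    action_set:  list of admissible discrete actions
    """
    n, k = rewards.shape
    # Sample preference vectors on the simplex; each transition is replicated once
    # per sampled weight so the regression also covers the weight space.
    weights = np.random.dirichlet(np.ones(k), size=n_weights)     # (n_weights, k)
    w_rep = np.repeat(weights, n, axis=0)                         # (n_weights*n, k)
    s_rep = np.tile(states, (n_weights, 1))
    a_rep = np.tile(actions.reshape(-1, 1), (n_weights, 1))
    s_next_rep = np.tile(next_states, (n_weights, 1))
    # Scalarised reward for each (transition, weight) pair: r_w = w . r
    r_w = np.einsum('ij,ij->i', np.tile(rewards, (n_weights, 1)), w_rep)

    X = np.hstack([s_rep, a_rep, w_rep])   # regression inputs: (s, a, w)
    y = r_w.copy()                         # first target: the scalarised reward
    model = None
    for _ in range(n_iterations):
        model = ExtraTreesRegressor(n_estimators=50).fit(X, y)
        # Bellman backup: max over candidate actions of the current Q at (s', a', w)
        q_next = np.column_stack([
            model.predict(np.hstack([s_next_rep,
                                     np.full((len(s_next_rep), 1), a_prime),
                                     w_rep]))
            for a_prime in action_set
        ])
        y = r_w + gamma * q_next.max(axis=1)
    return model
```

After training, the greedy policy for any preference vector w is obtained by evaluating the learned model at (state, a, w) for each admissible action a and picking the maximiser, without retraining for a new w.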
