Multi-objective fitted Q-iteration: Pareto frontier approximation in one single run

We present a novel batch-mode Reinforcement Learning approach for the design of optimal controllers in the presence of multiple objectives. The algorithm is an extension of Fitted Q-iteration (FQI) that enables the design of the controller for all linear combinations of preferences (weights) assigned to the objectives in a single run. The key idea of multi-objective FQI (MOFQI) is to extend the continuous approximation of the value function, which single-objective FQI performs over the state-control space, to the weight space as well. The batch-mode nature of the algorithm makes it possible to enrich the learning data at nearly no additional computational cost with respect to a single-objective formulation on the same system. The approach was tested on a case study concerning the optimal operation of a water reservoir with two objectives, where the MOFQI algorithm proved computationally preferable to repeatedly running FQI for different weight values whenever more than five points on the Pareto frontier are required.

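As a rough illustration of the idea (a sketch, not the authors' implementation), the snippet below augments a tree-based FQI loop with the preference weights: each experience tuple carries a vector of rewards, one per objective, is replicated for a set of sampled weight vectors, and a single regressor learns Q over the joint (state, control, weight) space. The function names, the discrete control set, and the use of scikit-learn's ExtraTreesRegressor are illustrative assumptions.

```python
# Minimal MOFQI-style sketch (assumptions: finite control set, tree-based
# regressor from scikit-learn, scalarisation by weighted sum of rewards).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def mofqi(samples, controls, weights, gamma=0.99, iterations=30):
    """samples:  list of (x, u, x_next, r_vec), r_vec = one reward per objective.
    controls: discrete set of candidate controls.
    weights:  weight vectors spanning the preference simplex, e.g. [(w, 1-w), ...].
    Returns a regressor approximating Q(x, u, w)."""
    # Replicate each experience tuple for every weight vector: the learning
    # set grows, but the system is simulated/observed only once (batch mode).
    X, R, XN = [], [], []
    for (x, u, x_next, r_vec) in samples:
        for w in weights:
            X.append(np.concatenate([np.atleast_1d(x), np.atleast_1d(u),
                                     np.atleast_1d(w)]))
            R.append(np.dot(w, r_vec))            # scalarised reward w . r
            XN.append((x_next, w))
    X = np.array(X)
    R = np.array(R)

    Q = None
    for _ in range(iterations):
        if Q is None:
            target = R                            # Q_1 = immediate reward
        else:
            # Bellman backup: r_w + gamma * max_{u'} Q_k(x', u', w)
            best = np.full(len(XN), -np.inf)
            for u_next in controls:
                Xq = np.array([np.concatenate([np.atleast_1d(xn),
                                               np.atleast_1d(u_next),
                                               np.atleast_1d(w)])
                               for (xn, w) in XN])
                best = np.maximum(best, Q.predict(Xq))
            target = R + gamma * best
        Q = ExtraTreesRegressor(n_estimators=50).fit(X, target)
    return Q
```

Once trained, the single regressor can be queried at any desired weight vector w to obtain the corresponding greedy control, i.e. one point of the approximate Pareto frontier, without re-running the learning loop for each preference.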