Paper information - Computing Optimal Stationary Policies for Multi-Objective Markov Decision Processes

This paper describes a novel algorithm called CON-MODP for computing Pareto optimal policies for deterministic multi-objective sequential decision problems. CON-MODP is a value-iteration-based multi-objective dynamic programming algorithm that computes only stationary policies. We observe that, to guarantee convergence to the unique Pareto optimal set of deterministic stationary policies, the algorithm needs to perform a policy evaluation step on particular policies that are inconsistent in a single state that is being expanded. We prove that the algorithm converges to the Pareto optimal set of value functions and policies for deterministic infinite-horizon discounted multi-objective Markov decision processes. Experiments show that CON-MODP is much faster than previous multi-objective value iteration algorithms.
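To make the notion of a Pareto optimal set of deterministic stationary policies concrete, the following is a minimal brute-force sketch. It is not CON-MODP and makes no claim about how the paper's algorithm works internally: it simply enumerates every deterministic stationary policy of a tiny deterministic MDP, evaluates each one with iterative policy evaluation on vector-valued rewards, and keeps the policies whose value at a chosen start state is not Pareto-dominated. The toy MDP, the function names, the fixed number of evaluation sweeps, and the choice to compare policies at a single start state are all assumptions made for illustration.

```python
from itertools import product


def dominates(u, v):
    """True if vector u Pareto-dominates v: >= in every objective, > in at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(items):
    """Keep the (policy, value) pairs whose value no other value dominates."""
    return [(p, v) for p, v in items
            if not any(dominates(w, v) for _, w in items)]


def evaluate(policy, next_state, reward, gamma, sweeps=200):
    """Iterative policy evaluation for a deterministic MDP with vector rewards."""
    n_obj = len(next(iter(reward.values())))
    V = {s: (0.0,) * n_obj for s in policy}
    for _ in range(sweeps):
        # Synchronous backup: V(s) = r(s, pi(s)) + gamma * V(next(s, pi(s)))
        V = {s: tuple(r + gamma * v
                      for r, v in zip(reward[s, policy[s]],
                                      V[next_state[s, policy[s]]]))
             for s in policy}
    return V


def pareto_stationary_policies(states, actions, next_state, reward, gamma, start):
    """Brute force: evaluate every deterministic stationary policy, then prune."""
    candidates = []
    for choice in product(actions, repeat=len(states)):
        policy = dict(zip(states, choice))
        V = evaluate(policy, next_state, reward, gamma)
        candidates.append((policy, V[start]))
    return pareto_front(candidates)


# Hypothetical two-state, two-objective toy problem (not from the paper).
states = [0, 1]
actions = ["stay", "move"]
next_state = {(0, "stay"): 0, (0, "move"): 1,
              (1, "stay"): 1, (1, "move"): 0}
reward = {(0, "stay"): (1.0, 0.0), (0, "move"): (0.0, 0.0),
          (1, "stay"): (0.0, 1.0), (1, "move"): (0.0, 0.0)}

front = pareto_stationary_policies(states, actions, next_state, reward,
                                   gamma=0.5, start=0)
for policy, value in front:
    print(policy, value)
```

The enumeration grows exponentially with the number of states, so a sketch like this is only usable on very small problems; dynamic programming approaches such as the one described in the abstract exist to avoid that blow-up.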

M.A. Wiering | E.D. de Jong
