In the most common setting of a Markov decision process (MDP), an agent evaluates a policy by the expectation of the (discounted) sum of rewards. In many applications, however, this criterion may be unsuitable for two reasons. First, in risk-averse settings the expectation of the accumulated reward is not robust enough; this is the case when the distribution of the accumulated reward is heavily skewed. Second, many applications naturally take several objectives into account when evaluating a policy; for instance, in autonomous driving an agent must balance speed and safety when choosing its actions. In this paper, we consider evaluating a policy by the sequence of quantiles it induces on a set of target states. Our idea is to reformulate the original problem as a multi-objective MDP in which a lexicographic preference arises naturally. To compute an optimal policy, we propose an algorithm, \textbf{FLMDP}, that solves general multi-objective MDPs with lexicographic reward preferences.
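To make the lexicographic preference concrete, the following is a minimal sketch of generic lexicographic value iteration for a tabular multi-objective MDP: actions tied (up to a tolerance) on a higher-priority objective survive and are ranked by the next objective. The function name, array layout, and tolerance handling are illustrative assumptions; this is not the paper's \textbf{FLMDP} algorithm.

\begin{verbatim}
import numpy as np

def lexicographic_value_iteration(P, R, gamma=0.95, n_iters=500, tol=1e-6):
    """Illustrative sketch (not the paper's FLMDP).
    P: (S, A, S) transition probabilities.
    R: (K, S, A) rewards for K objectives, highest priority first.
    Returns a deterministic policy that, per state, keeps only actions
    near-optimal for objective 0, then breaks ties with objective 1, etc."""
    K, S, A = R.shape
    allowed = [list(range(A)) for _ in range(S)]   # surviving actions per state
    policy = np.zeros(S, dtype=int)
    for k in range(K):                             # objectives in priority order
        V = np.zeros(S)
        for _ in range(n_iters):                   # value iteration on allowed actions
            Q = R[k] + gamma * (P @ V)             # (S, A) Q-values for objective k
            V_new = np.array([Q[s, allowed[s]].max() for s in range(S)])
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        Q = R[k] + gamma * (P @ V)
        for s in range(S):                         # prune actions not near-optimal for k
            best = Q[s, allowed[s]].max()
            allowed[s] = [a for a in allowed[s] if Q[s, a] >= best - tol]
            policy[s] = allowed[s][0]
    return policy
\end{verbatim}

The per-state pruning with a slack \texttt{tol} mirrors the relaxed lexicographic orderings studied in prior work on multi-objective MDPs; the paper's target-state quantile objectives would supply the per-objective reward arrays in such a reformulation.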