In the most common setting of a Markov decision process (MDP), an agent evaluates a policy by the expectation of the (discounted) sum of rewards. In many applications, however, this criterion may be unsuitable for two reasons. First, in risk-averse settings the expectation of the accumulated reward is not robust enough; this is the case when the distribution of the accumulated reward is heavily skewed. Second, many applications naturally take several objectives into account when evaluating a policy; for instance, in autonomous driving an agent must balance speed and safety when choosing its actions. In this paper, we consider evaluating a policy by the sequence of quantiles it induces on a set of target states. Our idea is to reformulate the original problem as a multi-objective MDP in which a lexicographic preference arises naturally. To compute an optimal policy, we propose an algorithm, \textbf{FLMDP}, that solves general multi-objective MDPs with lexicographic reward preferences.
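To make the lexicographic preference concrete, the following is a minimal sketch of generic lexicographic value iteration for a tabular multi-objective MDP: actions tied (up to a tolerance) on a higher-priority objective survive and are ranked by the next objective. The function name, array layout, and tolerance handling are illustrative assumptions; this is not the paper's \textbf{FLMDP} algorithm.

\begin{verbatim}
import numpy as np

def lexicographic_value_iteration(P, R, gamma=0.95, n_iters=500, tol=1e-6):
    """Illustrative sketch (not the paper's FLMDP).
    P: (S, A, S) transition probabilities.
    R: (K, S, A) rewards for K objectives, highest priority first.
    Returns a deterministic policy that, per state, keeps only actions
    near-optimal for objective 0, then breaks ties with objective 1, etc."""
    K, S, A = R.shape
    allowed = [list(range(A)) for _ in range(S)]   # surviving actions per state
    policy = np.zeros(S, dtype=int)
    for k in range(K):                             # objectives in priority order
        V = np.zeros(S)
        for _ in range(n_iters):                   # value iteration on allowed actions
            Q = R[k] + gamma * (P @ V)             # (S, A) Q-values for objective k
            V_new = np.array([Q[s, allowed[s]].max() for s in range(S)])
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        Q = R[k] + gamma * (P @ V)
        for s in range(S):                         # prune actions not near-optimal for k
            best = Q[s, allowed[s]].max()
            allowed[s] = [a for a in allowed[s] if Q[s, a] >= best - tol]
            policy[s] = allowed[s][0]
    return policy
\end{verbatim}

The per-state pruning with a slack \texttt{tol} mirrors the relaxed lexicographic orderings studied in prior work on multi-objective MDPs; the paper's target-state quantile objectives would supply the per-objective reward arrays in such a reformulation.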