Effect of Reward Function Choices in MDPs with Value-at-Risk

In reinforcement learning, a reward function defined on the current state and action is widely used. When the objective depends only on the expectation of the (discounted) total reward, this simplification is harmless. However, when the objective depends on the distribution of the total reward, it can give incorrect results. This paper studies Value-at-Risk (VaR) problems in short- and long-horizon Markov decision processes (MDPs) under two reward functions that share the same expectation. First, we show that under a VaR objective, when the true reward function is transition-based (depending on the action, the current state, and the next state), replacing it with the simplified state-based reward function (depending only on the action and the current state) changes the VaR. Second, for long-horizon MDPs, we estimate the VaR function with the aid of spectral theory and the central limit theorem. Third, because that estimation method applies to a Markov reward process whose reward depends only on the current state, we present a transformation algorithm for a Markov reward process whose reward depends on the current and next states, so that the VaR function can be estimated while keeping the total-reward distribution intact.
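The first claim can be illustrated numerically. The following is a minimal sketch, not the paper's construction: a hypothetical two-state, single-action chain with a transition-based reward r(s, s') and its state-based counterpart r̄(s) = E[r(s, S') | S = s]. The two rewards give identical expected total reward over any finite horizon, but their total-reward distributions, and hence their VaR (lower α-quantile), generally differ.

```python
import itertools
import numpy as np

# Hypothetical two-state Markov chain (single-action MDP): states {0, 1}.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])

# Transition-based reward r(s, s').
r_trans = np.array([[0.0, 10.0],
                    [4.0,  1.0]])

# State-based simplification: r_bar(s) = E[r(s, S') | S = s].
r_state = (P * r_trans).sum(axis=1)

T = 4          # horizon (number of transitions)
s0 = 0         # initial state
alpha = 0.3    # VaR level

def total_reward_dist(step_reward):
    """Enumerate all length-T paths; return sorted (total reward, probability) pairs."""
    dist = {}
    for path in itertools.product(range(2), repeat=T):
        prob, total, s = 1.0, 0.0, s0
        for s_next in path:
            prob *= P[s, s_next]
            total += step_reward(s, s_next)
            s = s_next
        key = round(total, 10)
        dist[key] = dist.get(key, 0.0) + prob
    return sorted(dist.items())

def var(dist, alpha):
    """Lower alpha-quantile of the total-reward distribution."""
    cum = 0.0
    for x, p in dist:
        cum += p
        if cum >= alpha:
            return x
    return dist[-1][0]

dist_trans = total_reward_dist(lambda s, s2: r_trans[s, s2])
dist_state = total_reward_dist(lambda s, s2: r_state[s])

mean = lambda d: sum(x * p for x, p in d)
print("expected total reward:", mean(dist_trans), mean(dist_state))  # identical
print("VaR_0.3 (transition-based reward):", var(dist_trans, alpha))
print("VaR_0.3 (state-based reward):     ", var(dist_state, alpha))  # generally differs
```

Exact enumeration is used here only because the horizon is tiny; for the long-horizon setting the paper instead relies on the spectral/central-limit approximation and its distribution-preserving transformation, which this toy example does not attempt to reproduce.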
