Effect of Reward Function Choices in MDPs with Value-at-Risk

In reinforcement learning, a reward function defined on the current state and action is widely used. When the objective depends only on the expectation of the (discounted) total reward, this simplification is harmless. However, when the objective depends on the distribution of the total reward, it can give incorrect results. This paper studies Value-at-Risk (VaR) problems in short- and long-horizon Markov decision processes (MDPs) under two reward functions that share the same expectation. First, we show that under a VaR objective, when the true reward function is transition-based (depending on the action, the current state, and the next state), replacing it with the simplified state-based reward function (depending only on the action and the current state) changes the VaR. Second, for long-horizon MDPs, we estimate the VaR function with the aid of spectral theory and the central limit theorem. Third, because that estimation method applies to a Markov reward process whose reward depends only on the current state, we present a transformation algorithm for a Markov reward process whose reward depends on the current and next states, so that the VaR function can be estimated while keeping the total-reward distribution intact.
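The first claim can be illustrated numerically. The following is a minimal sketch, not the paper's construction: a hypothetical two-state, single-action chain with a transition-based reward r(s, s') and its state-based counterpart r̄(s) = E[r(s, S') | S = s]. The two rewards give identical expected total reward over any finite horizon, but their total-reward distributions, and hence their VaR (lower α-quantile), generally differ.

```python
import itertools
import numpy as np

# Hypothetical two-state Markov chain (single-action MDP): states {0, 1}.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])

# Transition-based reward r(s, s').
r_trans = np.array([[0.0, 10.0],
                    [4.0,  1.0]])

# State-based simplification: r_bar(s) = E[r(s, S') | S = s].
r_state = (P * r_trans).sum(axis=1)

T = 4          # horizon (number of transitions)
s0 = 0         # initial state
alpha = 0.3    # VaR level

def total_reward_dist(step_reward):
    """Enumerate all length-T paths; return sorted (total reward, probability) pairs."""
    dist = {}
    for path in itertools.product(range(2), repeat=T):
        prob, total, s = 1.0, 0.0, s0
        for s_next in path:
            prob *= P[s, s_next]
            total += step_reward(s, s_next)
            s = s_next
        key = round(total, 10)
        dist[key] = dist.get(key, 0.0) + prob
    return sorted(dist.items())

def var(dist, alpha):
    """Lower alpha-quantile of the total-reward distribution."""
    cum = 0.0
    for x, p in dist:
        cum += p
        if cum >= alpha:
            return x
    return dist[-1][0]

dist_trans = total_reward_dist(lambda s, s2: r_trans[s, s2])
dist_state = total_reward_dist(lambda s, s2: r_state[s])

mean = lambda d: sum(x * p for x, p in d)
print("expected total reward:", mean(dist_trans), mean(dist_state))  # identical
print("VaR_0.3 (transition-based reward):", var(dist_trans, alpha))
print("VaR_0.3 (state-based reward):     ", var(dist_state, alpha))  # generally differs
```

Exact enumeration is used here only because the horizon is tiny; for the long-horizon setting the paper instead relies on the spectral/central-limit approximation and its distribution-preserving transformation, which this toy example does not attempt to reproduce.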
