State-Augmentation Transformations for Risk-Sensitive Reinforcement Learning

In the MDP framework, the general reward function takes three arguments (current state, action, and successor state), but it is often simplified to a function of two arguments (current state and action). The former is called a transition-based reward function, whereas the latter is called a state-based reward function. When the objective involves only the expected total reward, this simplification is harmless. When the objective is risk-sensitive, however, the simplification yields an incorrect value. We propose three successively more general state-augmentation transformations (SATs), which preserve the reward sequences, the reward distributions, and the optimal policy in risk-sensitive reinforcement learning. First, we prove that for every MDP with a stochastic transition-based reward function there exists an MDP with a deterministic state-based reward function such that, for any given (randomized) policy for the first MDP, there is a corresponding policy for the second MDP under which both Markov reward processes share the same reward sequence. Second, using an inventory control problem, we illustrate two situations that require the proposed SATs: applying Q-learning (or other learning methods) to MDPs with transition-based reward functions, and applying methods designed for Markov processes with deterministic state-based reward functions to Markov processes with general reward functions. We show the advantage of the SATs by considering Value-at-Risk as an example, a risk measure defined on the reward distribution rather than on summary statistics (such as the mean and variance) of that distribution. We illustrate the error in the estimated reward distribution caused by the reward simplification, and show how the SATs enable a variance formula to work on Markov processes with general reward functions.
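To make the augmentation idea concrete, here is a minimal simulation-level sketch (not the paper's formal construction): the wrapped MDP carries the previous state, the action taken, and the realized reward inside its augmented state, so the reward of the augmented process is a deterministic function of that state. The class name SATWrapper and the reset()/step() environment interface are assumptions for illustration only.

```python
# A minimal sketch of a state-augmentation transformation: wrap an MDP whose reward
# may depend stochastically on (state, action, next_state) so that the wrapped MDP
# exposes a deterministic, state-based reward. Names and interface are assumed.

class SATWrapper:
    """Augment the state with (prev_state, action, realized_reward) so that the
    reward of the augmented MDP is a deterministic function of its current state."""

    def __init__(self, env):
        self.env = env          # assumed to expose reset() -> s and step(a) -> (s', r)
        self._state = None

    def reset(self):
        s = self.env.reset()
        # No transition has occurred yet, so the augmented state carries a zero reward.
        self._state = (None, None, s, 0.0)
        return self._state

    def step(self, action):
        _, _, s, _ = self._state
        s_next, reward = self.env.step(action)   # reward may be stochastic in (s, a, s')
        # The realized reward is folded into the augmented state; the augmented
        # reward function simply reads it back, so it is deterministic and state-based.
        self._state = (s, action, s_next, reward)
        return self._state, self.augmented_reward(self._state)

    @staticmethod
    def augmented_reward(aug_state):
        # Deterministic state-based reward of the augmented MDP.
        return aug_state[3]
```

Because every reward realization of the original process appears unchanged in the augmented process, the reward sequence (and hence the reward distribution) is preserved, which is the property the SATs are designed to guarantee.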

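For the Value-at-Risk example, a distribution-level risk measure can be estimated directly from the empirical distribution of simulated total rewards, which is why preserving that distribution matters while mean/variance summaries are not enough. The sketch below is an assumption-laden illustration: `rollout` is a hypothetical function returning one episode's total reward under a fixed policy.

```python
# A minimal sketch, under assumed names, of estimating Value-at-Risk (VaR) as the
# empirical lower-tail quantile of the total-reward distribution.

import numpy as np

def estimate_var(rollout, num_episodes=10_000, alpha=0.05, seed=0):
    """Return the empirical alpha-quantile of the simulated total-reward distribution."""
    rng = np.random.default_rng(seed)
    returns = np.array([rollout(rng) for _ in range(num_episodes)])
    # VaR at level alpha: the value below which the total reward falls with
    # probability at most alpha.
    return np.quantile(returns, alpha)
```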