Distribution Estimation in Discounted MDPs via a Transformation

Although the general deterministic reward function in MDPs takes three arguments (the current state, the action, and the next state), it is often simplified to a function of two arguments (the current state and the action). The former is called a transition-based reward function, whereas the latter is called a state-based reward function. When the objective is a function of the expected cumulative reward only, this simplification causes no loss. However, when the objective is risk-sensitive (e.g., when it depends on the reward distribution), the simplification leads to incorrect values of the objective. This paper studies the estimation of the distribution of the cumulative discounted reward in infinite-horizon MDPs with finite state and action spaces. First, taking the Value-at-Risk (VaR) objective as an example, we illustrate and analyze the error that this simplification introduces into the reward distribution. Next, we propose a transformation of the MDP that preserves the reward distribution and converts transition-based reward functions into deterministic state-based reward functions; the transformation applies whether the transition-based reward function is deterministic or stochastic. Lastly, we show how to estimate the reward distribution after applying the proposed transformation in different settings, provided that the distribution is approximately normal.
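
To make the first point concrete, the following minimal sketch (a hypothetical two-state, single-action chain; the transition matrix P, reward table R, discount factor, risk level, and sample sizes are illustrative choices, not taken from the paper) simulates the discounted return twice: once with the transition-based reward R(s, s') and once after replacing it with its conditional expectation r(s) = E[R(s, S') | s]. The two sample means agree up to Monte Carlo error, but the empirical return distributions, and therefore their alpha-quantiles (used here as a stand-in for VaR), do not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, single-action MDP with a transition-based reward R[s, s'].
# (All numbers here are illustrative choices, not from the paper.)
P = np.array([[0.5, 0.5],
              [0.9, 0.1]])        # P[s, s'] = transition probability
R = np.array([[0.0, 10.0],
              [2.0, -5.0]])       # R[s, s'] = reward for the transition s -> s'
r_state = (P * R).sum(axis=1)     # simplified state-based reward r(s) = E[R(s, S') | s]

gamma, T, n_paths, alpha = 0.9, 100, 5_000, 0.1

def discounted_returns(use_transition_reward: bool) -> np.ndarray:
    """Monte Carlo samples of the (truncated) discounted return from state 0."""
    returns = np.empty(n_paths)
    for i in range(n_paths):
        s, g, disc = 0, 0.0, 1.0
        for _ in range(T):
            s_next = rng.choice(2, p=P[s])
            g += disc * (R[s, s_next] if use_transition_reward else r_state[s])
            disc *= gamma
            s = s_next
        returns[i] = g
    return returns

g_full = discounted_returns(True)    # transition-based reward
g_simpl = discounted_returns(False)  # simplified state-based reward

# Expected discounted returns agree up to Monte Carlo error ...
print(f"mean, transition-based: {g_full.mean():.3f}")
print(f"mean, simplified:       {g_simpl.mean():.3f}")
# ... but the return distributions, and hence their alpha-quantiles, do not.
print(f"quantile at level {alpha}, transition-based: {np.quantile(g_full, alpha):.3f}")
print(f"quantile at level {alpha}, simplified:       {np.quantile(g_simpl, alpha):.3f}")
```

The agreement of the means follows from the tower property of conditional expectation; higher moments and quantiles are not preserved, which is the kind of error the paper analyzes for the VaR objective.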
