Distribution Estimation in Discounted MDPs via a Transformation

Although the general deterministic reward function in MDPs takes three arguments (the current state, the action, and the next state), it is often simplified to a function of two arguments (the current state and the action). The former is called a transition-based reward function, whereas the latter is called a state-based reward function. When the objective is a function of the expected cumulative reward only, this simplification causes no loss. However, when the objective is risk-sensitive (e.g., when it depends on the reward distribution), the simplification leads to incorrect values of the objective. This paper studies the estimation of the distribution of the cumulative discounted reward in infinite-horizon MDPs with finite state and action spaces. First, taking the Value-at-Risk (VaR) objective as an example, we illustrate and analyze the error that this simplification introduces into the reward distribution. Next, we propose a transformation of the MDP that preserves the reward distribution and converts transition-based reward functions into deterministic state-based reward functions; the transformation applies whether the transition-based reward function is deterministic or stochastic. Lastly, we show how to estimate the reward distribution after applying the proposed transformation in different settings, provided that the distribution is approximately normal.
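
To make the first point concrete, the following minimal sketch (a hypothetical two-state, single-action chain; the transition matrix P, reward table R, discount factor, risk level, and sample sizes are illustrative choices, not taken from the paper) simulates the discounted return twice: once with the transition-based reward R(s, s') and once after replacing it with its conditional expectation r(s) = E[R(s, S') | s]. The two sample means agree up to Monte Carlo error, but the empirical return distributions, and therefore their alpha-quantiles (used here as a stand-in for VaR), do not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, single-action MDP with a transition-based reward R[s, s'].
# (All numbers here are illustrative choices, not from the paper.)
P = np.array([[0.5, 0.5],
              [0.9, 0.1]])        # P[s, s'] = transition probability
R = np.array([[0.0, 10.0],
              [2.0, -5.0]])       # R[s, s'] = reward for the transition s -> s'
r_state = (P * R).sum(axis=1)     # simplified state-based reward r(s) = E[R(s, S') | s]

gamma, T, n_paths, alpha = 0.9, 100, 5_000, 0.1

def discounted_returns(use_transition_reward: bool) -> np.ndarray:
    """Monte Carlo samples of the (truncated) discounted return from state 0."""
    returns = np.empty(n_paths)
    for i in range(n_paths):
        s, g, disc = 0, 0.0, 1.0
        for _ in range(T):
            s_next = rng.choice(2, p=P[s])
            g += disc * (R[s, s_next] if use_transition_reward else r_state[s])
            disc *= gamma
            s = s_next
        returns[i] = g
    return returns

g_full = discounted_returns(True)    # transition-based reward
g_simpl = discounted_returns(False)  # simplified state-based reward

# Expected discounted returns agree up to Monte Carlo error ...
print(f"mean, transition-based: {g_full.mean():.3f}")
print(f"mean, simplified:       {g_simpl.mean():.3f}")
# ... but the return distributions, and hence their alpha-quantiles, do not.
print(f"quantile at level {alpha}, transition-based: {np.quantile(g_full, alpha):.3f}")
print(f"quantile at level {alpha}, simplified:       {np.quantile(g_simpl, alpha):.3f}")
```

The agreement of the means follows from the tower property of conditional expectation; higher moments and quantiles are not preserved, which is the kind of error the paper analyzes for the VaR objective.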
