Variance-Based Risk Estimations in Markov Processes via Transformation with State Lumping

Variance plays a key role in risk-sensitive reinforcement learning, and most risk measures can be analyzed via variance. In this paper, we consider two law-invariant risks as examples: the mean-variance risk and the exponential utility risk. With the aid of the state-augmentation transformation (SAT), we show that both risks can be estimated in Markov decision processes (MDPs) with a stochastic transition-based reward and a randomized policy. To mitigate the enlargement of the state space, a novel definition of isotopic states is proposed for state lumping, which exploits the special structure of the transformed transition probability. In numerical experiments, we illustrate state lumping in the SAT, the errors introduced by a naive reward simplification, and the validity of the SAT for estimating the two risks.

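As a rough illustration of the two law-invariant risks named above, the following Python sketch Monte Carlo-estimates the mean-variance risk E[G] - lambda * Var(G) and the exponential-utility (entropic) risk -(1/beta) * log E[exp(-beta * G)] from sampled returns of a small toy MDP with a transition-based reward and a randomized policy. The toy kernel, reward, policy, horizon, and the parameters lambda and beta are illustrative assumptions; the sketch does not implement the paper's SAT or isotopic-state lumping, only the plain risk estimation that the SAT is designed to enable.

```python
import numpy as np

# Illustrative sketch only: Monte Carlo estimation of the mean-variance and
# exponential-utility risks from sampled discounted returns of a toy 2-state
# MDP. The MDP, policy, and risk parameters below are assumptions made for
# illustration, not the paper's construction.

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2
P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[s, a, s']: transition kernel
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[[1.0, 0.0], [0.5, 2.0]],      # R[s, a, s']: transition-based reward
              [[0.0, 1.5], [2.0, 0.5]]])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])      # pi[s, a]: randomized policy
gamma, horizon, n_episodes = 0.95, 50, 20000

returns = np.empty(n_episodes)
for ep in range(n_episodes):
    s, g = 0, 0.0
    for t in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        s_next = rng.choice(n_states, p=P[s, a])
        g += (gamma ** t) * R[s, a, s_next]
        s = s_next
    returns[ep] = g

lam, beta = 0.5, 0.2                          # assumed risk-sensitivity parameters
mean_variance = returns.mean() - lam * returns.var()
exp_utility = -np.log(np.mean(np.exp(-beta * returns))) / beta
print(f"mean-variance risk  ~ {mean_variance:.3f}")
print(f"exponential-utility risk ~ {exp_utility:.3f}")
```

The point of the SAT in the paper is that such returns become amenable to standard (augmented-state) dynamic-programming or learning machinery even when rewards are stochastic and transition-based; the Monte Carlo estimate above merely fixes what the two risk functionals mean on sampled returns.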