Risk-sensitive reinforcement learning: a martingale approach to reward uncertainty

We introduce a novel framework to account for sensitivity to reward uncertainty in sequential decision-making problems. While the risk-sensitive formulations of Markov decision processes studied so far focus on the distribution of the cumulative reward as a whole, we aim to learn policies sensitive to the uncertain/stochastic nature of the rewards themselves, which is conceptually more meaningful in some settings. To this end, we present a new decomposition of the randomness contained in the cumulative reward, based on the Doob decomposition of a stochastic process, and introduce a new conceptual tool, the "chaotic variation", which can be rigorously interpreted as a risk measure of the martingale component associated with the cumulative reward process. On the reinforcement learning side, we incorporate this risk-sensitive approach into model-free algorithms, both policy-gradient and value-function based, and illustrate its relevance on grid-world and portfolio optimization problems.
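The Doob decomposition underlying the abstract splits the cumulative reward S_t into a predictable part A_t (the sum of conditional expectations E[r_s | F_{s-1}]) and a martingale part M_t = S_t - A_t. A minimal numerical sketch of this split is below; the two-state chain, the Gaussian reward model, and the use of the empirical standard deviation of the martingale increments as a stand-in risk measure are all illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

# Sketch: Doob decomposition of a cumulative reward process
#   S_t = A_t + M_t,
# where A_t = sum_{s<=t} E[r_s | F_{s-1}] is predictable and
# M_t = sum_{s<=t} (r_s - E[r_s | F_{s-1}]) is a martingale.
rng = np.random.default_rng(0)

T = 10_000
# Hypothetical two-state Markov chain; the reward's conditional mean
# depends only on the current state, so E[r_t | F_{t-1}] is known in
# closed form for this toy model.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
reward_mean = np.array([1.0, -0.5])

S = np.zeros(T)  # cumulative reward
A = np.zeros(T)  # predictable component
M = np.zeros(T)  # martingale component
state = 0
s_cum = a_cum = m_cum = 0.0
for t in range(T):
    cond_mean = reward_mean[state]            # E[r_t | F_{t-1}]
    r = cond_mean + rng.normal(0.0, 1.0)      # stochastic reward
    a_cum += cond_mean
    m_cum += r - cond_mean                    # zero-mean increment
    s_cum += r
    S[t], A[t], M[t] = s_cum, a_cum, m_cum
    state = rng.choice(2, p=P[state])

# The decomposition identity holds pathwise (up to float rounding).
assert np.allclose(S, A + M)

# Illustrative stand-in for a risk measure on the martingale part:
# the dispersion of its increments (NOT the paper's chaotic variation).
increments = np.diff(M, prepend=0.0)
dispersion = increments.std()
```

Any risk measure applied to the M component alone would, in the spirit of the abstract, penalize only the irreducible stochasticity of the rewards rather than the variability of the predictable part.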
