A Natural Actor-Critic Algorithm with Downside Risk Constraints

Existing work on risk-sensitive reinforcement learning, for both symmetric and downside risk measures, has typically used direct Monte-Carlo estimation of policy gradients. While this approach yields unbiased gradient estimates, it suffers from high variance and reduced sample efficiency compared to temporal-difference methods. In this paper, we study prediction and control with aversion to downside risk, which we gauge by the lower partial moment of the return. We introduce a new Bellman equation that upper bounds the lower partial moment, circumventing its non-linearity. We prove that the Bellman operator for this proxy is a contraction, and provide intuition for the stability of the resulting algorithm through a variance decomposition. This enables sample-efficient, online estimation of partial moments. For risk-sensitive control, we instantiate Reward Constrained Policy Optimization, a recent actor-critic method for finding constrained policies, with our proxy for the lower partial moment. We extend the method to use natural policy gradients and demonstrate the effectiveness of our approach on three benchmark problems for risk-sensitive reinforcement learning.
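
The abstract contrasts direct Monte-Carlo estimation of downside risk with the temporal-difference proxy developed in the paper. The paper's Bellman-equation upper bound is not reproduced here, but the underlying risk measure is the standard lower partial moment (Fishburn, 1977): the expected shortfall of the return below a threshold, raised to a chosen order. The NumPy sketch below shows the plain Monte-Carlo estimate that the proposed temporal-difference approach is meant to improve upon; the threshold, order, and example return distributions are illustrative assumptions, not values taken from the paper.

import numpy as np

def lower_partial_moment(returns, threshold=0.0, order=2):
    """Monte-Carlo estimate of the lower partial moment of the return.

    LPM_n(tau) = E[ max(tau - G, 0)^n ], where G is the sampled return.
    Only outcomes below the threshold tau contribute, so the measure
    penalises downside dispersion while ignoring upside deviations.
    """
    shortfall = np.maximum(threshold - np.asarray(returns, dtype=float), 0.0)
    return np.mean(shortfall ** order)

# Two return distributions with (approximately) equal mean but different left tails.
rng = np.random.default_rng(0)
symmetric = rng.normal(loc=1.0, scale=1.0, size=10_000)
skewed = np.concatenate([rng.normal(1.3, 0.5, 9_000),    # frequent moderate gains
                         rng.normal(-1.7, 0.5, 1_000)])  # rare large losses

print(lower_partial_moment(symmetric))  # small downside risk
print(lower_partial_moment(skewed))     # heavier left tail gives a larger LPM

In the control setting described above, an estimate of this quantity would serve as the constraint in Reward Constrained Policy Optimization, with the reward-maximising policy update taken along the natural gradient direction; the contribution of the paper is to replace the high-variance Monte-Carlo estimate with a sample-efficient, online temporal-difference estimate of an upper bound on the lower partial moment.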
