Cautious Reinforcement Learning via Distributional Risk in the Dual Domain

We study the estimation of risk-sensitive policies in reinforcement learning problems defined by Markov Decision Processes (MDPs) with finite state and action spaces. Prior efforts are predominantly afflicted by computational challenges associated with the fact that risk-sensitive MDPs are time-inconsistent. To ameliorate this issue, we propose a new definition of risk, which we call caution, as a penalty function added to the dual objective of the linear programming (LP) formulation of reinforcement learning. The caution measures the distributional risk of a policy, defined as a function of the policy's long-term state occupancy distribution. To solve this problem in an online, model-free manner, we propose a stochastic variant of the primal-dual method that uses the Kullback-Leibler (KL) divergence as its proximal term. We establish that the number of iterations/samples this scheme requires to attain an approximately optimal solution matches the tight dependencies on the cardinality of the state and action spaces, but differs in its dependence on the infinity norm of the gradient of the risk measure. Experiments demonstrate the merits of this approach for improving the reliability of reward accumulation without additional computational burden.
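To make the construction concrete, the following is a sketch in illustrative notation (the symbols \mu, \xi, and \rho are assumptions for exposition, not necessarily the paper's exact formulation). The dual LP of a discounted MDP optimizes over state-action occupancy measures \lambda, and caution enters as a penalty on the induced state occupancy:

\max_{\lambda \ge 0} \;\; \sum_{s,a} \lambda(s,a)\, r(s,a) \;-\; \mu\, \rho(\lambda_S), \qquad \lambda_S(s) := \sum_{a} \lambda(s,a),

subject to the Bellman flow constraints

\sum_{a} \lambda(s',a) \;=\; (1-\gamma)\, \xi(s') \;+\; \gamma \sum_{s,a} P(s' \mid s,a)\, \lambda(s,a) \quad \text{for every state } s',

where \xi is an initial state distribution, \mu > 0 weights the risk penalty, and \rho is assumed convex so that the penalized dual remains a concave maximization. A stochastic primal-dual iteration with a KL proximal term then amounts to a multiplicative (exponentiated-gradient) update of the occupancy variable, \lambda_{t+1} \propto \lambda_t \odot \exp(\eta\, \hat{g}_t) followed by normalization, where \hat{g}_t is a sampled gradient of the Lagrangian with respect to \lambda.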
