Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning

We present a mean-variance policy iteration (MVPI) framework for risk-averse control in discounted infinite-horizon MDPs. MVPI is highly flexible: any policy evaluation method and any risk-neutral control method can be dropped in off the shelf to obtain a risk-averse control algorithm, in both on- and off-policy settings. As an instantiation of MVPI, we propose risk-averse TD3, which outperforms vanilla TD3 and many previous risk-averse control methods on challenging MuJoCo robot simulation tasks under a risk-aware performance metric. Risk-averse TD3 is the first algorithm to bring deterministic policies and off-policy learning into risk-averse reinforcement learning, and both are key to the performance gains we observe in the MuJoCo domains. Unlike most prior work, which considers the variance of the total reward, MVPI adopts the per-step reward perspective (Bisi et al., 2019) for risk-averse control.
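
The abstract leaves the actual iteration implicit. As a rough illustration of the plug-and-play flexibility claimed above, the sketch below assumes the per-step-reward mean-variance objective E_{d_pi}[R] - lambda * Var_{d_pi}[R] and, after dualizing the squared-mean term, an augmented reward r_hat = r - lambda * r^2 + 2 * lambda * y * r, where y is the current estimate of the policy's mean per-step reward. This specific form, the `env`/`agent` interfaces, and all function and parameter names are assumptions made for illustration, not details quoted from the paper; any risk-neutral learner exposing an `act`/`observe` interface (e.g. a TD3-style agent) could be substituted.

```python
import numpy as np

# Hypothetical interfaces: `env` is assumed to follow the classic OpenAI Gym
# API (reset() -> state, step(a) -> (state, reward, done, info)), and `agent`
# is any off-the-shelf risk-neutral learner exposing act(state) and
# observe(s, a, r, s_next, done). The augmented-reward form below is an
# assumption about MVPI, not a quote from the abstract.

def mvpi_train(env, agent, lam, num_iterations=1000, steps_per_iter=1000):
    """Mean-variance policy iteration sketch for the per-step-reward objective
    J(pi) - lam * Var[R]. Each iteration (i) re-estimates the mean per-step
    reward y of the current policy, then (ii) runs any risk-neutral control
    method on the augmented reward r_hat = r - lam * r**2 + 2 * lam * y * r.
    """
    y = 0.0  # running estimate of the mean per-step reward
    state = env.reset()
    for _ in range(num_iterations):
        rewards = []
        for _ in range(steps_per_iter):
            action = agent.act(state)
            next_state, reward, done, _ = env.step(action)
            rewards.append(reward)
            # Risk-neutral update on the augmented (risk-aware) reward.
            r_hat = reward - lam * reward ** 2 + 2.0 * lam * y * reward
            agent.observe(state, action, r_hat, next_state, done)
            state = env.reset() if done else next_state
        # Policy evaluation step: any evaluator works; a Monte Carlo average
        # of the per-step rewards collected above is used here for simplicity.
        y = float(np.mean(rewards))
    return agent
```

In this reading, the inner loop is ordinary risk-neutral control on the augmented reward, and the only risk-specific machinery is the periodic re-estimation of y; this is what would let any off-the-shelf policy evaluation and control method be reused, in either on- or off-policy form.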

[1] M. J. Sobel. The variance of discounted Markov decision processes, 1982.

[2] W. Sharpe et al. Mean-Variance Analysis in Portfolio Choice and Capital Markets, 1987.

[3] Jerzy A. Filar et al. Variance-Penalized Markov Decision Processes. Math. Oper. Res., 1989.

[4] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[5] Dimitri P. Bertsekas. Nonlinear Programming, 1997.

[6] John N. Tsitsiklis et al. Neuro-Dynamic Programming, 1996.

[7] Daniel Hernández-Hernández et al. Risk Sensitive Markov Decision Processes, 1997.

[8] John N. Tsitsiklis et al. Actor-Critic Algorithms. NIPS, 1999.

[9] Yishay Mansour et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS, 1999.

[10] Shaun S. Wang. A Class of Distortion Operators for Pricing Financial and Insurance Risks, 2000.

[11] Duan Li et al. Optimal Dynamic Portfolio Selection: Multiperiod Mean-Variance Formulation, 2000.

[12] P. Tseng. Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization, 2001.

[13] Sanjoy Dasgupta et al. Off-Policy Temporal Difference Learning with Function Approximation. ICML, 2001.

[14] Vivek S. Borkar. Q-Learning for Risk-Sensitive Control. Math. Oper. Res., 2002.

[15] John Langford et al. Approximately Optimal Approximate Reinforcement Learning. ICML, 2002.

[16] Peter Dayan et al. Q-learning. Machine Learning, 1992.

[17] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 2004.

[18] Long Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 1992.

[19] Richard S. Sutton et al. Reinforcement Learning: An Introduction, 1998.

[20] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

[21] Jack L. Treynor et al. Mutual Fund Performance, 2007.

[22] D. Parker. Managing risk in healthcare: understanding your safety culture using the Manchester Patient Safety Framework (MaPSaF). Journal of Nursing Management, 2009.

[23] Hado van Hasselt. Double Q-learning. NIPS, 2010.

[24] Ambuj Tewari et al. On the Finite Time Convergence of Cyclic Coordinate Descent Methods. arXiv, 2010.

[25] Haipeng Xing et al. Mean-variance portfolio optimization when means and covariances are unknown. arXiv:1108.0996, 2011.

[26] John N. Tsitsiklis et al. Mean-Variance Optimization in Markov Decision Processes. ICML, 2011.

[27] R. Sutton et al. Gradient temporal-difference learning algorithms, 2011.

[28] Shalabh Bhatnagar et al. Stochastic Recursive Algorithms for Optimization, 2012.

[29] Shie Mannor et al. Policy Gradients with Variance Related Risk Criteria. ICML, 2012.

[30] Mohammad Ghavamzadeh et al. Actor-Critic Algorithms for Risk-Sensitive MDPs. NIPS, 2013.

[31] Ambuj Tewari et al. On the Nonasymptotic Convergence of Cyclic Coordinate Descent Methods. SIAM J. Optim., 2013.

[32] Guy Lever et al. Deterministic Policy Gradient Algorithms. ICML, 2014.

[33] Sajal K. Das et al. Beyond exponential utility functions: A variance-adjusted approach for risk-averse reinforcement learning. IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014.

[34] Mohammad Ghavamzadeh et al. Algorithms for CVaR Optimization in MDPs. NIPS, 2014.

[35] Stephen J. Wright. Coordinate descent algorithms. Mathematical Programming, 2015.

[36] Shie Mannor et al. Optimizing the CVaR via Sampling. AAAI, 2014.

[37] Sergey Levine et al. Trust Region Policy Optimization. ICML, 2015.

[38] Shane Legg et al. Human-level control through deep reinforcement learning. Nature, 2015.

[39] Philip S. Thomas et al. High-Confidence Off-Policy Evaluation. AAAI, 2015.

[40] Marek Petrik et al. Finite-Sample Analysis of Proximal Gradient TD Algorithms. UAI, 2015.

[41] Yuval Tassa et al. Continuous control with deep reinforcement learning. ICLR, 2015.

[42] Hermann Winner et al. Autonomous Driving: Technical, Legal and Social Aspects, 2016.

[43] Nan Jiang et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning. ICML, 2015.

[44] Benjamin Van Roy et al. Deep Exploration via Bootstrapped DQN. NIPS, 2016.

[45] Alex Graves et al. Asynchronous Methods for Deep Reinforcement Learning. ICML, 2016.

[46] Phillipp Kaestner et al. Linear and Nonlinear Programming, 2016.

[47] Demis Hassabis et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

[48] Philip S. Thomas et al. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning. ICML, 2016.

[49] Wojciech Zaremba et al. OpenAI Gym. arXiv, 2016.

[50] Doina Precup et al. The Option-Critic Architecture. AAAI, 2016.

[51] Marco Pavone et al. Risk-Constrained Reinforcement Learning with Percentile Risk Criteria. J. Mach. Learn. Res., 2015.

[52] Shie Mannor et al. Consistent On-Line Off-Policy Evaluation. ICML, 2017.

[53] Marco Pavone et al. How Should a Robot Assess Risk? Towards an Axiomatic Theory of Risk in Robotics. ISRR, 2017.

[54] Alec Radford et al. Proximal Policy Optimization Algorithms. arXiv, 2017.

[55] Marcello Restelli et al. Stochastic Variance-Reduced Policy Gradient. ICML, 2018.

[56] Le Song et al. SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation. ICML, 2017.

[57] David Silver et al. Meta-Gradient Reinforcement Learning. NeurIPS, 2018.

[58] Herke van Hoof et al. Addressing Function Approximation Error in Actor-Critic Methods. ICML, 2018.

[59] Shane Legg et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. ICML, 2018.

[60] Sergey Levine et al. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML, 2018.

[61] Bo Liu et al. A Block Coordinate Ascent Algorithm for Mean-Variance Optimization. NeurIPS, 2018.

[62] Richard S. Sutton et al. Multi-step Reinforcement Learning: A Unifying Algorithm. AAAI, 2017.

[63] Qiang Liu et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation. NeurIPS, 2018.

[64] Albin Cassirer et al. Randomized Prior Functions for Deep Reinforcement Learning. NeurIPS, 2018.

[65] Ilya Kostrikov et al. AlgaeDICE: Policy Gradient from Arbitrary Experience. arXiv, 2019.

[66] Marcello Restelli et al. Risk-Averse Trust Region Optimization for Reward-Volatility Reduction. IJCAI, 2019.

[67] Qi Cai et al. Neural Trust Region/Proximal Policy Optimization Attains Globally Optimal Policy. NeurIPS, 2019.

[68] Wojciech M. Czarnecki et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 2019.

[69] Emma Brunskill et al. Off-Policy Policy Gradient with State Distribution Correction. UAI, 2019.

[70] Bo Dai et al. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections. NeurIPS, 2019.

[71] Harm van Seijen et al. Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning. NeurIPS, 2019.

[72] Marc G. Bellemare et al. Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift. AAAI, 2019.

[73] Bo Dai et al. GenDICE: Generalized Offline Estimation of Stationary Values. ICLR, 2020.

[74] S. Whiteson et al. GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values. ICML, 2020.

[75] Hengshuai Yao et al. Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation. ICML, 2019.

[76] Qiang Liu et al. Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning. ICLR, 2020.