Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning

We present a mean-variance policy iteration (MVPI) framework for risk-averse control in discounted infinite-horizon MDPs. MVPI is highly flexible: any policy evaluation method and any risk-neutral control method can be dropped in off the shelf to obtain a risk-averse control algorithm, in both on- and off-policy settings. As an instantiation of MVPI, we propose risk-averse TD3, which outperforms vanilla TD3 and many previous risk-averse control methods on challenging MuJoCo robot simulation tasks under a risk-aware performance metric. Risk-averse TD3 is the first algorithm to bring deterministic policies and off-policy learning into risk-averse reinforcement learning, and both are key to the performance gains we observe in the MuJoCo domains. Unlike most prior work, which considers the variance of the total reward, MVPI adopts the per-step reward perspective (Bisi et al., 2019) for risk-averse control.
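
The abstract leaves the actual iteration implicit. As a rough illustration of the plug-and-play flexibility claimed above, the sketch below assumes the per-step-reward mean-variance objective E_{d_pi}[R] - lambda * Var_{d_pi}[R] and, after dualizing the squared-mean term, an augmented reward r_hat = r - lambda * r^2 + 2 * lambda * y * r, where y is the current estimate of the policy's mean per-step reward. This specific form, the `env`/`agent` interfaces, and all function and parameter names are assumptions made for illustration, not details quoted from the paper; any risk-neutral learner exposing an `act`/`observe` interface (e.g. a TD3-style agent) could be substituted.

```python
import numpy as np

# Hypothetical interfaces: `env` is assumed to follow the classic OpenAI Gym
# API (reset() -> state, step(a) -> (state, reward, done, info)), and `agent`
# is any off-the-shelf risk-neutral learner exposing act(state) and
# observe(s, a, r, s_next, done). The augmented-reward form below is an
# assumption about MVPI, not a quote from the abstract.

def mvpi_train(env, agent, lam, num_iterations=1000, steps_per_iter=1000):
    """Mean-variance policy iteration sketch for the per-step-reward objective
    J(pi) - lam * Var[R]. Each iteration (i) re-estimates the mean per-step
    reward y of the current policy, then (ii) runs any risk-neutral control
    method on the augmented reward r_hat = r - lam * r**2 + 2 * lam * y * r.
    """
    y = 0.0  # running estimate of the mean per-step reward
    state = env.reset()
    for _ in range(num_iterations):
        rewards = []
        for _ in range(steps_per_iter):
            action = agent.act(state)
            next_state, reward, done, _ = env.step(action)
            rewards.append(reward)
            # Risk-neutral update on the augmented (risk-aware) reward.
            r_hat = reward - lam * reward ** 2 + 2.0 * lam * y * reward
            agent.observe(state, action, r_hat, next_state, done)
            state = env.reset() if done else next_state
        # Policy evaluation step: any evaluator works; a Monte Carlo average
        # of the per-step rewards collected above is used here for simplicity.
        y = float(np.mean(rewards))
    return agent
```

In this reading, the inner loop is ordinary risk-neutral control on the augmented reward, and the only risk-specific machinery is the periodic re-estimation of y; this is what would let any off-the-shelf policy evaluation and control method be reused, in either on- or off-policy form.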

[1] M. J. Sobel. The variance of discounted Markov decision processes, 1982.

[2] W. Sharpe et al. Mean-Variance Analysis in Portfolio Choice and Capital Markets, 1987.

[3] Jerzy A. Filar et al. Variance-Penalized Markov Decision Processes. Math. Oper. Res., 1989.

[4] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[5] Dimitri P. Bertsekas. Nonlinear Programming, 1997.

[6] John N. Tsitsiklis et al. Neuro-Dynamic Programming, 1996.

[7] Daniel Hernández-Hernández et al. Risk Sensitive Markov Decision Processes, 1997.

[8] John N. Tsitsiklis et al. Actor-Critic Algorithms. NIPS, 1999.

[9] Yishay Mansour et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS, 1999.

[10] Shaun S. Wang. A Class of Distortion Operators for Pricing Financial and Insurance Risks, 2000.

[11] Duan Li et al. Optimal Dynamic Portfolio Selection: Multiperiod Mean-Variance Formulation, 2000.

[12] P. Tseng. Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization, 2001.

[13] Sanjoy Dasgupta et al. Off-Policy Temporal Difference Learning with Function Approximation. ICML, 2001.

[14] Vivek S. Borkar. Q-Learning for Risk-Sensitive Control. Math. Oper. Res., 2002.

[15] John Langford et al. Approximately Optimal Approximate Reinforcement Learning. ICML, 2002.

[16] Peter Dayan et al. Q-learning. Machine Learning, 1992.

[17] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 2004.

[18] Long Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 1992.

[19] Richard S. Sutton et al. Reinforcement Learning: An Introduction, 1998.

[20] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

[21] Jack L. Treynor et al. Mutual Fund Performance, 2007.

[22] D. Parker. Managing risk in healthcare: understanding your safety culture using the Manchester Patient Safety Framework (MaPSaF). Journal of Nursing Management, 2009.

[23] Hado van Hasselt. Double Q-learning. NIPS, 2010.

[24] Ambuj Tewari et al. On the Finite Time Convergence of Cyclic Coordinate Descent Methods. arXiv, 2010.

[25] Haipeng Xing et al. Mean-variance portfolio optimization when means and covariances are unknown. arXiv:1108.0996, 2011.

[26] John N. Tsitsiklis et al. Mean-Variance Optimization in Markov Decision Processes. ICML, 2011.

[27] R. Sutton et al. Gradient temporal-difference learning algorithms, 2011.

[28] Shalabh Bhatnagar et al. Stochastic Recursive Algorithms for Optimization, 2012.

[29] Shie Mannor et al. Policy Gradients with Variance Related Risk Criteria. ICML, 2012.

[30] Mohammad Ghavamzadeh et al. Actor-Critic Algorithms for Risk-Sensitive MDPs. NIPS, 2013.

[31] Ambuj Tewari et al. On the Nonasymptotic Convergence of Cyclic Coordinate Descent Methods. SIAM J. Optim., 2013.

[32] Guy Lever et al. Deterministic Policy Gradient Algorithms. ICML, 2014.

[33] Sajal K. Das et al. Beyond exponential utility functions: A variance-adjusted approach for risk-averse reinforcement learning. IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014.

[34] Mohammad Ghavamzadeh et al. Algorithms for CVaR Optimization in MDPs. NIPS, 2014.

[35] Stephen J. Wright. Coordinate descent algorithms. Mathematical Programming, 2015.

[36] Shie Mannor et al. Optimizing the CVaR via Sampling. AAAI, 2014.

[37] Sergey Levine et al. Trust Region Policy Optimization. ICML, 2015.

[38] Shane Legg et al. Human-level control through deep reinforcement learning. Nature, 2015.

[39] Philip S. Thomas et al. High-Confidence Off-Policy Evaluation. AAAI, 2015.

[40] Marek Petrik et al. Finite-Sample Analysis of Proximal Gradient TD Algorithms. UAI, 2015.

[41] Yuval Tassa et al. Continuous control with deep reinforcement learning. ICLR, 2015.

[42] Hermann Winner et al. Autonomous Driving: Technical, Legal and Social Aspects, 2016.

[43] Nan Jiang et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning. ICML, 2015.

[44] Benjamin Van Roy et al. Deep Exploration via Bootstrapped DQN. NIPS, 2016.

[45] Alex Graves et al. Asynchronous Methods for Deep Reinforcement Learning. ICML, 2016.

[46] Phillipp Kaestner et al. Linear and Nonlinear Programming, 2016.

[47] Demis Hassabis et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

[48] Philip S. Thomas et al. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning. ICML, 2016.

[49] Wojciech Zaremba et al. OpenAI Gym. arXiv, 2016.

[50] Doina Precup et al. The Option-Critic Architecture. AAAI, 2016.

[51] Marco Pavone et al. Risk-Constrained Reinforcement Learning with Percentile Risk Criteria. J. Mach. Learn. Res., 2015.

[52] Shie Mannor et al. Consistent On-Line Off-Policy Evaluation. ICML, 2017.

[53] Marco Pavone et al. How Should a Robot Assess Risk? Towards an Axiomatic Theory of Risk in Robotics. ISRR, 2017.

[54] Alec Radford et al. Proximal Policy Optimization Algorithms. arXiv, 2017.

[55] Marcello Restelli et al. Stochastic Variance-Reduced Policy Gradient. ICML, 2018.

[56] Le Song et al. SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation. ICML, 2017.

[57] David Silver et al. Meta-Gradient Reinforcement Learning. NeurIPS, 2018.

[58] Herke van Hoof et al. Addressing Function Approximation Error in Actor-Critic Methods. ICML, 2018.

[59] Shane Legg et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. ICML, 2018.

[60] Sergey Levine et al. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML, 2018.

[61] Bo Liu et al. A Block Coordinate Ascent Algorithm for Mean-Variance Optimization. NeurIPS, 2018.

[62] Richard S. Sutton et al. Multi-step Reinforcement Learning: A Unifying Algorithm. AAAI, 2017.

[63] Qiang Liu et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation. NeurIPS, 2018.

[64] Albin Cassirer et al. Randomized Prior Functions for Deep Reinforcement Learning. NeurIPS, 2018.

[65] Ilya Kostrikov et al. AlgaeDICE: Policy Gradient from Arbitrary Experience. arXiv, 2019.

[66] Marcello Restelli et al. Risk-Averse Trust Region Optimization for Reward-Volatility Reduction. IJCAI, 2019.

[67] Qi Cai et al. Neural Trust Region/Proximal Policy Optimization Attains Globally Optimal Policy. NeurIPS, 2019.

[68] Wojciech M. Czarnecki et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 2019.

[69] Emma Brunskill et al. Off-Policy Policy Gradient with State Distribution Correction. UAI, 2019.

[70] Bo Dai et al. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections. NeurIPS, 2019.

[71] Harm van Seijen et al. Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning. NeurIPS, 2019.

[72] Marc G. Bellemare et al. Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift. AAAI, 2019.

[73] Bo Dai et al. GenDICE: Generalized Offline Estimation of Stationary Values. ICLR, 2020.

[74] S. Whiteson et al. GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values. ICML, 2020.

[75] Hengshuai Yao et al. Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation. ICML, 2019.

[76] Qiang Liu et al. Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning. ICLR, 2020.