Policy Gradients Beyond Expectations: Conditional Value-at-Risk

Conditional Value-at-Risk (CVaR) is a prominent risk measure used extensively in domains such as finance. In this work we present a new formula for the gradient of the CVaR in the form of a conditional expectation. Our result is analogous to policy gradients in the reinforcement learning literature. Based on this formula, we propose novel sampling-based estimators for the CVaR gradient and a corresponding gradient descent procedure for CVaR optimization. We evaluate our approach by learning a risk-sensitive controller for the game of Tetris, and propose an importance sampling procedure suitable for such domains; a sketch of the basic sampling estimator is given below.
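To make the conditional-expectation form of the gradient concrete, the following is a minimal sketch of a likelihood-ratio (score-function) Monte Carlo estimator: sample losses, use the empirical alpha-quantile as a stand-in for the VaR, and average the score weighted by the excess loss over the tail samples. The function names (`sample_loss`, `score`) and the Gaussian sanity check are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def cvar_gradient_estimate(theta, sample_loss, score, alpha=0.95, n=100_000,
                           rng=None):
    """Monte Carlo estimate of d/d(theta) CVaR_alpha(Z) via the score function.

    sample_loss(theta, rng, n) -> (losses, xs): n i.i.d. losses Z and the raw
        samples X they were computed from.
    score(theta, xs)           -> (n, d) array of per-sample score functions,
        i.e. the gradient of log f_theta(X) with respect to theta.
    """
    rng = np.random.default_rng() if rng is None else rng
    losses, xs = sample_loss(theta, rng, n)
    scores = score(theta, xs)

    var_hat = np.quantile(losses, alpha)   # empirical VaR (alpha-quantile)
    tail = losses >= var_hat               # worst (1 - alpha) fraction of samples

    # Conditional-expectation form of the CVaR gradient:
    #   E[ grad_theta log f_theta(X) * (Z - VaR) | Z >= VaR ],
    # estimated by averaging over the tail samples only.
    excess = (losses[tail] - var_hat)[:, None]
    return (scores[tail] * excess).mean(axis=0)


if __name__ == "__main__":
    # Toy sanity check (assumed example): Z = X with X ~ N(theta, 1).
    # Then CVaR_alpha(Z) = theta + phi(z_alpha) / (1 - alpha), whose gradient
    # with respect to theta is exactly 1.
    def sample_loss(theta, rng, n):
        xs = rng.normal(theta, 1.0, size=n)
        return xs, xs

    def score(theta, xs):
        # grad_theta log N(x; theta, 1) = (x - theta)
        return (xs - theta)[:, None]

    grad = cvar_gradient_estimate(np.array([0.0]), sample_loss, score)
    print(grad)  # should be close to 1.0
```

In a reinforcement learning setting, `sample_loss` would correspond to rolling out trajectories under the current policy and `score` to the sum of per-step gradients of the log-policy, with importance sampling used to populate the tail more efficiently when tail events are rare.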
