Direct Expected Quadratic Utility Maximization for Mean-Variance Controlled Reinforcement Learning

In real-world decision-making problems, risk management is critical. Among the many approaches to risk management, the mean-variance criterion is one of the most widely used in practice. In this paper, we propose expected quadratic utility maximization (EQUM) as a new framework for policy-gradient-style reinforcement learning (RL) algorithms with mean-variance control. The quadratic utility function is a common objective for risk management in finance and economics. The EQUM framework admits several interpretations, including reward-constrained variance minimization, variance regularization, and agent utility maximization. In addition, EQUM is computationally simpler than existing mean-variance RL methods, which require double sampling. In experiments, we demonstrate the effectiveness of the proposed framework on standard RL benchmarks and financial data.
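
To make the double-sampling point concrete, the sketch below shows a REINFORCE-style loss for a quadratic utility objective. This is a minimal illustration, assuming the utility form u(G) = G - lam * G^2; the function name equm_loss, its signature, and this exact parameterization (e.g., whether a benchmark offset is subtracted before squaring) are our assumptions, not the paper's reference implementation.

```python
# Minimal sketch of a policy-gradient loss for expected quadratic utility
# maximization (illustrative; not the paper's reference implementation).
import torch

def equm_loss(log_probs, returns, lam=0.1):
    """REINFORCE-style loss for the objective E[u(G)], u(G) = G - lam * G^2.

    log_probs: sum_t log pi_theta(a_t | s_t) per trajectory, shape (B,)
    returns:   episodic return G per trajectory, shape (B,)
    lam:       risk-aversion coefficient of the quadratic utility
    """
    # Per-trajectory quadratic utility; since
    #   E[u(G)] = E[G] - lam * (Var(G) + E[G]**2),
    # maximizing it trades mean against variance, yet the objective is a
    # single expectation, so one Monte Carlo batch suffices.
    utility = returns - lam * returns.pow(2)
    # Score-function estimator: grad E[u(G)] = E[u(G) * grad log pi(tau)].
    # Negated because optimizers minimize; utility is detached so the
    # gradient flows only through log_probs.
    return -(utility.detach() * log_probs).mean()
```

Because E[u(G)] is a single expectation over trajectories, the estimator above is unbiased from one Monte Carlo batch. A direct variance penalty E[G] - lam * Var(G), by contrast, contains the term (E[G])^2, whose gradient couples two expectations and therefore requires two independent sample sets per update.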
