Direct Expected Quadratic Utility Maximization for Mean-Variance Controlled Reinforcement Learning

In real-world decision-making problems, risk management is critical. Among the many approaches to risk management, the mean-variance criterion is one of the most widely used in practice. In this paper, we propose expected quadratic utility maximization (EQUM) as a new framework for policy-gradient-style reinforcement learning (RL) algorithms with mean-variance control. The quadratic utility function is a common objective for risk management in finance and economics. The EQUM framework admits several interpretations, including reward-constrained variance minimization, variance regularization, and agent utility maximization. In addition, EQUM is computationally simpler than existing mean-variance RL methods, which require double sampling. In experiments, we demonstrate the effectiveness of the proposed framework on standard RL benchmarks and financial data.
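
To make the double-sampling point concrete, the sketch below shows a REINFORCE-style loss for a quadratic utility objective. This is a minimal illustration, assuming the utility form u(G) = G - lam * G^2; the function name equm_loss, its signature, and this exact parameterization (e.g., whether a benchmark offset is subtracted before squaring) are our assumptions, not the paper's reference implementation.

```python
# Minimal sketch of a policy-gradient loss for expected quadratic utility
# maximization (illustrative; not the paper's reference implementation).
import torch

def equm_loss(log_probs, returns, lam=0.1):
    """REINFORCE-style loss for the objective E[u(G)], u(G) = G - lam * G^2.

    log_probs: sum_t log pi_theta(a_t | s_t) per trajectory, shape (B,)
    returns:   episodic return G per trajectory, shape (B,)
    lam:       risk-aversion coefficient of the quadratic utility
    """
    # Per-trajectory quadratic utility; since
    #   E[u(G)] = E[G] - lam * (Var(G) + E[G]**2),
    # maximizing it trades mean against variance, yet the objective is a
    # single expectation, so one Monte Carlo batch suffices.
    utility = returns - lam * returns.pow(2)
    # Score-function estimator: grad E[u(G)] = E[u(G) * grad log pi(tau)].
    # Negated because optimizers minimize; utility is detached so the
    # gradient flows only through log_probs.
    return -(utility.detach() * log_probs).mean()
```

Because E[u(G)] is a single expectation over trajectories, the estimator above is unbiased from one Monte Carlo batch. A direct variance penalty E[G] - lam * Var(G), by contrast, contains the term (E[G])^2, whose gradient couples two expectations and therefore requires two independent sample sets per update.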
