Continuous‐time mean–variance portfolio selection: A reinforcement learning framework

We approach continuous-time mean-variance (MV) portfolio selection with reinforcement learning (RL). The problem is to achieve the best tradeoff between exploration and exploitation, and is formulated as an entropy-regularized, relaxed stochastic control problem. We prove that the optimal feedback policy for this problem must be Gaussian, with time-decaying variance. We then establish connections between the entropy-regularized MV problem and the classical MV problem, including the equivalence of their solvability and the convergence of the former to the latter as the exploration weighting parameter decays to zero. Finally, we prove a policy improvement theorem, based on which we devise an implementable RL algorithm. In our simulations, this algorithm outperforms both an adaptive control-based method and a deep neural network-based algorithm by a large margin.
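To make the phrase "Gaussian, with time-decaying variance" concrete, the following is a minimal sketch of sampling from such a policy. The abstract gives no formula, so the specific functional form used here (a mean linear in the gap between wealth `x` and a target level `w`, and a variance that scales with the exploration weight `lam` and shrinks exponentially as `t` approaches the horizon `T`) is an assumption made for illustration. The function names and the parameters `rho`, `sigma`, `lam`, `w`, and `T` are hypothetical as well.

```python
import math
import random


def exploratory_policy_params(t, x, w, rho, sigma, lam, T):
    """Return (mean, variance) of an assumed Gaussian exploratory policy.

    Assumed illustrative form, not quoted from the abstract:
        mean     = -(rho / sigma) * (x - w)
        variance = lam / (2 * sigma**2) * exp(rho**2 * (T - t))
    Here rho plays the role of a Sharpe ratio, sigma of a volatility,
    lam of the exploration weight, w of a wealth target, T of the horizon.
    """
    mean = -(rho / sigma) * (x - w)
    variance = lam / (2.0 * sigma ** 2) * math.exp(rho ** 2 * (T - t))
    return mean, variance


def sample_allocation(t, x, w, rho, sigma, lam, T, rng):
    """Draw one risky-asset allocation from the assumed Gaussian policy."""
    mean, variance = exploratory_policy_params(t, x, w, rho, sigma, lam, T)
    # random.Random.gauss takes a standard deviation, not a variance.
    return rng.gauss(mean, math.sqrt(variance))


if __name__ == "__main__":
    rng = random.Random(0)
    for t in (0.0, 0.5, 1.0):
        mean, var = exploratory_policy_params(
            t, x=1.0, w=1.4, rho=0.3, sigma=0.2, lam=2.0, T=1.0
        )
        draw = sample_allocation(
            t, x=1.0, w=1.4, rho=0.3, sigma=0.2, lam=2.0, T=1.0, rng=rng
        )
        print(f"t={t:.1f} mean={mean:.3f} variance={var:.3f} sample={draw:.3f}")
```

Under this assumed form, the variance falls as `t` approaches `T`, so exploration is heaviest early in the horizon. As `lam` tends to zero the policy collapses to its mean, which mirrors the convergence to the classical MV problem stated above.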
