MQLV: Modified Q-Learning for Vasicek Model

In reinforcement learning, an agent learns an optimal value function over a set of actions, or decisions, that lead to states yielding different rewards, with the objective of maximizing the overall reward. A policy assigns an expected return to each state-action pair, and a policy is optimal when its value function is optimal. QLBS, the Q-Learner in the Black-Scholes(-Merton) Worlds, applies reinforcement learning concepts, notably the popular Q-learning algorithm, to the financial stochastic model of Black, Scholes and Merton. However, QLBS is tailored to geometric Brownian motion and to the pricing of vanilla options. Consequently, it suffers from the well-known over-estimation of Q-values, which translates into an over-estimation of vanilla option prices. Furthermore, its range of application is limited to vanilla option pricing within financial markets. We propose MQLV, Modified Q-Learner for the Vasicek model, a new reinforcement learning approach that limits the Q-value over-estimation observed in QLBS and extends the simulation to mean-reverting stochastic diffusion processes. Additionally, MQLV uses a digital function to estimate the future probability of an event, widening the scope of application beyond finance to any other domain involving time series. Our experiments highlight the potential of MQLV on Monte Carlo simulations representative of retail banking time series. In particular, MQLV determines an optimal money-management policy from the aggregated financial transactions of clients, opening the way to personalized credit card limits or loans. Finally, MQLV is the first methodology compatible with the Vasicek model that is capable of estimating event probabilities, targeting the simulation of such probabilities in retail banking.
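To make the setting concrete, the minimal sketch below illustrates two building blocks the abstract refers to: a mean-reverting Vasicek diffusion simulated by Monte Carlo, and a digital (indicator) function whose Monte Carlo average estimates the probability of a future event. All parameter values are hypothetical placeholders, not taken from the paper, and the sketch does not reproduce the MQLV learning procedure itself.

```python
import numpy as np

# Hypothetical parameters for illustration only; they are not taken from the paper.
a, b, sigma = 0.5, 0.02, 0.01      # mean-reversion speed, long-term mean, volatility
x0, T, n_steps, n_paths = 0.015, 1.0, 252, 100_000
threshold = 0.025                   # event level K used in the digital payoff

rng = np.random.default_rng(seed=42)
dt = T / n_steps

# Euler-Maruyama discretization of the Vasicek SDE: dX_t = a (b - X_t) dt + sigma dW_t.
x = np.full(n_paths, x0)
for _ in range(n_steps):
    x += a * (b - x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

# Digital (indicator) payoff: its Monte Carlo average estimates P(X_T > K).
event_probability = np.mean(x > threshold)
print(f"Estimated probability of the event X_T > {threshold}: {event_probability:.4f}")
```

In this toy setting the mean reversion pulls the simulated paths toward the long-term mean b, and the digital payoff turns the terminal distribution into a single event probability, which is the quantity MQLV targets for retail banking time series.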
