Trading with the Momentum Transformer: An Intelligent and Interpretable Architecture

Deep learning architectures, specifically Deep Momentum Networks (DMNs) [30], have proved an effective approach to momentum and mean-reversion trading. However, key challenges remain: learning long-term dependencies, the degradation of performance when returns are considered net of transaction costs, and adapting to new market regimes, notably during the SARS-CoV-2 crisis. Attention mechanisms, as used in Transformer-based architectures, are a natural solution to these challenges because they allow the network to focus on significant time steps in the past and on longer-term patterns. We introduce the Momentum Transformer, an attention-based architecture which outperforms the benchmarks and is inherently interpretable, providing greater insight into our deep learning trading strategy. Our model is an extension of the LSTM-based DMN, which directly outputs position sizing by optimising the network on a risk-adjusted performance metric, such as the Sharpe ratio. We find that an attention-LSTM hybrid, Decoder-Only Temporal Fusion Transformer (TFT) style architecture is the best-performing model. In terms of interpretability, we observe remarkable structure in the attention patterns, with significant peaks of importance at momentum turning points. The time series is thus segmented into regimes, and the model tends to focus on previous time steps in similar regimes. We find that changepoint detection (CPD) [44], another technique for responding to regime change, can complement multi-headed attention, especially when CPD is run at multiple timescales. Through the addition of an interpretable variable selection network, we observe how CPD helps our model move away from trading predominantly on daily returns data. We note that the model can intelligently switch between, and blend, classical strategies, basing its decisions on patterns in the data.

∗Kieran Wood is the corresponding author and can be contacted via email: kieran.wood@eng.ox.ac.uk.
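To make the position-sizing idea concrete: a DMN is trained end-to-end against a risk-adjusted objective rather than a forecasting error, so the network's output can be used as a trading position directly. Below is a minimal PyTorch sketch of such a Sharpe-ratio loss, assuming daily data (hence the 252-day annualisation factor); the names and tensor shapes are illustrative, not the authors' code.

```python
import torch

def sharpe_loss(positions: torch.Tensor, returns: torch.Tensor,
                eps: float = 1e-9) -> torch.Tensor:
    """Negative annualised Sharpe ratio of the captured returns.

    positions: network outputs in [-1, 1], shape (batch, time)
    returns:   (volatility-scaled) next-step asset returns, same shape
    """
    captured = positions * returns                   # per-step strategy returns
    mean, std = captured.mean(), captured.std()
    annualisation = torch.sqrt(torch.tensor(252.0))  # daily data assumed
    return -(mean / (std + eps)) * annualisation     # minimising this maximises Sharpe
```

Training against a loss like this, rather than a prediction error, is what lets the network output positions directly; the position head is typically squashed with a tanh activation so that outputs lie in [-1, 1].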
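The variable selection network mentioned above is a TFT component [25] that assigns softmax weights to each input variable; inspecting those weights is what reveals, for example, the shift away from daily returns once CPD features are added. The sketch below is a simplified, hypothetical version (the TFT itself builds this from gated residual networks), intended only to show where the interpretable weights come from.

```python
import torch
import torch.nn as nn

class VariableSelection(nn.Module):
    """Simplified variable selection: learns softmax weights over the
    input variables, exposing per-variable importance at each time step."""

    def __init__(self, n_vars: int, d_model: int):
        super().__init__()
        # one linear embedding per raw input variable
        self.embed = nn.ModuleList(nn.Linear(1, d_model) for _ in range(n_vars))
        # small network producing one selection logit per variable
        self.weight_net = nn.Sequential(
            nn.Linear(n_vars, d_model), nn.ELU(), nn.Linear(d_model, n_vars)
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, time, n_vars), e.g. returns at several scales + CPD features
        weights = torch.softmax(self.weight_net(x), dim=-1)          # (B, T, V)
        embedded = torch.stack(
            [emb(x[..., i:i + 1]) for i, emb in enumerate(self.embed)], dim=-2
        )                                                            # (B, T, V, D)
        out = (weights.unsqueeze(-1) * embedded).sum(dim=-2)         # (B, T, D)
        return out, weights  # weights are the interpretable importances
```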
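The CPD features referred to above come from the module of [44], which fits a Gaussian-process changepoint model over a lookback window. As a rough illustration of running changepoint detection at multiple timescales, the sketch below swaps in the off-the-shelf ruptures detector, a different technique from the paper's GP-based CPD, to produce one regime-age feature per lookback window; all names and parameter values are illustrative.

```python
import numpy as np
import ruptures as rpt  # off-the-shelf detector, standing in for GP-based CPD

def regime_age(returns: np.ndarray, lookback: int) -> float:
    """Normalised time since the last changepoint detected in the trailing
    `lookback` observations (1.0 means no changepoint found in the window)."""
    window = returns[-lookback:].reshape(-1, 1)
    breaks = rpt.Pelt(model="rbf", min_size=3).fit(window).predict(pen=5.0)
    last_cp = breaks[-2] if len(breaks) > 1 else 0  # breaks ends with len(window)
    return (lookback - last_cp) / lookback

# one CPD feature per timescale, e.g. one-, three- and six-month windows
daily_returns = np.random.default_rng(0).normal(0.0, 0.01, size=252)  # toy data
features = [regime_age(daily_returns, lb) for lb in (21, 63, 126)]
```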

[1] Christopher D. Manning, et al. Effective Approaches to Attention-based Neural Machine Translation, 2015, EMNLP.

[2] J. Poterba, et al. Mean Reversion in Stock Prices: Evidence and Implications, 1987.

[3] Carl E. Rasmussen, et al. Gaussian Processes for Machine Learning, 2005, Adaptive Computation and Machine Learning.

[4] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[5] Steven Reece, et al. Sequential Bayesian Prediction in the Presence of Changepoints and Faults, 2010, Comput. J.

[6] Narasimhan Jegadeesh, et al. Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency, 1993.

[7] W. Sharpe. Capital Asset Prices: A Theory of Market Equilibrium under Conditions of Risk, 1964.

[8] R. Thaler, et al. Does the Stock Market Overreact?, 1985.

[9] Robert Kosowski, et al. Momentum Strategies in Futures Markets and Trend-following Funds, 2013.

[10] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[11] Campbell R. Harvey, et al. Momentum Turning Points, 2019, SSRN Electronic Journal.

[12] Huaiyu Zhu. On Information and Sufficiency, 1997.

[13] Bryan Lim, et al. Time-series forecasting with deep learning: a survey, 2020, Philosophical Transactions of the Royal Society A.

[14] Edward Hoyle, et al. The Impact of Volatility Targeting, 2018, The Journal of Portfolio Management.

[15] Narasimhan Jegadeesh, et al. Seasonality in Stock Price Mean Reversion: Evidence from the U.S. and the U.K., 1991.

[16] Justin A. Sirignano, et al. Universal features of price formation in financial markets: perspectives from deep learning, 2018, Machine Learning and AI in Finance.

[17] Cheng Guo, et al. Entity Embeddings of Categorical Variables, 2016, arXiv.

[18] Yoshua Bengio, et al. Learning long-term dependencies with gradient descent is difficult, 1994, IEEE Trans. Neural Networks.

[19] Scott Lundberg, et al. A Unified Approach to Interpreting Model Predictions, 2017, NIPS.

[20] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Yann Dauphin, et al. Language Modeling with Gated Convolutional Networks, 2016, ICML.

[22] Jean-Philippe Bouchaud, et al. Two centuries of trend following, 2014, arXiv:1404.3274.

[23] Carlos Guestrin, et al. "Why Should I Trust You?": Explaining the Predictions of Any Classifier, 2016, arXiv.

[24] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[25] Nicolas Loeff, et al. Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting, 2021, International Journal of Forecasting.

[26] Yao Hua Ooi, et al. Time Series Momentum, 2011.

[27] Robert Kosowski, et al. Demystifying Time-Series Momentum Strategies: Volatility Estimators, Trading Rules and Pairwise Correlations, 2013, Market Momentum.

[28] Zihao Zhang, et al. Deep Reinforcement Learning for Trading, 2019, The Journal of Financial Data Science.

[29] Wenhu Chen, et al. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting, 2019, NeurIPS.

[30] Stefan Zohren, et al. Enhancing Time-Series Momentum Strategies Using Deep Neural Networks, 2019, The Journal of Financial Data Science.

[31] Guigang Zhang, et al. Deep Learning, 2016, Int. J. Semantic Comput.

[32] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[33] Hui Xiong, et al. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting, 2020, AAAI.

[34] Campbell R. Harvey, et al. Dissecting Investment Strategies in the Cross Section and Time Series, 2015.

[35] Stephen Roberts, et al. Enhancing Cross-Sectional Currency Strategies by Ranking Refinement with Transformer-based Architectures, 2021, arXiv.

[36] Ari Levine, et al. Which Trend Is Your Friend?, 2015.

[37] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.

[38] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[39] Thierry Roncalli, et al. Trend Filtering Methods for Momentum Strategies, 2011.

[40] John K. Wald, et al. Time series momentum and volatility scaling, 2016.

[41] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.

[42] The Failure of the Capital Asset Pricing Model (CAPM): An Update and Discussion, 2012.

[43] Xipeng Qiu, et al. A Survey of Transformers, 2021, AI Open.

[44] Stephen Roberts, et al. Slow Momentum with Fast Reversion: A Trading Strategy Using Deep Learning and Changepoint Detection, 2021, The Journal of Financial Data Science.