Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

Time series forecasting is an important problem across many domains, including prediction of solar plant energy output, electricity consumption, and traffic congestion. In this paper, we propose to tackle such forecasting problems with the Transformer [27]. Although we were impressed by its performance in our preliminary study, we found two major weaknesses: (1) locality-agnostic attention: the point-wise dot-product self-attention in the canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: the space complexity of the canonical Transformer grows quadratically with the sequence length $L$, making it infeasible to model long time series directly. To address these two issues, we first propose convolutional self-attention, which produces queries and keys with causal convolution so that local context is better incorporated into the attention mechanism. We then propose the LogSparse Transformer, which requires only $O(L(\log L)^{2})$ memory, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under a constrained memory budget. Our experiments on both synthetic data and real-world datasets show that our approach compares favorably with the state of the art.
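To make the two ideas above concrete, here is a minimal NumPy sketch, not the authors' implementation: the function names (causal_conv1d, logsparse_mask, conv_self_attention), kernel shapes, and the single-head setup are illustrative assumptions. It shows (1) queries and keys produced by a causal convolution over a local window, so each attention score reflects local shape rather than a single point, and (2) a log-sparse mask in which each position attends only to itself and to exponentially spaced past positions, so an attention row touches O(log L) cells instead of O(L).

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution along the time axis.
    x: (L, d) input embedding; w: (k, d, d) kernel of width k.
    The output at step t depends only on x[t-k+1 .. t] (left zero-padding),
    so no future information leaks into queries or keys."""
    L, d = x.shape
    k = w.shape[0]
    x_pad = np.concatenate([np.zeros((k - 1, d)), x], axis=0)  # pad on the left
    out = np.zeros((L, d))
    for t in range(L):
        window = x_pad[t:t + k]                 # the k most recent steps
        out[t] = np.einsum("kd,kde->e", window, w)
    return out

def logsparse_mask(L):
    """Boolean mask where step t may attend to itself and to steps
    t-1, t-2, t-4, ... (exponentially spaced), i.e. O(log L) cells per row."""
    mask = np.zeros((L, L), dtype=bool)
    for t in range(L):
        mask[t, t] = True
        step = 1
        while t - step >= 0:
            mask[t, t - step] = True
            step *= 2
    return mask

def conv_self_attention(x, wq, wk, wv, mask):
    """Single-head self-attention with convolutional queries/keys and a
    sparsity mask restricting which past positions each step can see."""
    q = causal_conv1d(x, wq)
    k = causal_conv1d(x, wk)
    v = x @ wv                                   # values remain point-wise
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)     # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Example: L = 16 steps, d = 8 features, kernel width k = 3
rng = np.random.default_rng(0)
L, d, k = 16, 8, 3
x = rng.normal(size=(L, d))
wq, wk = rng.normal(size=(k, d, d)), rng.normal(size=(k, d, d))
wv = rng.normal(size=(d, d))
out = conv_self_attention(x, wq, wk, wv, logsparse_mask(L))
print(out.shape)  # (16, 8)
```

Since each row of the log-sparse mask holds O(log L) nonzero entries, a layer costs O(L log L) memory, and stacking O(log L) such layers to cover the full receptive field gives the O(L(log L)^2) budget quoted in the abstract.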

[1] Valentin Flunkert et al. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks, 2017, International Journal of Forecasting.

[2] John F. MacGregor et al. Some Recent Advances in Forecasting and Control, 1968.

[3] Michael Y. Hu et al. Forecasting with artificial neural networks: The state of the art, 1997.

[4] T. Poggio et al. Deep vs. shallow networks: An approximation theory perspective, 2016, arXiv.

[5] Douglas Eck et al. An Improved Relative Self-Attention Mechanism for Transformer with Application to Music Generation, 2018, arXiv.

[6] Sander Bohte et al. Conditional Time Series Forecasting with Convolutional Neural Networks, 2017, arXiv:1703.04691.

[7] Guokun Lai et al. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks, 2017, SIGIR.

[8] Razvan Pascanu et al. On the difficulty of training recurrent neural networks, 2012, ICML.

[9] Jakob Uszkoreit et al. A Decomposable Attention Model for Natural Language Inference, 2016, EMNLP.

[10] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[11] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[12] Heiga Zen et al. WaveNet: A Generative Model for Raw Audio, 2016, SSW.

[13] Lorenzo Rosasco et al. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review, 2016, International Journal of Automation and Computing.

[14] Diederik P. Kingma et al. GPU Kernels for Block-Sparse Weights, 2017.

[15] Nicolas Chapados et al. Effective Bayesian Modeling of Groups of Related Count Time Series, 2014, ICML.

[16] Ugur Demiryurek et al. Deep Learning: A Generic Approach for Extreme Condition Traffic Forecasting, 2017, SDM.

[17] Quoc V. Le et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.

[18] Xifeng Yan et al. Multi-step Deep Autoregressive Forecasting with Latent States, 2019.

[19] Aaron C. Courville et al. Recurrent Batch Normalization, 2016, ICLR.

[20] Ilya Sutskever et al. Generating Long Sequences with Sparse Transformers, 2019, arXiv.

[21] Matthias W. Seeger et al. Deep State Space Models for Time Series Forecasting, 2018, NeurIPS.

[22] Gwilym M. Jenkins et al. Time series analysis, forecasting and control, 1971.

[23] Yoshua Bengio et al. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches, 2014, SSST@EMNLP.

[24] K. Torkkola et al. A Multi-Horizon Quantile Recurrent Forecaster, 2017, arXiv:1711.11053.

[25] Alec Radford et al. Improving Language Understanding by Generative Pre-Training, 2018.

[26] Jürgen Schmidhuber et al. Long Short-Term Memory, 1997, Neural Computation.

[27] Lukasz Kaiser et al. Attention is All you Need, 2017, NIPS.

[28] Daniel Jurafsky et al. Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context, 2018, ACL.

[29] Kilian Q. Weinberger et al. CondenseNet: An Efficient DenseNet Using Learned Group Convolutions, 2018, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30] J. Yosinski et al. Time-series Extreme Event Forecasting with Neural Networks at Uber, 2017.

[31] Alex Smola et al. Deep Factors with Gaussian Processes for Forecasting, 2018, arXiv.

[32] Lukasz Kaiser et al. Generating Wikipedia by Summarizing Long Sequences, 2018, ICLR.

[33] P. Young et al. Time series analysis, forecasting and control, 1972, IEEE Transactions on Automatic Control.

[34] Yisong Yue et al. Long-term Forecasting using Tensor-Train RNNs, 2017, arXiv.

[35] Melvin J. Hinich et al. Time Series Analysis by State Space Methods, 2001.

[36] Alex Graves et al. Generating Sequences With Recurrent Neural Networks, 2013, arXiv.

[37] Inderjit S. Dhillon et al. Temporal Regularized Matrix Factorization for High-dimensional Time Series Prediction, 2016, NIPS.

[38] Ke Li et al. A Time-Restricted Self-Attention Layer for ASR, 2018, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39] Yisong Yue et al. Long-term Forecasting using Higher Order Tensor RNNs, 2017.

[40] Evangelos Spiliotis et al. The M4 Competition: Results, findings, conclusion and way forward, 2018, International Journal of Forecasting.

[41] Kunihiko Fukushima et al. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, 1980, Biological Cybernetics.