Jing Gu | Qingyang Wu | Zhou Yu | Zhenzhong Lan
[1] Timothy P. Lillicrap, et al. Compressive Transformers for Long-Range Sequence Modelling, 2019, ICLR.
[2] Li Yang, et al. Big Bird: Transformers for Longer Sequences, 2020, NeurIPS.
[3] Sergio Gomez Colmenarejo, et al. Hybrid computing using a neural network with dynamic external memory, 2016, Nature.
[4] Han Fang, et al. Linformer: Self-Attention with Linear Complexity, 2020, arXiv.
[5] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.
[6] Richard Socher, et al. Pointer Sentinel Mixture Models, 2016, ICLR.
[7] Arman Cohan, et al. Longformer: The Long-Document Transformer, 2020, arXiv.
[8] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.
[9] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[10] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[11] Tianqi Chen, et al. Training Deep Nets with Sublinear Memory Cost, 2016, arXiv.
[12] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.
[13] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.
[14] Kenneth Heafield, et al. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Volume 1: Long Papers, August 7-12, 2016, Berlin, Germany, ACL.
[15] Ilya Sutskever, et al. Generating Long Sequences with Sparse Transformers, 2019, arXiv.
[16] Lukasz Kaiser, et al. Reformer: The Efficient Transformer, 2020, ICLR.
[17] Geoffrey E. Hinton, et al. Learning representations by back-propagating errors, 1986, Nature.