Shortformer: Better Language Modeling using Shorter Inputs

We explore the benefits of decreasing the input length of transformers. First, we show that initially training the model on short subsequences, before moving on to longer ones, both reduces overall training time and, surprisingly, gives a large improvement in perplexity. We then show how to improve the efficiency of recurrence methods in transformers, which let models condition on previously processed tokens when generating sequences longer than the maximum length the transformer can handle at once. Existing methods require computationally expensive relative position embeddings; we introduce a simple alternative: adding absolute position embeddings to queries and keys instead of to word embeddings, which efficiently produces superior results. By combining these techniques, we increase training speed by 65%, make generation nine times faster, and substantially improve perplexity on WikiText-103, without adding any parameters.
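As a rough illustration of the second idea, the following single-head self-attention sketch (plain NumPy; all names, shapes, and the causal mask are illustrative assumptions, not the paper's released code) adds absolute position embeddings to the queries and keys only, leaving the values and the layer's outputs position-agnostic:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def position_infused_attention(x, pos, w_q, w_k, w_v):
    """Single-head causal self-attention where absolute position
    embeddings are added to the queries and keys, not to the values.

    x:   (seq_len, d_model) token representations (no positions added)
    pos: (seq_len, d_model) absolute position embeddings
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q = (x + pos) @ w_q   # positions infused into queries
    k = (x + pos) @ w_k   # ...and into keys
    v = x @ w_v           # values stay position-agnostic
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # causal mask: each token attends only to itself and earlier tokens
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v
```

Because the cached token representations carry no positional information, previously processed tokens can be reused for recurrence without the relative-position machinery that existing methods require.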
