Long Range Arena: A Benchmark for Efficient Transformers

Transformers do not scale well to long sequence lengths, largely because of the quadratic complexity of self-attention. In recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking across a wide spectrum of tasks and datasets makes it difficult to assess relative model quality among the many candidates. This paper proposes a systematic and unified benchmark, Long Range Arena (LRA), specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural and synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. LRA paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. Our benchmark code will be released at this https URL.
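To make the complexity claim concrete, the sketch below (not taken from the paper or its codebase) contrasts vanilla softmax attention, which materializes an n×n score matrix, with the kernelized reordering used by linear-attention variants such as Linear Transformers and Performers, which avoids that matrix entirely. The ELU(x)+1 feature map is an illustrative stand-in; the individual papers define their own feature maps and normalizations.

```python
# Minimal NumPy sketch: O(n^2) softmax attention vs. O(n) kernelized attention.
# Assumptions: single head, no masking, phi(x) = ELU(x) + 1 as a simple positive
# feature map (illustrative only; not the exact map used by any specific model).
import numpy as np

def softmax_attention(Q, K, V):
    """Vanilla attention: builds an (n x n) score matrix -- quadratic in n."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                   # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                        # (n, d)

def linear_attention(Q, K, V):
    """Kernelized attention: never forms the (n x n) matrix -- linear in n."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))       # ELU(x) + 1, positive
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                                             # (d, d), independent of n
    Z = Qp @ Kp.sum(axis=0)                                   # (n,) normalizer
    return (Qp @ KV) / Z[:, None]                             # (n, d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 1024, 64                                           # LRA sequences reach 16K
    Q, K, V = (rng.standard_normal((n, d)) * 0.1 for _ in range(3))
    print(softmax_attention(Q, K, V).shape)                   # (1024, 64)
    print(linear_attention(Q, K, V).shape)                    # (1024, 64)
```

At n = 16K, the (n, n) score matrix alone holds roughly 268M entries per head, whereas the kernelized form only stores (d, d) and (n, d) intermediates; this gap is exactly what the long-context tasks in LRA are designed to stress.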
