Big Bird: Transformers for Longer Sequences

Transformer-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having $O(1)$ global tokens (such as CLS) that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences up to 8x longer than what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
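To make the structure of the sparse attention concrete, the sketch below builds a BigBird-style attention mask that combines the three components described above: sliding-window attention, a handful of global tokens (such as CLS), and a few random connections per query. This is an illustrative sketch only, not the authors' implementation; the parameter names (`window_size`, `num_global`, `num_random`) are assumptions chosen for this example.

```python
# Minimal sketch (assumed parameterization, not the official BigBird code):
# a boolean attention mask combining window, global, and random attention.
import numpy as np

def bigbird_attention_mask(seq_len, window_size=3, num_global=2, num_random=2, seed=0):
    """Return a (seq_len x seq_len) boolean mask; True means query i may attend to key j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Sliding-window attention: each token attends to its local neighborhood.
    for i in range(seq_len):
        lo = max(0, i - window_size)
        hi = min(seq_len, i + window_size + 1)
        mask[i, lo:hi] = True

    # Global tokens (e.g. CLS): they attend to every position,
    # and every position attends to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Random attention: each token additionally attends to a few random keys.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

# Each row has O(window + global + random) nonzeros, independent of seq_len,
# so the total attention cost grows linearly rather than quadratically.
print(bigbird_attention_mask(16).sum(axis=1))  # attended keys per query
```

Because the number of attended keys per query is a constant, the memory and compute of the attention layer scale as $O(n)$ in the sequence length, which is what enables the longer contexts reported in the abstract.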
