Pretraining Without Attention

Transformers have been essential to pretraining success in NLP. While other architectures have been used, they either achieve significantly worse downstream accuracy or require attention layers to match standard benchmarks such as GLUE. This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs). Our proposed model, the Bidirectional Gated SSM (BiGS), combines SSM layers with a multiplicative gating architecture that has been effective in simplified sequence modeling architectures. The model learns static routing layers that do not consider pairwise token interactions. Even so, BiGS matches BERT pretraining accuracy on GLUE and can be extended to long-form pretraining with 4096-token sequences without approximation. Analysis shows that while the models have similar average accuracy, the approach has inductive biases different from BERT's in terms of interactions and syntactic representations. All models from this work are available at https://github.com/jxiw/BiGS.
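
To make the "SSM layers plus multiplicative gating" description concrete, below is a minimal PyTorch sketch of a bidirectional gated SSM block in the spirit of BiGS. It is not the reference implementation from the linked repository: the DiagonalSSM class, the projection and gating layout, and all dimension choices are illustrative assumptions, and the explicit per-step recurrence is written for readability rather than speed.

```python
# Minimal sketch of a bidirectional gated SSM block (BiGS-style).
# NOT the reference implementation from https://github.com/jxiw/BiGS;
# the gating layout and the naive diagonal SSM scan are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiagonalSSM(nn.Module):
    """Diagonal state-space recurrence x_k = A x_{k-1} + B u_k, y_k = C x_k,
    run with an explicit (slow) scan over the sequence."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # Parameterize A via a negative exponent so its entries lie in (0, 1).
        self.log_neg_a = nn.Parameter(torch.randn(d_model, d_state))
        self.B = nn.Parameter(torch.randn(d_model, d_state) / d_state**0.5)
        self.C = nn.Parameter(torch.randn(d_model, d_state) / d_state**0.5)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, length, d_model)
        a = torch.exp(-torch.exp(self.log_neg_a))          # stable decay in (0, 1)
        batch, length, d_model = u.shape
        state = u.new_zeros(batch, d_model, self.B.shape[1])
        outputs = []
        for k in range(length):
            state = a * state + self.B * u[:, k, :, None]  # advance the recurrence
            outputs.append((state * self.C).sum(-1))       # read out y_k
        return torch.stack(outputs, dim=1)                 # (batch, length, d_model)


class BiGSBlock(nn.Module):
    """Forward and backward SSM passes combined by multiplicative gating."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_v = nn.Linear(d_model, d_model)
        self.proj_f = nn.Linear(d_model, d_model)
        self.proj_b = nn.Linear(d_model, d_model)
        self.ssm_f = DiagonalSSM(d_model, d_state)
        self.ssm_b = DiagonalSSM(d_model, d_state)
        self.proj_out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        v = F.gelu(self.proj_v(h))
        # Static sequence routing: one SSM pass left-to-right, one on the
        # reversed sequence, with no query-key comparisons anywhere.
        f = self.ssm_f(F.gelu(self.proj_f(h)))
        b = self.ssm_b(F.gelu(self.proj_b(h)).flip([1])).flip([1])
        # Elementwise (multiplicative) gate lets token content modulate
        # the routed features before the residual connection.
        return x + self.proj_out(v * (f + b))


if __name__ == "__main__":
    block = BiGSBlock(d_model=64)
    tokens = torch.randn(2, 128, 64)    # (batch, length, hidden)
    print(block(tokens).shape)          # torch.Size([2, 128, 64])
```

Because the directional SSM passes are fixed per layer, the sequence mixing itself never computes pairwise token interactions; in this sketch the elementwise gate is the only place where token content modulates how routed features are combined.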
