Token Dropping for Efficient BERT Pretraining

Transformer-based models generally allocate the same amount of computation to every token in a given sequence. We develop a simple but effective “token dropping” method to accelerate the pretraining of transformer models, such as BERT, without degrading their performance on downstream tasks. In particular, starting from an intermediate layer, we drop unimportant tokens so that the model focuses its limited computational budget on the important ones. The dropped tokens are later picked up by the last layer of the model, so the model still produces full-length sequences. We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead. In our experiments, this simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.
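
The sketch below illustrates the layer structure the abstract describes: the lower layers see the full sequence, the middle layers see only the tokens judged important, and the last layer sees the full-length sequence again. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `TokenDroppingEncoder`, the `drop_at` and `keep_ratio` parameters, the use of `nn.TransformerEncoderLayer`, and the generic `importance_scores` input (standing in for the per-token MLM loss the paper uses as its importance signal) are all illustrative choices.

```python
# Minimal sketch of token dropping in a BERT-style encoder (PyTorch).
# Assumptions: `importance_scores` approximates each token's MLM loss
# (higher = more important); layer hyperparameters are illustrative.
import torch
import torch.nn as nn


class TokenDroppingEncoder(nn.Module):
    """Full-sequence lower layers -> reduced-sequence middle layers -> full-length last layer."""

    def __init__(self, d_model=768, nhead=12, num_layers=12, drop_at=6, keep_ratio=0.5):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.early_layers = nn.ModuleList(make_layer() for _ in range(drop_at))
        self.middle_layers = nn.ModuleList(make_layer() for _ in range(num_layers - drop_at - 1))
        self.last_layer = make_layer()
        self.keep_ratio = keep_ratio

    def forward(self, hidden, importance_scores):
        # hidden: (batch, seq_len, d_model); importance_scores: (batch, seq_len).
        for layer in self.early_layers:
            hidden = layer(hidden)

        batch, seq_len, d_model = hidden.shape
        num_keep = max(1, int(seq_len * self.keep_ratio))

        # Keep the top-scoring tokens, preserving their original order.
        keep_idx = importance_scores.topk(num_keep, dim=1).indices.sort(dim=1).values
        idx = keep_idx.unsqueeze(-1).expand(-1, -1, d_model)

        # Middle layers only process the "important" tokens.
        kept = hidden.gather(1, idx)
        for layer in self.middle_layers:
            kept = layer(kept)

        # Re-insert the processed tokens; dropped tokens keep their
        # hidden states from the layer where they were dropped.
        merged = hidden.clone()
        merged.scatter_(1, idx, kept)

        # The last layer again sees the full-length sequence.
        return self.last_layer(merged)
```

In this sketch, dropped tokens simply carry their intermediate hidden states forward and rejoin the sequence before the final layer, mirroring the "picked up by the last layer" step in the abstract; the importance signal comes for free because the MLM loss is already computed during pretraining.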
