Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling

Ensembling BERT models often improves accuracy significantly, but at the cost of much greater computation and memory. In this work, we propose Multi-CLS BERT, a novel ensembling method for CLS-based prediction tasks that is almost as efficient as a single BERT model. Multi-CLS BERT uses multiple CLS tokens with a parameterization and objective that encourage their diversity. Thus, instead of fine-tuning each BERT model in an ensemble (and running them all at test time), we need only fine-tune our single Multi-CLS BERT model (and run the one model at test time, ensembling just the multiple final CLS embeddings). To test its effectiveness, we build Multi-CLS BERT on top of a state-of-the-art pretraining method for BERT (Aroca-Ouellette and Rudzicz, 2020). In experiments on GLUE and SuperGLUE, we show that Multi-CLS BERT reliably improves both overall accuracy and confidence estimation. When only 100 training samples are available in GLUE, the Multi-CLS BERT_Base model can even outperform the corresponding BERT_Large model. We analyze the behavior of Multi-CLS BERT, showing that it has many of the same characteristics and behavior as a typical 5-way BERT ensemble, but with nearly 4 times less computation and memory.

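To make the mechanism above concrete, the following is a minimal sketch (not the authors' released implementation) of how multiple CLS tokens can be ensembled within a single forward pass: K learned CLS embeddings are prepended to the input, one shared encoder processes them, and the K resulting CLS predictions are averaged. The class name MultiCLSClassifier, the per-CLS linear heads, and all hyperparameter values are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the multi-CLS idea (illustrative only, not the authors' code):
# K learned CLS embeddings are prepended to the token sequence, a single shared
# Transformer encoder processes them, and the K per-CLS logits are averaged,
# giving an ensemble-like prediction from one forward pass.
# Positional embeddings and attention masking are omitted for brevity.
import torch
import torch.nn as nn


class MultiCLSClassifier(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, vocab_size=30522, hidden=256, num_cls=5, num_labels=2,
                 num_layers=4, num_heads=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        # K separately parameterized CLS embeddings (separate parameters are one
        # simple way to encourage diverse CLS representations).
        self.cls_emb = nn.Parameter(torch.randn(num_cls, hidden) * 0.02)
        layer = nn.TransformerEncoderLayer(hidden, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # One classification head per CLS position (an assumption of this sketch).
        self.heads = nn.ModuleList(
            nn.Linear(hidden, num_labels) for _ in range(num_cls)
        )

    def forward(self, input_ids):
        batch = input_ids.size(0)
        cls = self.cls_emb.unsqueeze(0).expand(batch, -1, -1)   # (B, K, H)
        x = torch.cat([cls, self.tok_emb(input_ids)], dim=1)    # (B, K+T, H)
        h = self.encoder(x)                                     # one forward pass
        # Ensemble: average the logits produced from the K final CLS embeddings.
        logits = torch.stack(
            [head(h[:, k]) for k, head in enumerate(self.heads)]
        )
        return logits.mean(dim=0)                               # (B, num_labels)


# Usage: a single forward pass yields the ensembled prediction.
model = MultiCLSClassifier()
dummy_ids = torch.randint(0, 30522, (2, 16))
print(model(dummy_ids).shape)  # torch.Size([2, 2])
```
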
[1] Y. Choi, et al. Balancing Lexical and Semantic Quality in Abstractive Summarization, 2023, ACL.

[2] Wei Wu, et al. Robust Lottery Tickets for Pre-trained Language Models, 2022, ACL.

[3] Hyung Won Chung, et al. Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?, 2022, EMNLP.

[4] Kentaro Inui, et al. Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model, 2022, BigScience.

[5] Qun Liu, et al. Exploring Extreme Parameter Compression for Pre-trained Language Models, 2022, ICLR.

[6] T. Zhao, et al. CAMERO: Consistency Regularized Ensemble of Perturbed Language Models with Weight Sharing, 2022, ACL.

[7] Christos Tsirigotis, et al. Simplicial Embeddings in Self-Supervised Learning and Downstream Classification, 2022, ICLR.

[8] Richard Yuanzhe Pang, et al. Token Dropping for Efficient BERT Pretraining, 2022, ACL.

[9] Alessandro Moschitti, et al. Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems, 2022, EMNLP.

[10] Jun Huang, et al. From Dense to Sparse: Contrastive Pruning for Better Pre-trained Language Model Compression, 2021, AAAI.

[11] Jacob Eisenstein, et al. The MultiBERTs: BERT Reproductions for Robustness Analysis, 2021, ICLR.

[12] E. Chng, et al. An Embarrassingly Simple Model for Dialogue Relation Extraction, 2020, ICASSP 2022.

[13] Xiaodong Liu, et al. AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models, 2022, arXiv.

[14] Gang Chen, et al. SkipBERT: Efficient Inference with Shallow Layer Skipping, 2022, ACL.

[15] Xuanjing Huang, et al. Flooding-X: Improving BERT's Resistance to Adversarial Attacks via Loss-Restricted Fine-Tuning, 2022, ACL.

[16] Jacob Eisenstein, et al. Sparse, Dense, and Attentional Representations for Text Retrieval, 2020, Transactions of the Association for Computational Linguistics.

[18] Percy Liang, et al. Prefix-Tuning: Optimizing Continuous Prompts for Generation, 2021, ACL.

[19] Yuanzhi Li, et al. Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning, 2020, ICLR.

[20] Frank Rudzicz, et al. On Losses for Modern Language Models, 2020, EMNLP.

[21] Dan Iter, et al. Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models, 2020, ACL.

[22] M. Zaharia, et al. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, 2020, SIGIR.

[23] Xipeng Qiu, et al. Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation, 2020, Journal of Computer Science and Technology.

[24] Dustin Tran, et al. BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning, 2020, ICLR.

[25] Hao Tian, et al. ERNIE 2.0: A Continual Pre-training Framework for Language Understanding, 2019, AAAI.

[26] Mirella Lapata, et al. Text Summarization with Pretrained Encoders, 2019, EMNLP.

[27] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.

[28] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.

[29] Mona Attariyan, et al. Parameter-Efficient Transfer Learning for NLP, 2019, ICML.

[30] Samuel R. Bowman, et al. Neural Network Acceptability Judgments, 2018, Transactions of the Association for Computational Linguistics.

[31] Dan Roth, et al. Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences, 2018, NAACL.

[32] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[33] Andrew Gordon Wilson, et al. Averaging Weights Leads to Wider Optima and Better Generalization, 2018, UAI.

[34] Honglak Lee, et al. An efficient framework for learning sentence representations, 2018, ICLR.

[35] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.

[36] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[37] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, Journal of Machine Learning Research.

[38] Christopher Potts, et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, 2013, EMNLP.

[39] Zornitsa Kozareva, et al. SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning, 2011, SemEval.

[40] Hector J. Levesque, et al. The Winograd Schema Challenge, 2011, AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.

[41] A. Vaswani, et al. Attention is All You Need, 2017, NIPS.

[42] David MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks, 1995.

[43] D. Ruppert. Efficient Estimations from a Slowly Convergent Robbins-Monro Process, 1988.