Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling

Ensembling BERT models often improves accuracy significantly, but at the cost of much greater computation and memory. In this work, we propose Multi-CLS BERT, a novel ensembling method for CLS-based prediction tasks that is almost as efficient as a single BERT model. Multi-CLS BERT uses multiple CLS tokens with a parameterization and objective that encourage their diversity. Thus, instead of fine-tuning each BERT model in an ensemble (and running them all at test time), we need only fine-tune our single Multi-CLS BERT model (and run the one model at test time, ensembling just the multiple final CLS embeddings). To test its effectiveness, we build Multi-CLS BERT on top of a state-of-the-art pretraining method for BERT (Aroca-Ouellette and Rudzicz, 2020). In experiments on GLUE and SuperGLUE, we show that Multi-CLS BERT reliably improves both overall accuracy and confidence estimation. When only 100 training samples are available in GLUE, the Multi-CLS BERT_Base model can even outperform the corresponding BERT_Large model. We analyze the behavior of Multi-CLS BERT, showing that it has many of the same characteristics and behavior as a typical 5-way BERT ensemble, but with nearly 4 times less computation and memory.

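To make the mechanism above concrete, the following is a minimal sketch (not the authors' released implementation) of how multiple CLS tokens can be ensembled within a single forward pass: K learned CLS embeddings are prepended to the input, one shared encoder processes them, and the K resulting CLS predictions are averaged. The class name MultiCLSClassifier, the per-CLS linear heads, and all hyperparameter values are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the multi-CLS idea (illustrative only, not the authors' code):
# K learned CLS embeddings are prepended to the token sequence, a single shared
# Transformer encoder processes them, and the K per-CLS logits are averaged,
# giving an ensemble-like prediction from one forward pass.
# Positional embeddings and attention masking are omitted for brevity.
import torch
import torch.nn as nn


class MultiCLSClassifier(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, vocab_size=30522, hidden=256, num_cls=5, num_labels=2,
                 num_layers=4, num_heads=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        # K separately parameterized CLS embeddings (separate parameters are one
        # simple way to encourage diverse CLS representations).
        self.cls_emb = nn.Parameter(torch.randn(num_cls, hidden) * 0.02)
        layer = nn.TransformerEncoderLayer(hidden, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # One classification head per CLS position (an assumption of this sketch).
        self.heads = nn.ModuleList(
            nn.Linear(hidden, num_labels) for _ in range(num_cls)
        )

    def forward(self, input_ids):
        batch = input_ids.size(0)
        cls = self.cls_emb.unsqueeze(0).expand(batch, -1, -1)   # (B, K, H)
        x = torch.cat([cls, self.tok_emb(input_ids)], dim=1)    # (B, K+T, H)
        h = self.encoder(x)                                     # one forward pass
        # Ensemble: average the logits produced from the K final CLS embeddings.
        logits = torch.stack(
            [head(h[:, k]) for k, head in enumerate(self.heads)]
        )
        return logits.mean(dim=0)                               # (B, num_labels)


# Usage: a single forward pass yields the ensembled prediction.
model = MultiCLSClassifier()
dummy_ids = torch.randint(0, 30522, (2, 16))
print(model(dummy_ids).shape)  # torch.Size([2, 2])
```
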
[1] Y. Choi, et al. Balancing Lexical and Semantic Quality in Abstractive Summarization, 2023, ACL.

[2] Wei Wu, et al. Robust Lottery Tickets for Pre-trained Language Models, 2022, ACL.

[3] Hyung Won Chung, et al. Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?, 2022, EMNLP.

[4] Kentaro Inui, et al. Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model, 2022, BigScience.

[5] Qun Liu, et al. Exploring Extreme Parameter Compression for Pre-trained Language Models, 2022, ICLR.

[6] T. Zhao, et al. CAMERO: Consistency Regularized Ensemble of Perturbed Language Models with Weight Sharing, 2022, ACL.

[7] Christos Tsirigotis, et al. Simplicial Embeddings in Self-Supervised Learning and Downstream Classification, 2022, ICLR.

[8] Richard Yuanzhe Pang, et al. Token Dropping for Efficient BERT Pretraining, 2022, ACL.

[9] Alessandro Moschitti, et al. Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems, 2022, EMNLP.

[10] Jun Huang, et al. From Dense to Sparse: Contrastive Pruning for Better Pre-trained Language Model Compression, 2021, AAAI.

[11] Jacob Eisenstein, et al. The MultiBERTs: BERT Reproductions for Robustness Analysis, 2021, ICLR.

[12] E. Chng, et al. An Embarrassingly Simple Model for Dialogue Relation Extraction, 2020, ICASSP 2022.

[13] Xiaodong Liu, et al. AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models, 2022, arXiv.

[14] Gang Chen, et al. SkipBERT: Efficient Inference with Shallow Layer Skipping, 2022, ACL.

[15] Xuanjing Huang, et al. Flooding-X: Improving BERT's Resistance to Adversarial Attacks via Loss-Restricted Fine-Tuning, 2022, ACL.

[16] Jacob Eisenstein, et al. Sparse, Dense, and Attentional Representations for Text Retrieval, 2020, Transactions of the Association for Computational Linguistics.

[18] Percy Liang, et al. Prefix-Tuning: Optimizing Continuous Prompts for Generation, 2021, ACL.

[19] Yuanzhi Li, et al. Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning, 2020, ICLR.

[20] Frank Rudzicz, et al. On Losses for Modern Language Models, 2020, EMNLP.

[21] Dan Iter, et al. Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models, 2020, ACL.

[22] M. Zaharia, et al. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, 2020, SIGIR.

[23] Xipeng Qiu, et al. Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation, 2020, Journal of Computer Science and Technology.

[24] Dustin Tran, et al. BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning, 2020, ICLR.

[25] Hao Tian, et al. ERNIE 2.0: A Continual Pre-training Framework for Language Understanding, 2019, AAAI.

[26] Mirella Lapata, et al. Text Summarization with Pretrained Encoders, 2019, EMNLP.

[27] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.

[28] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.

[29] Mona Attariyan, et al. Parameter-Efficient Transfer Learning for NLP, 2019, ICML.

[30] Samuel R. Bowman, et al. Neural Network Acceptability Judgments, 2018, Transactions of the Association for Computational Linguistics.

[31] Dan Roth, et al. Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences, 2018, NAACL.

[32] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[33] Andrew Gordon Wilson, et al. Averaging Weights Leads to Wider Optima and Better Generalization, 2018, UAI.

[34] Honglak Lee, et al. An efficient framework for learning sentence representations, 2018, ICLR.

[35] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.

[36] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[37] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, Journal of Machine Learning Research.

[38] Christopher Potts, et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, 2013, EMNLP.

[39] Zornitsa Kozareva, et al. SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning, 2011, SemEval.

[40] Hector J. Levesque, et al. The Winograd Schema Challenge, 2011, AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.

[41] A. Vaswani, et al. Attention is All You Need, 2017, NIPS.

[42] David MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks, 1995.

[43] D. Ruppert. Efficient Estimations from a Slowly Convergent Robbins-Monro Process, 1988.