Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent Expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training the parameters on data for the new domain, and then merging the resulting model back into the set for future use. Experiments show that BTM improves in- and out-of-domain perplexities as compared to GPT-style Transformer LMs, when controlling for training cost. Through extensive analysis, we show that these results are robust to different ELM initialization schemes, but require expert domain specialization; LM ensembles with random data splits do not perform well. We also present a study of scaling BTM into a new corpus of 64 domains (192B whitespace-separated tokens in total); the resulting LM (22.4B total parameters) performs as well as a Transformer LM trained with 2.5× more compute. These gains grow with the number of domains, suggesting more aggressive parallelism could be used to efficiently train larger models in future work.
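To make the branch-train-merge loop concrete, the following is a minimal sketch in PyTorch. It is not the authors' implementation: the ToyLM model, the random-token domain data, train_expert, and all hyperparameters are illustrative placeholders chosen only so the sketch runs end to end; what it does preserve is the structure of BTM (branch a new expert from an average of the current expert set, train it on one domain with no cross-expert communication, then merge it back into the set, optionally collapsing the whole set into one LM by uniform parameter averaging).

# Minimal BTM sketch; ToyLM, train_expert, and the data are hypothetical stand-ins.
import copy
from typing import List

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, SEQ = 100, 32, 16


class ToyLM(nn.Module):
    """A stand-in next-token predictor (embedding -> linear head)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):  # tokens: (batch, seq)
        return self.head(self.embed(tokens))


def average_parameters(experts: List[ToyLM]) -> ToyLM:
    """Merge: collapse a set of expert LMs (ELMs) into one LM by uniform weight averaging."""
    merged = copy.deepcopy(experts[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            param.copy_(torch.stack(
                [dict(e.named_parameters())[name] for e in experts]).mean(dim=0))
    return merged


def train_expert(model: ToyLM, domain_tokens: torch.Tensor, steps: int = 50) -> None:
    """Train: specialize one ELM on its domain, with no cross-expert synchronization."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        logits = model(domain_tokens[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, VOCAB),
                               domain_tokens[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()


def btm(domains: List[torch.Tensor]) -> List[ToyLM]:
    experts: List[ToyLM] = []
    for domain_tokens in domains:
        # Branch: initialize the new ELM from (an average of) the current expert set,
        # or from scratch when the set is empty (a seed LM in the paper).
        elm = average_parameters(experts) if experts else ToyLM()
        train_expert(elm, domain_tokens)  # Train independently on this domain
        experts.append(elm)               # Merge the ELM back into the expert set
    return experts


if __name__ == "__main__":
    fake_domains = [torch.randint(0, VOCAB, (8, SEQ)) for _ in range(4)]
    elms = btm(fake_domains)
    single_lm = average_parameters(elms)  # collapse back to one LM for cheap inference
    print(f"trained {len(elms)} ELMs and merged them into one LM")

At inference time the paper also describes an alternative to parameter averaging: ensembling the ELMs' next-token distributions, weighted by an estimated posterior over domains given the context, which generalizes better to new domains at the cost of running multiple experts; the uniform averaging shown above trades some of that quality for single-model inference cost.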
