Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
Margaret Li | Suchin Gururangan | Tim Dettmers | M. Lewis | Tim Althoff | Noah A. Smith | Luke Zettlemoyer