ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning

Despite the recent success of multi-task learning and transfer learning for natural language processing (NLP), few works have systematically studied the effect of scaling up the number of tasks during pre-training. Towards this goal, this paper introduces ExMix (Extreme Mixture): a massive collection of 107 supervised NLP tasks across diverse domains and task families. Using ExMix, we study the effect of multi-task pre-training at the largest scale to date, and analyze co-training transfer among common families of tasks. Through this analysis, we show that manually curating an ideal set of tasks for multi-task pre-training is not straightforward, and that multi-task scaling can vastly improve models on its own. Finally, we propose ExT5: a model pre-trained using a multi-task objective of self-supervised span denoising and supervised ExMix. Via extensive experiments, we show that ExT5 outperforms strong T5 baselines on SuperGLUE, GEM, Rainbow, closed-book question answering tasks, and several tasks outside of ExMix. ExT5 also significantly improves sample efficiency during pre-training.
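To make the pre-training recipe above concrete, the following is a minimal sketch of how supervised tasks cast to the text-to-text format can be mixed with a self-supervised span-denoising objective. The task names, dataset sizes, corruption hyperparameters, and mixing-rate cap are illustrative assumptions for the example, not the actual ExT5 configuration (which was built on the T5 codebase).

```python
import random

# Illustrative sketch only: mixes a self-supervised span-denoising task with
# supervised tasks cast to text-to-text. Names, sizes, and hyperparameters
# below are assumptions for the example, not the ExT5 configuration.

def span_denoise(text, noise_density=0.15, mean_span_len=3):
    """Replace random contiguous spans with sentinel tokens (simplified T5-style)."""
    tokens = text.split()
    inputs, targets = [], []
    sentinel, i = 0, 0
    while i < len(tokens):
        # Start a masked span so that roughly noise_density of tokens are corrupted.
        if random.random() < noise_density / mean_span_len:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            targets.extend(tokens[i:i + mean_span_len])
            sentinel += 1
            i += mean_span_len
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

def mixing_rates(task_sizes, cap=50_000):
    """Examples-proportional mixing with a per-task cap on the effective size."""
    capped = {name: min(size, cap) for name, size in task_sizes.items()}
    total = sum(capped.values())
    return {name: n / total for name, n in capped.items()}

# Hypothetical mixture: one unlabeled denoising corpus plus two supervised tasks.
TASKS = {
    "span_denoising": lambda: span_denoise(
        "multi task pre training mixes supervised and self supervised objectives"
    ),
    "nli": lambda: (
        "nli premise: A man is cooking. hypothesis: A person prepares food.",
        "entailment",
    ),
    "summarization": lambda: ("summarize: <document text>", "<summary text>"),
}
SIZES = {"span_denoising": 1_000_000, "nli": 400_000, "summarization": 200_000}

rates = mixing_rates(SIZES)
names, probs = zip(*rates.items())
for _ in range(5):
    task = random.choices(names, weights=probs)[0]  # sample one task per example
    source, target = TASKS[task]()
    print(f"[{task}] {source} -> {target}")
```

The design choice the sketch illustrates is capped examples-proportional mixing: without the cap, the largest corpus (here the span-denoising data) would dominate the sampled mixture and drown out the smaller supervised tasks.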
