What to Pre-Train on? Efficient Intermediate Task Selection

Intermediate task fine-tuning has been shown to yield large transfer gains across many NLP tasks. With an abundance of candidate datasets and pre-trained language models, it has become infeasible to experiment with all combinations to find the best transfer setting. In this work, we provide a comprehensive comparison of different methods for efficiently identifying beneficial tasks for intermediate transfer learning. We focus on parameter- and computationally efficient adapter settings, highlight different data-availability scenarios, and provide expense estimates for each method. We experiment with a diverse set of 42 intermediate and 11 target English classification, multiple-choice, question answering, and sequence tagging tasks. Our results demonstrate that efficient embedding-based methods, which rely solely on the respective datasets, outperform computationally expensive few-shot fine-tuning approaches. Our best methods achieve an average Regret@3 of 1% across all target tasks, demonstrating that we are able to efficiently identify the best datasets for intermediate training.
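
The abstract reports Regret@3 without defining it. As a minimal sketch, assuming the usual reading of Regret@k (the relative gap in target-task performance between the best intermediate task overall and the best one among the top k proposed by a selection method), the metric could be computed as follows. The function name, task names, and scores are hypothetical and purely illustrative; they are not results from this work.

```python
from typing import Dict, List


def regret_at_k(performance: Dict[str, float], ranking: List[str], k: int) -> float:
    """Relative performance lost by only trying the top-k ranked tasks.

    performance: target-task score obtained after transferring from each
                 intermediate task (assumed positive, higher is better).
    ranking:     intermediate tasks ordered by the selection method,
                 most promising first.
    """
    if k < 1 or not ranking:
        raise ValueError("need k >= 1 and a non-empty ranking")
    best_overall = max(performance.values())
    best_in_top_k = max(performance[task] for task in ranking[:k])
    return (best_overall - best_in_top_k) / best_overall


# Hypothetical scores: the true best task (task_a) is ranked last.
scores = {"task_a": 0.80, "task_b": 0.78, "task_c": 0.70, "task_d": 0.60}
ranking = ["task_c", "task_b", "task_d", "task_a"]

print(f"Regret@3 = {regret_at_k(scores, ranking, k=3):.3f}")
```

Under this reading, a Regret@3 of 1% would mean that training only on the three top-ranked intermediate tasks loses about 1% of the performance achievable with the best intermediate task.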
