Exploring Demonstration Ensembling for In-context Learning

In-context learning (ICL) operates by showing language models (LMs) examples of input-output pairs for a given task, i.e., demonstrations. The standard approach to ICL is to prompt the LM with the concatenated demonstrations followed by the test input. This approach has two notable drawbacks. First, concatenation offers almost no control over the contribution of each demonstration to the model's prediction, which can be sub-optimal when some demonstrations are irrelevant to the test example. Second, due to the input length limit of some transformer models, it may be infeasible to fit many demonstrations into the context, especially for long-input tasks. In this work, we explore Demonstration Ensembling (DENSE) as an alternative to simple concatenation. DENSE predicts outputs using subsets (i.e., buckets) of the demonstrations and then combines the output probabilities resulting from each subset to produce the final prediction. We study different ensembling methods using GPT-J and experiment on 12 language tasks. Our experiments show that weighted max ensembling outperforms vanilla concatenation by as much as 2.4 average points. Code is available at https://github.com/mukhal/icl-ensembling.
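
The abstract only sketches DENSE at a high level. Below is a minimal, hypothetical Python illustration of the bucket-then-combine idea. The `label_probs` scorer is a placeholder (a real setup would query GPT-J or a similar LM), and the confidence-based weighting is only one plausible reading of "weighted max ensembling"; the repository linked above contains the authors' actual implementation.

```python
import numpy as np

def label_probs(prompt: str, test_input: str, labels: list[str]) -> np.ndarray:
    """Placeholder scorer. In a real setup this would condition GPT-J (or a
    similar LM) on `prompt` + `test_input` and return a normalized
    probability distribution over `labels`."""
    rng = np.random.default_rng(abs(hash((prompt, test_input))) % (2**32))
    p = rng.random(len(labels))
    return p / p.sum()

def dense_predict(demos: list[str], test_input: str, labels: list[str],
                  num_buckets: int = 4) -> str:
    """Sketch of Demonstration Ensembling (DENSE): partition the demonstrations
    into buckets, score the test input once per bucket, then combine the
    per-bucket label probabilities into a single prediction."""
    buckets = [demos[i::num_buckets] for i in range(num_buckets)]
    per_bucket, weights = [], []
    for bucket in buckets:
        prompt = "\n\n".join(bucket)
        probs = label_probs(prompt, test_input, labels)
        per_bucket.append(probs)
        # Confidence-weighted combination (one plausible reading of
        # "weighted max"): weight each bucket by its maximum label probability.
        weights.append(probs.max())
    per_bucket = np.stack(per_bucket)            # (num_buckets, num_labels)
    weights = np.asarray(weights)
    weights = weights / weights.sum()            # normalize bucket weights
    combined = (weights[:, None] * per_bucket).sum(axis=0)
    return labels[int(combined.argmax())]

if __name__ == "__main__":
    demos = [f"Review: example {i}\nSentiment: positive" for i in range(8)]
    print(dense_predict(demos, "Review: great movie!\nSentiment:",
                        ["positive", "negative"]))
```

The key point is that each bucket yields its own output distribution, so an irrelevant or noisy demonstration only affects one bucket rather than the entire prompt, and no single prompt needs to hold all demonstrations at once.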
