LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion

We present LLM-Blender, an ensembling framework designed to attain consistently superior performance by leveraging the diverse strengths of multiple open-source large language models (LLMs). Our framework consists of two modules, PairRanker and GenFuser, motivated by the observation that the best LLM can vary significantly from one example to another. PairRanker employs a specialized pairwise comparison method to distinguish subtle differences between candidate outputs: it jointly encodes the input text and a pair of candidates with cross-attention encoders to determine which of the two is superior. Our results demonstrate that PairRanker exhibits the highest correlation with ChatGPT-based rankings. GenFuser then merges the top-ranked candidates, generating an improved output that capitalizes on their strengths and mitigates their weaknesses. To facilitate large-scale evaluation, we introduce a benchmark dataset, MixInstruct, a mixture of multiple instruction datasets featuring oracle pairwise comparisons. LLM-Blender significantly outperforms individual LLMs and baseline methods across various metrics, establishing a substantial performance gap.
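To make the rank-then-fuse design concrete, below is a minimal sketch of the two-stage pipeline. It assumes hypothetical checkpoint names ("your-org/pair-ranker", "your-org/gen-fuser"), a single-logit pairwise classifier head, and a simple separator-based input format; the released PairRanker/GenFuser models and their exact interfaces may differ.

```python
# Sketch of an LLM-Blender-style pipeline: pairwise ranking, then fusion.
# Checkpoint names and input formats are placeholders, not the official ones.
from itertools import combinations

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForSeq2SeqLM,
)

# --- PairRanker-style pairwise scoring (hypothetical checkpoint) ----------
RANKER_NAME = "your-org/pair-ranker"  # placeholder model ID
ranker_tok = AutoTokenizer.from_pretrained(RANKER_NAME)
ranker = AutoModelForSequenceClassification.from_pretrained(RANKER_NAME)

def pairwise_rank(instruction: str, candidates: list[str]) -> list[int]:
    """Return candidate indices sorted by number of pairwise wins."""
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        # Jointly encode the input with both candidates; the classifier head
        # is assumed to emit one logit, >0 meaning candidate i is better.
        text = f"{instruction} </s> {candidates[i]} </s> {candidates[j]}"
        inputs = ranker_tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            score = ranker(**inputs).logits.squeeze().item()
        wins[i if score > 0 else j] += 1
    return sorted(range(len(candidates)), key=lambda k: -wins[k])

# --- GenFuser-style fusion of the top-K candidates (hypothetical) ---------
FUSER_NAME = "your-org/gen-fuser"  # placeholder model ID
fuser_tok = AutoTokenizer.from_pretrained(FUSER_NAME)
fuser = AutoModelForSeq2SeqLM.from_pretrained(FUSER_NAME)

def blend(instruction: str, candidates: list[str], top_k: int = 3) -> str:
    """Rank all candidates pairwise, then fuse the top-K into one output."""
    order = pairwise_rank(instruction, candidates)
    top = [candidates[k] for k in order[:top_k]]
    prompt = instruction + " </s> " + " </s> ".join(top)
    inputs = fuser_tok(prompt, return_tensors="pt", truncation=True)
    output_ids = fuser.generate(**inputs, max_new_tokens=256)
    return fuser_tok.decode(output_ids[0], skip_special_tokens=True)
```

Usage would be `blend(instruction, candidate_outputs)` where `candidate_outputs` are the responses produced by the N base LLMs for the same instruction; the O(N^2) pairwise loop is what lets the ranker pick up subtle differences that pointwise scoring can miss.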
