LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion

We present LLM-Blender, an ensembling framework designed to attain consistently superior performance by leveraging the diverse strengths of multiple open-source large language models (LLMs). Our framework consists of two modules, PairRanker and GenFuser, motivated by the observation that the best LLM can vary significantly from one example to another. PairRanker employs a specialized pairwise comparison method to distinguish subtle differences between candidate outputs: it jointly encodes the input text and a pair of candidates with cross-attention encoders to determine which of the two is superior. Our results demonstrate that PairRanker exhibits the highest correlation with ChatGPT-based rankings. GenFuser then merges the top-ranked candidates, generating an improved output that capitalizes on their strengths and mitigates their weaknesses. To facilitate large-scale evaluation, we introduce a benchmark dataset, MixInstruct, a mixture of multiple instruction datasets featuring oracle pairwise comparisons. LLM-Blender significantly outperforms individual LLMs and baseline methods across various metrics, establishing a substantial performance gap.
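To make the rank-then-fuse design concrete, below is a minimal sketch of the two-stage pipeline. It assumes hypothetical checkpoint names ("your-org/pair-ranker", "your-org/gen-fuser"), a single-logit pairwise classifier head, and a simple separator-based input format; the released PairRanker/GenFuser models and their exact interfaces may differ.

```python
# Sketch of an LLM-Blender-style pipeline: pairwise ranking, then fusion.
# Checkpoint names and input formats are placeholders, not the official ones.
from itertools import combinations

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForSeq2SeqLM,
)

# --- PairRanker-style pairwise scoring (hypothetical checkpoint) ----------
RANKER_NAME = "your-org/pair-ranker"  # placeholder model ID
ranker_tok = AutoTokenizer.from_pretrained(RANKER_NAME)
ranker = AutoModelForSequenceClassification.from_pretrained(RANKER_NAME)

def pairwise_rank(instruction: str, candidates: list[str]) -> list[int]:
    """Return candidate indices sorted by number of pairwise wins."""
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        # Jointly encode the input with both candidates; the classifier head
        # is assumed to emit one logit, >0 meaning candidate i is better.
        text = f"{instruction} </s> {candidates[i]} </s> {candidates[j]}"
        inputs = ranker_tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            score = ranker(**inputs).logits.squeeze().item()
        wins[i if score > 0 else j] += 1
    return sorted(range(len(candidates)), key=lambda k: -wins[k])

# --- GenFuser-style fusion of the top-K candidates (hypothetical) ---------
FUSER_NAME = "your-org/gen-fuser"  # placeholder model ID
fuser_tok = AutoTokenizer.from_pretrained(FUSER_NAME)
fuser = AutoModelForSeq2SeqLM.from_pretrained(FUSER_NAME)

def blend(instruction: str, candidates: list[str], top_k: int = 3) -> str:
    """Rank all candidates pairwise, then fuse the top-K into one output."""
    order = pairwise_rank(instruction, candidates)
    top = [candidates[k] for k in order[:top_k]]
    prompt = instruction + " </s> " + " </s> ".join(top)
    inputs = fuser_tok(prompt, return_tensors="pt", truncation=True)
    output_ids = fuser.generate(**inputs, max_new_tokens=256)
    return fuser_tok.decode(output_ids[0], skip_special_tokens=True)
```

Usage would be `blend(instruction, candidate_outputs)` where `candidate_outputs` are the responses produced by the N base LLMs for the same instruction; the O(N^2) pairwise loop is what lets the ranker pick up subtle differences that pointwise scoring can miss.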
