Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
E. Xing | Joseph Gonzalez | Lianmin Zheng | Wei-Lin Chiang | Zhanghao Wu | Zhuohan Li | Yonghao Zhuang | Dacheng Li | I. Stoica | Zi Lin | Siyuan Zhuang | Haotong Zhang | Ying Sheng
[1] Yunbo Cao, et al. Large Language Models are not Fair Evaluators, 2023, ArXiv.
[2] S. Levine, et al. The False Promise of Imitating Proprietary LLMs, 2023, ArXiv.
[3] Luke Zettlemoyer, et al. QLoRA: Efficient Finetuning of Quantized LLMs, 2023, NeurIPS.
[4] Jimmy Ba, et al. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback, 2023, NeurIPS.
[5] Andrew M. Dai, et al. PaLM 2 Technical Report, 2023, ArXiv.
[6] Zhi Rui Tam, et al. OpenAssistant Conversations - Democratizing Large Language Model Alignment, 2023, ArXiv.
[7] Weizhu Chen, et al. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models, 2023, ArXiv.
[8] Meysam Alizadeh, et al. ChatGPT outperforms crowd workers for text-annotation tasks, 2023, Proceedings of the National Academy of Sciences of the United States of America.
[9] Marco Tulio Ribeiro, et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4, 2023, ArXiv.
[10] Naman Goyal, et al. LLaMA: Open and Efficient Foundation Language Models, 2023, ArXiv.
[11] Haewoon Kwak, et al. Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech, 2023, WWW.
[12] Quoc V. Le, et al. The Flan Collection: Designing Data and Methods for Effective Instruction Tuning, 2023, ICML.
[13] Noah A. Smith, et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions, 2022, ACL.
[14] Christopher D. Manning, et al. Holistic Evaluation of Language Models, 2023, Annals of the New York Academy of Sciences.
[15] Dongyan Zhao, et al. MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation, 2022, ACL.
[16] Gerard de Melo, et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, ArXiv.
[17] Daniel Y. Fu, et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, 2022, NeurIPS.
[18] Noah A. Smith, et al. Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks, 2022, EMNLP.
[19] Tom B. Brown, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022, ArXiv.
[20] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.
[21] Dale Schuurmans, et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models, 2022, NeurIPS.
[22] Mohammad Bavarian, et al. Training Verifiers to Solve Math Word Problems, 2021, ArXiv.
[23] Owain Evans, et al. TruthfulQA: Measuring How Models Mimic Human Falsehoods, 2021, ACL.
[24] Quoc V. Le, et al. Finetuned Language Models Are Zero-Shot Learners, 2021, ICLR.
[25] Wojciech Zaremba, et al. Evaluating Large Language Models Trained on Code, 2021, ArXiv.
[26] Hannaneh Hajishirzi, et al. Cross-Task Generalization via Natural Language Crowdsourcing Instructions, 2021, ACL.
[27] Dawn Song, et al. Measuring Massive Multitask Language Understanding, 2020, ICLR.
[28] Jaewoo Kang, et al. Look at the First Sentence: Position Bias in Question Answering, 2020, EMNLP.
[29] Ronan Le Bras, et al. WinoGrande, 2019, AAAI.
[30] Ali Farhadi, et al. HellaSwag: Can a Machine Really Finish Your Sentence?, 2019, ACL.
[31] Danqi Chen, et al. CoQA: A Conversational Question Answering Challenge, 2018, TACL.
[32] Oren Etzioni, et al. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, 2018, ArXiv.
[33] Marc Najork, et al. Position Bias Estimation for Unbiased Learning to Rank in Personal Search, 2018, WSDM.
[34] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL 2004.
[35] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[36] Jonathon D. Brown, et al. Evaluations of Self and Others: Self-Enhancement Biases in Social Judgments, 1986.
[37] N. Blunch. Position Bias in Multiple-Choice Questions, 1984.
[38] Frank Sifei Luan, et al. SkyPilot: An Intercloud Broker for Sky Computing, 2023, NSDI.
[39] Priya Raghubir, et al. Center-of-inattention: Position biases in decision-making, 2006.