Large Language Models are not Fair Evaluators
Yunbo Cao | Zhifang Sui | D. Zhu | Liang Chen | Peiyi Wang | Tianyu Liu | Lei Li | Binghuai Lin | Qi Liu | Dawei Zhu
[1] Omer Levy, et al. LIMA: Less Is More for Alignment, 2023, NeurIPS.
[2] Ethan Perez, et al. Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting, 2023, NeurIPS.
[3] Yiming Yang, et al. Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision, 2023, NeurIPS.
[4] Hongsheng Li, et al. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, 2023, ArXiv.
[5] Chunyuan Li, et al. Instruction Tuning with GPT-4, 2023, ArXiv.
[6] Julian McAuley, et al. Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data, 2023, EMNLP.
[7] Sam Bowman. Eight Things to Know about Large Language Models, 2023, ArXiv.
[8] Henrique Pondé de Oliveira Pinto, et al. GPT-4 Technical Report, 2023, arXiv:2303.08774.
[9] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.
[10] Weizhe Yuan, et al. BARTScore: Evaluating Generated Text as Text Generation, 2021, NeurIPS.
[11] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[12] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.
[13] M. McHugh. Interrater reliability: the kappa statistic, 2012, Biochemia medica.
[14] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL 2004.
[15] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.