Zero-shot NLG evaluation through Pairwise Comparisons with LLMs
[1] Donald Metzler, et al. Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting, 2023, ArXiv.
[2] Zhixu Li, et al. Is ChatGPT a Good NLG Evaluator? A Preliminary Study, 2023, ArXiv.
[3] C. Federmann, et al. Large Language Models Are State-of-the-Art Evaluators of Translation Quality, 2023, EAMT.
[4] M. Gales, et al. MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency in Summarization, 2023, ArXiv.
[5] Dragomir R. Radev, et al. SummEval: Re-evaluating Summarization Evaluation, 2020, Transactions of the Association for Computational Linguistics.
[6] Maxine Eskenazi, et al. USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation, 2020, ACL.
[7] Alex Wang, et al. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries, 2020, ACL.
[8] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[9] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.
[10] Joel R. Tetreault, et al. Discourse Coherence in the Wild: A Dataset, Evaluation and Methods, 2018, SIGDIAL Conference.
[11] Alon Lavie, et al. BLANC: Learning Evaluation Metrics for MT, 2005, HLT.
[12] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, IEEvaluation@ACL.
[13] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL.
[14] D. Katz, et al. GPT-4 passes the bar exam, 2024, Philosophical Transactions of the Royal Society A.
[15] Anja Belz, et al. Comparing Automatic and Human Evaluation of NLG Systems, 2006, EACL.
[16] L. Thurstone. A law of comparative judgment, 1994.