G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Yang Liu | Dan Iter | Yichong Xu | Shuo Wang | Ruochen Xu | Chenguang Zhu
[1] Zhixu Li, et al. Is ChatGPT a Good NLG Evaluator? A Preliminary Study, 2023, arXiv.
[2] C. Federmann, et al. Large Language Models Are State-of-the-Art Evaluators of Translation Quality, 2023, arXiv.
[3] Pengfei Liu, et al. GPTScore: Evaluate as You Desire, 2023, NAACL.
[4] Percy Liang, et al. Benchmarking Large Language Models for News Summarization, 2023, arXiv.
[5] Pengfei Liu, et al. Towards a Unified Multi-Dimensional Evaluator for Text Generation, 2022, EMNLP.
[6] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.
[7] Dale Schuurmans, et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022, NeurIPS.
[8] Noah A. Smith, et al. Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand, 2021, NAACL.
[9] Weizhe Yuan, et al. BARTScore: Evaluating Generated Text as Text Generation, 2021, NeurIPS.
[10] Liang Lin, et al. Towards Quantifiable Dialogue Coherence Evaluation, 2021, ACL.
[11] Dragomir R. Radev, et al. SummEval: Re-evaluating Summarization Evaluation, 2020, TACL.
[12] J. C. Cheung, et al. Factual Error Correction for Abstractive Summarization Models, 2020, EMNLP.
[13] Mona T. Diab, et al. FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization, 2020, ACL.
[14] Maxine Eskenazi, et al. USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation, 2020, ACL.
[15] Alex Wang, et al. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries, 2020, ACL.
[16] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.
[17] Richard Socher, et al. Evaluating the Factual Consistency of Abstractive Text Summarization, 2019, EMNLP.
[18] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, JMLR.
[19] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.
[20] Fei Liu, et al. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance, 2019, EMNLP.
[21] Noah A. Smith, et al. Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts, 2019, ACL.
[22] Osmar R. Zaïane, et al. Evaluating Coherence in Dialogue Systems using Entailment, 2019, NAACL.
[23] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[24] Mirella Lapata, et al. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization, 2018, EMNLP.
[25] Furu Wei, et al. Faithful to the Original: Fact Aware Neural Abstractive Summarization, 2017, AAAI.
[26] Matt J. Kusner, et al. From Word Embeddings To Document Distances, 2015, ICML.
[27] Phil Blunsom, et al. Teaching Machines to Read and Comprehend, 2015, NIPS.
[28] Anja Belz, et al. An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems, 2009, CL.
[29] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, IEEvaluation@ACL.
[30] Matthew Marge, et al. Evaluating Evaluation Methods for Generation in the Presence of Variation, 2005, CICLing.
[31] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL.
[32] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.