[1] Verena Rieser, et al. RankME: Reliable Human Ratings for Natural Language Generation, 2018, NAACL.
[2] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[3] Elizabeth Clark, et al. Evaluation of Text Generation: A Survey, 2020, ArXiv.
[4] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.
[5] Jungo Kasai, et al. GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation, 2021, ArXiv.
[6] Danqi Chen, et al. Making Pre-trained Language Models Better Few-shot Learners, 2021, ACL/IJCNLP.
[7] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL Workshop on Text Summarization Branches Out.
[8] Minlie Huang, et al. UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation, 2020, EMNLP.
[9] Yejin Choi, et al. The Curious Case of Neural Text Degeneration, 2019, ICLR.
[10] Tom Feltwell, et al. Rethinking Engagement with Online News through Social and Visual Co-Annotation, 2018, CHI.
[11] Hugo Liu, et al. ConceptNet — A Practical Commonsense Reasoning Tool-Kit, 2004, BT Technology Journal.
[12] Roger C. Schank, et al. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures, 1977.
[13] M. de Rijke, et al. Light-Weight Entailment Checking for Computational Semantics, 2001.
[14] Laria Reynolds, et al. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm, 2021, CHI Extended Abstracts.
[15] Percy Liang, et al. Unifying Human and Statistical Evaluation for Natural Language Generation, 2019, NAACL.
[16] Joelle Pineau, et al. Language GANs Falling Short, 2018, ICLR.
[17] Yann Mathet, et al. The Unified and Holistic Method Gamma (γ) for Inter-Annotator Agreement Measure and Alignment, 2015, Computational Linguistics.
[18] Klaus Krippendorff, et al. Content Analysis: An Introduction to Its Methodology, 1980.
[19] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[20] Mitesh M. Khapra, et al. A Survey of Evaluation Metrics Used for NLG Systems, 2020, ACM Computing Surveys.
[21] Yejin Choi, et al. MAUVE: Human-Machine Divergence Curves for Evaluating Open-Ended Text Generation, 2021, ArXiv.
[22] Christopher D. Manning, et al. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages, 2020, ACL.
[23] Noah A. Smith, et al. Choose Your Own Adventure: Paired Suggestions in Collaborative Writing for Evaluating Story Generation Models, 2021, NAACL.
[24] Matthew Richardson, et al. MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text, 2013, EMNLP.
[25] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.
[26] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[27] Chris Callison-Burch, et al. RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text, 2020, EMNLP.
[28] Yejin Choi, et al. Learning to Write with Cooperative Discriminators, 2018, ACL.
[29] Hannaneh Hajishirzi, et al. Entity, Relation, and Event Extraction with Contextualized Span Representations, 2019, EMNLP.
[30] Ali Farhadi, et al. TuringAdvice: A Generative and Dynamic Evaluation of Language Use, 2021, NAACL.
[31] Jing Gu, et al. Perception Score: A Learned Metric for Open-ended Text Generation Evaluation, 2020, ArXiv.
[32] Dimitra Gkatzia, et al. Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions, 2020, INLG.
[33] Colin Raffel, et al. Extracting Training Data from Large Language Models, 2020, USENIX Security Symposium.
[34] Peter Henderson, et al. With Little Power Comes Great Responsibility, 2020, EMNLP.
[35] Ali Farhadi, et al. Defending Against Neural Fake News, 2019, NeurIPS.
[36] H. P. Grice. Logic and Conversation, 1975, Syntax and Semantics 3: Speech Acts.