MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers
Krishna Pillutla | Swabha Swayamdipta | Rowan Zellers | John Thickstun | Sean Welleck | Yejin Choi | Zaid Harchaoui
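Since this listing carries only the title, here is a rough, hypothetical sketch of the divergence-frontier construction the title refers to, not the authors' released implementation (that is distributed as the `mauve-text` package). The function names, the mixture-weight grid, the implied scaling constant of 1, and the toy histograms below are illustrative assumptions; MAUVE itself obtains the histograms by quantizing contextual embeddings of human and machine text and applies additional scaling.

```python
import numpy as np

def divergence_frontier(p, q, num_weights=25, eps=1e-12):
    """Trace a divergence frontier between two discrete distributions p and q.

    For each mixture weight lam in (0, 1), form r = lam * p + (1 - lam) * q
    and record the pair (KL(q || r), KL(p || r)).
    """
    def kl(a, b):
        a = np.clip(a, eps, None)
        b = np.clip(b, eps, None)
        return float(np.sum(a * np.log(a / b)))

    frontier = []
    for lam in np.linspace(0.01, 0.99, num_weights):
        r = lam * p + (1.0 - lam) * q
        frontier.append((kl(q, r), kl(p, r)))
    return frontier

def frontier_auc(frontier):
    """Summarize the frontier by the area under the curve traced by the points
    (exp(-KL(q || r)), exp(-KL(p || r))), using the trapezoid rule.
    Values near 1 indicate the two distributions are close; near 0, far apart."""
    pairs = sorted((np.exp(-dq), np.exp(-dp)) for dq, dp in frontier)
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(pairs[:-1], pairs[1:]):
        auc += 0.5 * (y0 + y1) * (x1 - x0)
    return auc

# Toy usage: two histograms over a shared support (in MAUVE, such histograms
# would come from clustering/quantizing embeddings of human and model text).
p = np.array([0.6, 0.3, 0.1])   # stand-in for the human-text distribution
q = np.array([0.2, 0.3, 0.5])   # stand-in for the model-text distribution
print(frontier_auc(divergence_frontier(p, q)))
```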
[1] Stanislau Semeniuta, et al. On Accurate Evaluation of GANs for Language Generation, 2018, ArXiv.
[2] Geoffrey E. Hinton, et al. A Learning Algorithm for Boltzmann Machines, 1985, Cogn. Sci.
[3] Daphne Ippolito, et al. Trading Off Diversity and Quality in Natural Language Generation, 2020, HUMEVAL.
[4] Emily M. Bender, et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜, 2021, FAccT.
[5] Verena Rieser, et al. Why We Need New Evaluation Metrics for NLG, 2017, EMNLP.
[6] Anette Frank, et al. Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR, 2020, EACL.
[7] Mamoru Komachi, et al. RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation, 2018, WMT.
[8] Sepp Hochreiter, et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, 2017, NIPS.
[9] Yann Dauphin, et al. Hierarchical Neural Story Generation, 2018, ACL.
[10] D. Hunter. MM algorithms for generalized Bradley-Terry models, 2003.
[11] Lei Zheng, et al. Texygen: A Benchmarking Platform for Text Generation Models, 2018, SIGIR.
[12] Lysandre Debut, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, ArXiv.
[13] Yejin Choi, et al. The Curious Case of Neural Text Degeneration, 2019, ICLR.
[14] Percy Liang, et al. Unifying Human and Statistical Evaluation for Natural Language Generation, 2019, NAACL.
[15] N. Vayatis, et al. Overlaying Classifiers: A Practical Approach to Optimal Scoring, 2010.
[16] Marzena Karpinska, et al. The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation, 2021, EMNLP.
[17] Ramón Fernández Astudillo, et al. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification, 2016, ICML.
[18] Noah A. Smith, et al. Sentence Mover's Similarity: Automatic Evaluation for Multi-Sentence Texts, 2019, ACL.
[19] Mehryar Mohri, et al. Confidence Intervals for the Area Under the ROC Curve, 2004, NIPS.
[20] T. Han, et al. Mathematics of information and coding, 2001.
[21] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL.
[22] Jianfeng Gao, et al. PlotMachines: Outline-Conditioned Generation with Dynamic Plot State Tracking, 2020, EMNLP.
[23] Yejin Choi, et al. Divergence Frontiers for Generative Models: Sample Complexity, Quantization Level, and Frontier Integral, 2021, ArXiv.
[24] Ali Farhadi, et al. Defending Against Neural Fake News, 2019, NeurIPS.
[25] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.
[26] Joelle Pineau, et al. The Second Conversational Intelligence Challenge (ConvAI2), 2019, The NeurIPS '18 Competition.
[27] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[28] Tal August, et al. All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text, 2021, ACL.
[29] Thibault Sellam, et al. BLEURT: Learning Robust Metrics for Text Generation, 2020, ACL.
[30] Joelle Pineau, et al. Language GANs Falling Short, 2018, ICLR.
[31] Matt J. Kusner, et al. From Word Embeddings To Document Distances, 2015, ICML.
[32] Mitesh M. Khapra, et al. A Survey of Evaluation Metrics Used for NLG Systems, 2020, ACM Comput. Surv.
[33] Dongyan Zhao, et al. RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems, 2017, AAAI.
[34] Stéphan Clémençon, et al. Nonparametric estimation of the precision-recall curve, 2009, ICML.
[35] Elizabeth Clark, et al. Evaluation of Text Generation: A Survey, 2020, ArXiv.
[36] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, IEEvaluation@ACL.
[37] André F. T. Martins, et al. Sparse Text Generation, 2020, EMNLP.
[38] Anastasia Shimorina, et al. The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP, 2021, ArXiv.
[39] Olivier Bachem, et al. Assessing Generative Models via Precision and Recall, 2018, NeurIPS.
[40] Jaakko Lehtinen, et al. Improved Precision and Recall Metric for Assessing Generative Models, 2019, NeurIPS.
[41] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[42] Peter A. Flach, et al. Machine Learning: The Art and Science of Algorithms that Make Sense of Data, 2012.
[43] David Lopez-Paz, et al. Revisiting Classifier Two-Sample Tests, 2016, ICLR.
[44] Arthur Gretton, et al. Demystifying MMD GANs, 2018, ICLR.
[45] Jason Weston, et al. Neural Text Generation with Unlikelihood Training, 2019, ICLR.
[46] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[47] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[48] Minlie Huang, et al. UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation, 2020, EMNLP.
[49] Cordelia Schmid, et al. Spreading vectors for similarity search, 2018, ICLR.
[50] Fei Liu, et al. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance, 2019, EMNLP.
[51] Perttu Hämäläinen, et al. Deep Residual Mixture Models, 2020, ArXiv.
[52] Kaisa Miettinen, et al. Nonlinear multiobjective optimization, 1998, International Series in Operations Research and Management Science.
[53] Simon Mille, et al. Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing, 2020, INLG.
[54] C. Villani. Topics in Optimal Transportation, 2003.
[55] Wilker Aziz, et al. Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation, 2020, COLING.
[56] Kyunghyun Cho, et al. Consistency of a Recurrent Language Model With Respect to Incomplete Decoding, 2020, EMNLP.
[57] Fabrizio Silvestri, et al. How Decoding Strategies Affect the Verifiability of Generated Text, 2020, Findings of EMNLP.
[58] Chris Callison-Burch, et al. Human and Automatic Detection of Generated Text, 2019, ArXiv.
[59] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[60] Susan A. Murphy, et al. Monographs on statistics and applied probability, 1990.
[61] Hinrich Schütze, et al. Foundations of Statistical Natural Language Processing, 1999, CL.
[62] Olivier Bachem, et al. Precision-Recall Curves Using Information Divergence Frontiers, 2019, AISTATS.
[63] J. Marden. Analyzing and Modeling Rank Data, 1996.
[64] Wojciech Zaremba, et al. Improved Techniques for Training GANs, 2016, NIPS.
[65] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.