A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods
[1] Ondrej Bojar, et al. Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges, 2019, WMT.
[2] Dan Klein, et al. An Empirical Investigation of Statistical Significance in NLP, 2012, EMNLP.
[3] John M. Conroy, et al. OCCAMS -- An Optimal Combinatorial Covering Algorithm for Multi-document Summarization, 2012, 2012 IEEE 12th International Conference on Data Mining Workshops.
[4] D. Bonett, et al. Sample size requirements for estimating Pearson, Kendall and Spearman correlations, 2000.
[5] George A. Vouros, et al. Summarization system evaluation revisited: N-gram graphs, 2008, TSLP.
[6] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.
[7] John M. Conroy, et al. A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art, 2013, ACL.
[8] Hoa Trang Dang, et al. Overview of the TAC 2008 Update Summarization Task, 2008, TAC.
[9] Ido Dagan, et al. Crowdsourcing Lightweight Pyramids for Manual Summary Evaluation, 2019, NAACL.
[10] Eduard H. Hovy, et al. Summarization Evaluation Using Transformed Basic Elements, 2008, TAC.
[11] F. Wilcoxon. Individual Comparisons by Ranking Methods, 1945.
[12] Alon Lavie, et al. Meteor Universal: Language Specific Translation Evaluation for Any Target Language, 2014, WMT@ACL.
[13] S. Lewis, et al. Regression analysis, 2007, Practical Neurology.
[14] Yvette Graham, et al. Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE, 2015, EMNLP.
[15] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[16] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL 2004.
[17] S. Shapiro, et al. An Analysis of Variance Test for Normality (Complete Samples), 1965.
[18] Rotem Dror, et al. The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing, 2018, ACL.
[19] Y. B. Wah, et al. Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests, 2011.
[20] John M. Conroy, et al. An Assessment of the Accuracy of Automatic Evaluation in Summarization, 2012, EvalMetrics@NAACL-HLT.
[21] O. J. Dunn, et al. Comparison of Tests of the Equality of Dependent Correlation Coefficients, 1971.
[22] M. Kenward, et al. An Introduction to the Bootstrap, 2007.
[23] Rotem Dror, et al. Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets, 2017, TACL.
[24] George Giannakopoulos, et al. Summarization System Evaluation Variations Based on N-Gram Graphs, 2010, TAC.
[25] Fei Liu, et al. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance, 2019, EMNLP.
[26] Markus Freitag, et al. Results of the WMT20 Metrics Shared Task, 2020, WMT.
[27] Rotem Dror, et al. Statistical Significance Testing for Natural Language Processing, 2020, Synthesis Lectures on Human Language Technologies.
[28] Dan Roth, et al. SacreROUGE: An Open-Source Library for Using and Developing Summarization Evaluation Metrics, 2020, NLPOSS.
[29] Richard Socher, et al. SummEval: Re-evaluating Summarization Evaluation, 2020, ArXiv.
[30] Timothy Baldwin, et al. Testing for Significance of Increased Correlation with Human Judgment, 2014, EMNLP.
[31] Graham Neubig, et al. Re-evaluating Evaluation in Text Summarization, 2020, EMNLP.
[32] Yang Liu, et al. Non-Expert Evaluation of Summarization Systems is Risky, 2010, Mturk@HLT-NAACL.
[33] Bowen Zhou, et al. Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond, 2016, CoNLL.
[34] George Giannakopoulos, et al. Summary Evaluation: Together We Stand NPowER-ed, 2013, CICLing.
[35] Daniel Khashabi, et al. Not All Claims are Created Equal: Choosing the Right Statistical Approach to Assess Hypotheses, 2020, ACL.
[36] Dan Roth, et al. Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary, 2021, Transactions of the Association for Computational Linguistics.
[37] Dianne P. O'Leary, et al. Ranking Human and Machine Summarization Systems, 2011, EMNLP.
[38] Iryna Gurevych, et al. Learning to Score System Summaries for Better Content Selection Evaluation, 2017, NFiS@EMNLP.
[39] M. Kendall. Statistical Methods for Research Workers, 1937, Nature.
[40] John M. Conroy, et al. Better Metrics to Automatically Predict the Quality of a Text Summary, 2012, Algorithms.