Paper Information - Appendix - Recommended Statistical Significance Tests for NLP Tasks

Statistical significance testing plays an important role when drawing conclusions from experimental results in NLP papers. In particular, it is a valuable tool for establishing the superiority of one algorithm over another. This appendix complements the guide to testing statistical significance in NLP presented in [13] by proposing valid statistical tests for the common tasks and evaluation measures in the field.

Rotem Dror | Roi Reichart

[1] Alexander S. Yeh, et al. More accurate tests for the statistical significance of result differences, 2000, COLING.

[2] Lynette Hirschman, et al. A Model-Theoretic Coreference Scoring Scheme, 1995, MUC.

[3] Heeyoung Lee, et al. Joint Entity and Event Coreference Resolution across Documents, 2012, EMNLP.

[4] Noah A. Smith, et al. Dependency Parsing, 2009, Encyclopedia of Artificial Intelligence.

[5] C. Lawrence Zitnick, et al. CIDEr: Consensus-based image description evaluation, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL 2004.

[7] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[8] William B. Dolan, et al. Collecting Highly Parallel Data for Paraphrase Evaluation, 2011, ACL.

[9] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, IEEvaluation@ACL.

[10] Xiaoqiang Luo, et al. On Coreference Resolution Performance Metrics, 2005, HLT.

[11] Breck Baldwin, et al. Algorithms for Scoring Coreference Chains, 1998.

[12] Klaus Krippendorff, et al. Computing Krippendorff's Alpha-Reliability, 2011.

[13] Rotem Dror, et al. The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing, 2018, ACL.