Scoring Coreference Partitions of Predicted Mentions: A Reference Implementation

The definitions of two coreference scoring metrics, B³ and CEAF, are underspecified with respect to predicted, as opposed to key (or gold), mentions. Several variations have been proposed that manipulate the key mentions, the predicted mentions, or both in order to obtain a one-to-one mapping. Meanwhile, the BLANC metric was, until recently, limited to scoring partitions of key mentions. In this paper, we (i) argue that mention manipulation for scoring predicted mentions is unnecessary and potentially harmful, as it can produce unintuitive results; (ii) illustrate the application of all these measures to scoring predicted mentions; (iii) make available an open-source, thoroughly tested reference implementation of the main coreference evaluation measures; and (iv) rescore the results of the CoNLL-2011/2012 shared task systems with this implementation. This will help the community accurately measure and compare new end-to-end coreference resolution algorithms.
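To make the "no manipulation needed" argument concrete, the sketch below computes B³ precision and recall directly over predicted and key mentions: each mention is scored by the overlap between the key entity and the response entity containing it, and a mention absent from the other side simply contributes an empty overlap. This is a minimal illustration, not the paper's reference implementation; the cluster representation and mention identifiers are assumptions for the example.

```python
def b_cubed(key_clusters, response_clusters):
    """Intersection-based B-cubed over raw key and predicted mentions.

    Clusters are iterables of hashable mention identifiers. No mention
    is added to or removed from either side before scoring.
    """
    def index(clusters):
        # Map each mention to the (set-valued) entity that contains it.
        mention_to_entity = {}
        for cluster in clusters:
            entity = set(cluster)
            for mention in entity:
                mention_to_entity[mention] = entity
        return mention_to_entity

    key_idx = index(key_clusters)
    resp_idx = index(response_clusters)

    def score(num_idx, den_idx):
        # Average per-mention overlap ratio over the denominator side.
        total = 0.0
        for mention, den_entity in den_idx.items():
            num_entity = num_idx.get(mention, set())  # empty if unseen
            total += len(den_entity & num_entity) / len(den_entity)
        return total / len(den_idx) if den_idx else 0.0

    recall = score(resp_idx, key_idx)       # averaged over key mentions
    precision = score(key_idx, resp_idx)    # averaged over predicted mentions
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, splitting the key entity {a, b, c} into {a, b} and {c} yields perfect precision but reduced recall, while a predicted mention with no key counterpart lowers precision without any artificial alignment step.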
