Which Coreference Evaluation Metric Do You Trust? A Proposal for a Link-based Entity Aware Metric

Interpretability and discriminative power are the two most basic requirements for an evaluation metric. In this paper, we report the mention identification effect in the B3, CEAF, and BLANC coreference evaluation metrics, which makes it impossible to interpret their results properly. The only metric insensitive to this flaw is MUC, which, however, is known to be the least discriminative metric. Since none of the current metrics is reliable on its own, the common practice for ranking coreference resolvers is to average three different metrics; however, one cannot expect to obtain a reliable score by averaging three unreliable metrics. We propose LEA, a Link-based Entity-Aware evaluation metric designed to overcome the shortcomings of the current evaluation metrics. LEA is available as branch LEA-scorer in the reference implementation of the official CoNLL scorer.
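The link-based entity-aware idea can be illustrated with a minimal sketch: each entity is weighted by its size (its importance) and scored by the fraction of its coreference links that the other partition reproduces. This is only an illustration under stated assumptions, not the official scorer; here singleton entities are skipped (CoNLL-style data contains no singletons), and the paper should be consulted for the exact definitions.

```python
def lea(key, response):
    """Sketch of a link-based entity-aware score.

    key, response: lists of entities, each a set of mention ids.
    Returns (recall, precision, f1). Assumptions: importance of an
    entity is its size, and singleton entities are skipped.
    """
    def links(n):
        # number of coreference links among n mentions
        return n * (n - 1) // 2

    def score(entities, partition):
        num = den = 0.0
        for e in entities:
            if len(e) < 2:  # skip singletons (none in CoNLL data)
                continue
            # links of e that survive in the other partition
            resolved = sum(links(len(e & r)) for r in partition)
            num += len(e) * resolved / links(len(e))
            den += len(e)
        return num / den if den else 0.0

    recall = score(key, response)
    precision = score(response, key)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1
```

For example, with gold entities {1,2,3} and {4,5}, a system output of {1,2}, {3}, {4,5} recovers one of the three links of the first entity and the single link of the second, giving a recall of (3·1/3 + 2·1)/5 = 0.6 at perfect precision.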
