Do UD Trees Match Mention Spans in Coreference Annotations?

Dozens of data resources exist for various languages in which coreference, a relation between two or more expressions that refer to the same real-world entity, is manually annotated. One might assume that such expressions usually constitute syntactically meaningful units; however, in most coreference projects mention spans have been annotated simply by delimiting token intervals, i.e., independently of any syntactic representation. We argue that it could be advantageous to make syntactic and coreference annotations converge in the long term. We present a pilot empirical study of matches and mismatches between hand-annotated linear mention spans and automatically parsed syntactic trees that follow Universal Dependencies conventions. The study covers 9 datasets for 8 different languages.
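To make the notion of a span "matching" a tree concrete, the sketch below checks whether a linear token interval coincides exactly with one complete dependency subtree. This is only one plausible matching criterion; the abstract does not state which criteria the study actually uses. The function names and the head-list encoding (one head index per token, 0 for the sentence root, as in the HEAD column of CoNLL-U) are assumptions made for illustration.

```python
from typing import Dict, List, Set


def subtree_nodes(heads: List[int], root: int) -> Set[int]:
    """Return the 1-based ids of `root` and all of its descendants.

    heads[i - 1] is the head of token i; 0 marks the sentence root.
    """
    children: Dict[int, List[int]] = {}
    for tok, head in enumerate(heads, start=1):
        children.setdefault(head, []).append(tok)

    seen: Set[int] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:  # guard against malformed (cyclic) input
            continue
        seen.add(node)
        stack.extend(children.get(node, []))
    return seen


def span_matches_subtree(heads: List[int], start: int, end: int) -> bool:
    """True if tokens start..end (inclusive, 1-based) form exactly one subtree.

    Hypothetical criterion: the span has a single token whose head lies
    outside the span, and that token's full subtree equals the span.
    """
    span = set(range(start, end + 1))
    tops = [t for t in sorted(span) if heads[t - 1] not in span]
    if len(tops) != 1:
        return False
    return subtree_nodes(heads, tops[0]) == span


# "The old dog barked at me" with UD-style heads (hand-written, not parsed):
# The->dog, old->dog, dog->barked, barked->ROOT, at->me, me->barked
heads = [3, 3, 4, 0, 6, 4]
print(span_matches_subtree(heads, 1, 3))  # "The old dog": a full subtree
print(span_matches_subtree(heads, 2, 3))  # "old dog": the determiner is missing
```

Under this criterion a mention such as "old dog" counts as a mismatch because the subtree headed by "dog" also contains the determiner; a real study would likely distinguish several such mismatch types rather than report a single yes/no outcome.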
