Linguistically debatable or just plain wrong?

In linguistic annotation projects, we typically develop annotation guidelines to minimize disagreement. However, in this position paper we question whether we should actually limit disagreements between annotators rather than embracing them. We present an empirical analysis of part-of-speech annotated data sets which suggests that disagreements are systematic across domains and, to a certain extent, also across languages. This points to underlying ambiguity rather than random errors. Moreover, a quantitative analysis of tag confusions reveals that the majority of disagreements are due to linguistically debatable cases rather than annotation errors. Specifically, we show that even in the absence of annotation guidelines, only 2% of annotator choices are linguistically unmotivated.
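To make the tag-confusion analysis concrete, here is a minimal sketch (not the paper's actual code) of how one might tabulate confusions between two annotators who tagged the same tokens; the function name, tagset, and annotations below are invented for illustration. Ranking the off-diagonal pairs by frequency is what surfaces systematic confusions, such as recurring VERB/NOUN disagreements, as opposed to random noise.

```python
# Hypothetical sketch: tabulate tag confusions between two annotators.
from collections import Counter

def tag_confusions(tags_a, tags_b):
    """Count how often annotator A's tag is paired with annotator B's
    tag for the same token; off-diagonal pairs are disagreements."""
    assert len(tags_a) == len(tags_b)
    return Counter(zip(tags_a, tags_b))

# Invented annotations of the same five tokens (coarse tagset assumed).
ann_a = ["NOUN", "VERB", "ADP", "NOUN", "ADJ"]
ann_b = ["NOUN", "NOUN", "PRT", "NOUN", "ADJ"]

confusions = tag_confusions(ann_a, ann_b)
disagreements = {pair: n for pair, n in confusions.items() if pair[0] != pair[1]}

# Frequent, recurring pairs suggest linguistically debatable cases;
# isolated one-off pairs are more likely plain annotation errors.
for (tag_a, tag_b), n in sorted(disagreements.items(), key=lambda kv: -kv[1]):
    print(f"{tag_a} vs {tag_b}: {n}")
```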
