Learning part-of-speech taggers with inter-annotator agreement loss

In natural language processing (NLP) annotation projects, we use inter-annotator agreement measures and annotation guidelines to ensure consistent annotations. However, annotation guidelines often make linguistically debatable and even somewhat arbitrary decisions, and inter-annotator agreement is often less than perfect. While annotation projects usually specify how to deal with linguistically debatable phenomena, annotator disagreements typically still stem from these “hard” cases. This indicates that some errors are more debatable than others. In this paper, we use small samples of doubly annotated part-of-speech (POS) data for Twitter to estimate annotation reliability, and we show how these estimates of likely inter-annotator agreement can be incorporated into the loss functions of POS taggers. We find that the resulting cost-sensitive algorithms perform better across annotation projects and, more surprisingly, even on data annotated according to the same guidelines. Finally, we show that POS tagging models sensitive to inter-annotator agreement perform better on the downstream task of chunking.
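
As a rough illustration of the idea, the sketch below shows one way agreement estimates from a doubly annotated sample might scale the loss of a tagger. It is not the paper's implementation: the per-token perceptron update, the way the confusion probability is normalised, and all function names (`confusion_probabilities`, `agreement_cost`, `perceptron_update`) are assumptions made for this example, and the tag sequences are toy data.

```python
from collections import Counter, defaultdict


def confusion_probabilities(tags_a, tags_b):
    """Estimate how often two annotators split between a pair of tags.

    Assumed definition: for each unordered tag pair, the number of tokens on
    which the annotators disagree with exactly these two tags, divided by the
    number of tokens on which either annotator used one of the two tags.
    """
    pairs = list(zip(tags_a, tags_b))
    disagreements = Counter(frozenset(p) for p in pairs if p[0] != p[1])
    probs = {}
    for pair, n_split in disagreements.items():
        n_involved = sum(1 for a, b in pairs if a in pair or b in pair)
        probs[pair] = n_split / n_involved
    return probs


def agreement_cost(gold, predicted, probs):
    """Cost of predicting `predicted` for `gold`: lower for debatable pairs."""
    if gold == predicted:
        return 0.0
    return 1.0 - probs.get(frozenset((gold, predicted)), 0.0)


def perceptron_update(weights, features, gold, predicted, probs):
    """Cost-scaled perceptron update for a single token (illustrative only)."""
    step = agreement_cost(gold, predicted, probs)
    if step == 0.0:
        return step
    for feature in features:
        weights[(feature, gold)] += step
        weights[(feature, predicted)] -= step
    return step


# Toy doubly annotated sample: the annotators split once between NOUN and ADJ.
annotator_a = ["NOUN", "NOUN", "ADJ", "VERB"]
annotator_b = ["NOUN", "ADJ", "ADJ", "VERB"]
probs = confusion_probabilities(annotator_a, annotator_b)

weights = defaultdict(float)
debatable_step = perceptron_update(weights, ["word=running"], "NOUN", "ADJ", probs)
clear_step = perceptron_update(weights, ["word=the"], "NOUN", "VERB", probs)
```

With this toy sample, confusing NOUN with ADJ triggers a smaller update than confusing NOUN with VERB, so the model is penalised less for errors that human annotators also tend to disagree on.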
