Squibs: From Annotator Agreement to Noise Models

This article discusses the transition from annotated data to a gold standard, that is, a subset that is sufficiently noise-free with high confidence. Unless appropriately reinterpreted, agreement coefficients do not indicate the quality of the data set as a benchmarking resource: High overall agreement is neither sufficient nor necessary to distill some amount of highly reliable data from the annotated material. A mathematical framework is developed that allows estimation of the noise level of the agreed subset of annotated data, which helps promote cautious benchmarking.
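To make the distinction between an agreement coefficient and the agreed subset concrete, the following is a minimal sketch for the two-annotator case. It computes Cohen's kappa [2] and, separately, extracts the items on which both annotators gave the same label. The function names `cohen_kappa` and `agreed_subset` and the toy labels are assumptions made for illustration. This is not the article's noise-estimation framework, which the abstract does not spell out.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (observed - expected) / (1 - expected), where `expected` is the
    chance agreement implied by each annotator's own label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement: sum over categories of the product of marginals.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def agreed_subset(labels_a, labels_b):
    """Return (item index, label) pairs for items with identical labels."""
    return [
        (i, a)
        for i, (a, b) in enumerate(zip(labels_a, labels_b))
        if a == b
    ]


if __name__ == "__main__":
    # Hypothetical toy annotations, for illustration only.
    annotator_1 = ["y", "y", "n", "n"]
    annotator_2 = ["y", "n", "n", "n"]
    print(cohen_kappa(annotator_1, annotator_2))
    print(agreed_subset(annotator_1, annotator_2))
```

The coefficient summarizes the whole data set in one number, whereas the agreed subset is the candidate gold standard. Nothing in this sketch indicates how many of the agreed items were agreed on by chance; estimating that is the question the article's framework addresses.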
