Some Empirical Evidence for Annotation Noise in a Benchmarked Dataset

A number of recent articles in computational linguistics venues have called for a closer examination of the type of noise present in annotated datasets used for benchmarking (Reidsma and Carletta, 2008; Beigman Klebanov and Beigman, 2009). In particular, Beigman Klebanov and Beigman articulated a type of noise they call annotation noise and showed that, in the worst case, such noise can severely degrade the generalization ability of a linear classifier (Beigman and Beigman Klebanov, 2009). In this paper, we provide quantitative empirical evidence for the existence of this type of noise in a recently benchmarked dataset. The proposed methodology can be used to zero in on unreliable instances, facilitating the generation of cleaner gold standards for benchmarking.

[1]  P. Albert, et al. A Cautionary Note on the Robustness of Latent Class Models for Estimating Diagnostic Error without a Gold Standard, 2004, Biometrics.

[2]  Jaime G. Carbonell, et al. Efficiently learning the accuracy of labeling sources for selective sampling, 2009, KDD.

[3]  Brendan T. O'Connor, et al. Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks, 2008, EMNLP.

[4]  Luis von Ahn. Games with a Purpose, 2006, Computer.

[5]  Alias-i, Inc. Multilevel Bayesian Models of Categorical Data Annotation, 2008.

[6]  M. Espeland, et al. Using latent class models to characterize and assess relative error in discrete measurements, 1989, Biometrics.

[7]  Chris Callison-Burch, et al. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk, 2009, EMNLP.

[8]  Ido Dagan, et al. The Third PASCAL Recognizing Textual Entailment Challenge, 2007, ACL-PASCAL@ACL.

[9]  Panagiotis G. Ipeirotis, et al. Get another label? improving data quality and data mining using multiple, noisy labelers, 2008, KDD.

[10]  I. Yang, et al. Latent variable modeling of diagnostic accuracy, 1997, Biometrics.

[11]  Jean Carletta, et al. Squibs: Reliability Measurement without Limits, 2008, CL.

[12]  Gerardo Hermosillo, et al. Supervised learning from multiple experts: whom to trust when everyone lies a bit, 2009, ICML '09.

[13]  Beata Beigman Klebanov, et al. Squibs: From Annotator Agreement to Noise Models, 2009, CL.

[14]  Beata Beigman Klebanov, et al. Learning with Annotation Noise, 2009, ACL.

[15]  Udo Kruschwitz, et al. ANAWIKI: Creating Anaphorically Annotated Resources through Web Cooperation, 2008, LREC.

[16]  P. S. Albert, et al. Latent Class Modeling Approaches for Assessing Diagnostic Error without a Gold Standard: With Applications to p53 Immunohistochemical Assays in Bladder Tumors, 2001, Biometrics.