Analyzing the Impact of Prevalence on the Evaluation of a Manual Annotation Campaign

This article presents work aimed at evaluating the quality of the manual annotation of gene renaming couples in scientific abstracts, a task that generates sparse annotations. To evaluate these annotations, we compare the results obtained using the commonly advocated inter-annotator agreement coefficients such as S, κ and π, the lesser-known R, the weighted coefficients κω and α, as well as the F-measure and the SER. We analyze the extent to which they are relevant for our data. We then study the bias introduced by prevalence by changing the way the contingency table is built. Finally, we propose an original way to synthesize the results by computing distances between categories, based on the produced annotations.
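
The prevalence effect the abstract refers to is easiest to see on a 2×2 contingency table. Below is a minimal sketch, assuming invented counts rather than the paper's data, of how Cohen's κ and Scott's π are computed from such a table, and of why a skewed category distribution, typical of sparse annotations such as renaming couples, depresses chance-corrected agreement even when observed agreement is unchanged.

```python
# Minimal sketch (not from the paper): Cohen's kappa vs. Scott's pi on a
# 2x2 contingency table, illustrating the prevalence effect. The counts
# below are invented for illustration.

def kappa_pi(table):
    """table[i][j] = number of items annotator A put in category i
    and annotator B put in category j."""
    n = sum(sum(row) for row in table)
    k = len(table)
    a_o = sum(table[i][i] for i in range(k)) / n                      # observed agreement
    row = [sum(table[i]) / n for i in range(k)]                       # A's marginals
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # B's marginals
    a_e_kappa = sum(row[i] * col[i] for i in range(k))    # Cohen: individual marginals
    pooled = [(row[i] + col[i]) / 2 for i in range(k)]
    a_e_pi = sum(p * p for p in pooled)                   # Scott: pooled marginals
    return ((a_o - a_e_kappa) / (1 - a_e_kappa),
            (a_o - a_e_pi) / (1 - a_e_pi))

# Same observed agreement (90%), different prevalence:
balanced = [[45, 5], [5, 45]]   # categories equally frequent
skewed   = [[85, 5], [5, 5]]    # one sparse category, as with renaming couples
for name, t in [("balanced", balanced), ("skewed", skewed)]:
    kappa, pi = kappa_pi(t)
    print(f"{name}: kappa = {kappa:.2f}, pi = {pi:.2f}")
```

With these invented counts, both tables show 90% observed agreement, yet κ and π drop from 0.80 on the balanced table to about 0.44 on the skewed one: the prevalence of the majority category inflates expected agreement, which is precisely the bias the article analyzes.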
