Paper information - Evaluer des annotations manuelles dispersées : les coefficients sont-ils suffisants pour estimer l'accord inter-annotateurs ?

This article describes work to evaluate the quality of the manual annotation of gene renaming relations in scientific abstracts, a task that generates dispersed annotations. To evaluate these annotations, we computed and compared the results obtained with commonly advocated inter-annotator agreement coefficients such as κ (Cohen, 1960) and π (Scott, 1955), and analyzed to what extent they are relevant for our data. We also studied the different weighting schemes applicable to κω (Cohen, 1968) and α (Krippendorff, 1980, 2004), and defined a way to compute distances between categories based on the produced annotations. Finally, we propose a first approach to estimating the bias introduced by prevalence.
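As an illustration of the two unweighted coefficients named in the abstract, here is a minimal Python sketch of Cohen's κ and Scott's π for two annotators. The function names, the label names, and the toy data are assumptions made for this example; they are not taken from the paper or its corpus. Both coefficients share the form (A_o - A_e) / (1 - A_e) and differ only in how the expected agreement A_e is estimated.

```python
from collections import Counter


def observed_agreement(a, b):
    """Proportion of items on which the two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: expected agreement uses each annotator's own label distribution."""
    n = len(a)
    a_o = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    a_e = sum((count_a[k] / n) * (count_b[k] / n) for k in set(a) | set(b))
    # Undefined (division by zero) when a_e == 1, i.e. a single category overall.
    return (a_o - a_e) / (1 - a_e)


def scott_pi(a, b):
    """Scott's pi: expected agreement uses one label distribution pooled over both annotators."""
    n = len(a)
    a_o = observed_agreement(a, b)
    pooled = Counter(a) + Counter(b)
    a_e = sum((pooled[k] / (2 * n)) ** 2 for k in pooled)
    return (a_o - a_e) / (1 - a_e)


# Toy data (hypothetical): one category is far more prevalent than the other.
annotator_1 = ["none"] * 8 + ["renaming", "renaming"]
annotator_2 = ["none"] * 8 + ["renaming", "none"]

raw = observed_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
pi = scott_pi(annotator_1, annotator_2)
print(f"A_o = {raw:.3f}, kappa = {kappa:.3f}, pi = {pi:.3f}")
```

On this toy data the raw agreement is 0.9, while κ is about 0.62 and π about 0.61: with one dominant category, the expected agreement is high and the chance-corrected values drop well below the raw figure, which is the kind of prevalence effect the abstract refers to.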

Claire François | Karën Fort | Maha Ghribi

[1] Malvina Nissim, et al. The Impact of Annotation on the Performance of Protein Tagging in Biomedical Text, 2006, LREC.

[2] Barbara Di Eugenio, et al. Squibs and Discussions: The Kappa Statistic: A Second Look, 2004, CL.

[3] Ron Artstein, et al. Survey Article: Inter-Coder Agreement for Computational Linguistics, 2008, CL.

[4] George Hripcsak, et al. Measuring agreement in medical informatics reliability studies, 2002, J. Biomed. Informatics.

[5] W. A. Scott, et al. Reliability of Content Analysis: The Case of Nominal Scale Coding, 1955.

[6] Jacob Cohen. A Coefficient of Agreement for Nominal Scales, 1960.

[7] Jean Carletta, et al. Assessing Agreement on Classification Tasks: The Kappa Statistic, 1996, CL.

[8] Jean Carletta, et al. Squibs: Reliability Measurement without Limits, 2008, CL.

[9] Jacob Cohen, et al. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit, 1968.

[10] Karën Fort, et al. Vers une méthodologie d’annotation des entités nommées en corpus ?, 2009, JEPTALNRECITAL.

[11] Marion Laignelet, et al. Repérer automatiquement les segments obsolescents à l’aide d’indices sémantiques et discursifs, 2009, JEPTALNRECITAL.

[12] P. Bayerl, et al. Measuring the reliability of manual annotations of speech corpora, 2004, Speech Prosody 2004.

[13] R. H. Finn. A Note on Estimating the Reliability of Categorical Data, 1970.

[14] S. Siegel, et al. Nonparametric Statistics for the Behavioral Sciences, 2022, The SAGE Encyclopedia of Research Design.

[15] Klaus Krippendorff, et al. Content Analysis: An Introduction to Its Methodology, 1980.