Assessing Reliability on Annotations (1): Theoretical Considerations

This is the first part of a two-report mini-series focussing on issues in the evaluation of annotations. In this theoretically oriented report we lay out the statistical background relevant for reliability studies, evaluate several related approaches, and sketch some arguments that may lend themselves to the development of an original statistic. A description of the project background, including the documentation of the annotation scheme at stake and the empirical data collected, as well as the results of applying the relevant statistics and a discussion of those results, is contained in the second, more empirically oriented report [Lücking and Stegmann, 2005]. The following points are dealt with in detail here: we summarize and contribute to an argument by Gwet [2001] indicating that the popular pi and kappa statistics [Carletta, 1996] are generally not appropriate for assessing the degree of agreement between raters on categorical type-ii data. We propose the use of AC1 [Gwet, 2001] instead, since it has desirable mathematical properties that make it more appropriate for assessing the results of expert raters in general. As far as type-i data are concerned, we make use of conventional correlation statistics which, unlike their AC1 and kappa cousins, do not adjust for agreement due to chance. Furthermore, we discuss issues in the interpretation of the results of the different statistics. Finally, we take up some loose ends from the previous chapters and sketch some advanced ideas pertaining to inter-rater agreement statistics, highlighting both the differences and the common ground between Gwet's perspective and our own. We conclude with some preliminary suggestions regarding the development of an original statistic, omega, that will be different in nature from those discussed before.
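To make the contrast between the two chance-correction schemes concrete, here is a minimal Python sketch (not part of the reports themselves; all function and variable names are ours) of Cohen's kappa [Cohen, 1960] and Gwet's AC1 [Gwet, 2001] for two raters on nominal data. Both statistics share the form (p_o − p_e)/(1 − p_e); they differ only in how the chance-agreement term p_e is estimated — kappa from the product of the raters' marginals, AC1 from the category prevalences.

```python
from collections import Counter

def observed_agreement(r1, r2):
    """Proportion of items on which the two raters assign the same category."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohen_kappa(r1, r2):
    """Cohen's kappa: chance agreement from the product of rater marginals."""
    n = len(r1)
    po = observed_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # p_e = sum over categories q of (marginal of rater 1) * (marginal of rater 2)
    pe = sum((c1[q] / n) * (c2[q] / n) for q in set(r1) | set(r2))
    return (po - pe) / (1 - pe)

def gwet_ac1(r1, r2):
    """Gwet's AC1: chance agreement from category prevalences pi_q."""
    n = len(r1)
    po = observed_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    cats = set(r1) | set(r2)
    # pi_q: mean of the two raters' marginal proportions for category q
    pis = [(c1[q] + c2[q]) / (2 * n) for q in cats]
    # p_e = (1 / (Q - 1)) * sum_q pi_q * (1 - pi_q)
    pe = sum(p * (1 - p) for p in pis) / (len(cats) - 1)
    return (po - pe) / (1 - pe)

# Illustration of the high-agreement/low-kappa paradox [Feinstein and
# Cicchetti, 1990] on invented, highly skewed data: 19 of 20 judgments agree.
r1 = ["y"] * 19 + ["n"]
r2 = ["y"] * 18 + ["n"] * 2
print(observed_agreement(r1, r2))  # 0.95
print(cohen_kappa(r1, r2))         # ~0.64 despite 95% raw agreement
print(gwet_ac1(r1, r2))            # ~0.94, close to the raw agreement
```

The example data are hypothetical, but they reproduce the behaviour discussed in the report: with skewed marginals, kappa's chance term p_e is inflated (0.86 here), pushing kappa far below the raw agreement, whereas AC1's prevalence-based chance term stays small.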

[1]  Hannes Rieser et al. Pointing in Dialogue, 2004.

[2]  Jean Carletta et al. Assessing Agreement on Classification Tasks: The Kappa Statistic, 1996, CL.

[3]  S. Siegel et al. Nonparametric Statistics for the Behavioral Sciences, 2022, The SAGE Encyclopedia of Research Design.

[4]  Ipke Wachsmuth et al. Deixis in Multimodal Human Computer Interaction: An Interdisciplinary Approach, 2003, Gesture Workshop.

[5]  Gareth Evans et al. Can There Be Vague Objects?, 1978.

[6]  Jacob Cohen et al. Weighted Kappa: Nominal Scale Agreement Provision for Scaled Disagreement or Partial Credit, 1968.

[7]  A. Feinstein et al. High Agreement but Low Kappa: I. The Problems of Two Paradoxes, 1990, Journal of Clinical Epidemiology.

[8]  Graham K. Rand et al. Quantitative Applications in the Social Sciences, 1983.

[9]  Andy Lücking et al. Assessing Reliability on Annotations (2): Statistical Results for the deikon Scheme, 2006.

[10]  Barbara Di Eugenio et al. Squibs and Discussions: The Kappa Statistic: A Second Look, 2004, CL.

[11]  D. Lewis et al. Vague Identity: Evans Misunderstood, 1988.

[12]  Stefan Kopp et al. MURML: A Multimodal Utterance Representation Markup Language for Conversational Agents, 2002.

[13]  Statistical Support for the Study of Structures in Multi-Modal Dialogue: Inter-Rater Agreement and Synchronization, 2004.

[14]  P. Sen et al. Introduction to Bivariate and Multivariate Analysis, 1981.

[15]  J. Fleiss. Measuring Nominal Scale Agreement Among Many Raters, 1971.

[16]  Jacob Cohen. A Coefficient of Agreement for Nominal Scales, 1960.

[17]  K. Gwet. Kappa Statistic is not Satisfactory for Assessing the Extent of Agreement Between Raters, 2002.

[18]  W. A. Scott. Reliability of Content Analysis: The Case of Nominal Scale Coding, 1955.

[19]  M. R. Novick et al. Statistical Theories of Mental Test Scores, 1971.

[20]  A New Method of Estimation of Interobserver Variation and Its Application to the Radiological Assessment of Osteoarthrosis in Hip Joints, 1988, Statistics in Medicine.

[21]  Stefan Kopp et al. Synthesizing Multimodal Utterances for Conversational Agents, 2004, Comput. Animat. Virtual Worlds.

[22]  Toni Rietveld et al. Statistical Techniques for the Study of Language and Language Behaviour, 1993.