On the Usage of Kappa to Evaluate Agreement on Coding Tasks

In recent years, the Kappa coefficient of agreement has become the de facto standard for evaluating intercoder agreement in the discourse and dialogue processing community. Along with this standard, researchers have adopted one specific scale for interpreting Kappa values, the one proposed by Krippendorff (1980). In this position paper, I highlight some issues that should be taken into account when evaluating Kappa values. I also speculate on whether Kappa could be used as a measure of a system's performance.
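For readers unfamiliar with the statistic, the following is a minimal sketch of a two-coder Kappa in the form given by Cohen (1960): observed agreement corrected for the agreement expected by chance, kappa = (P(A) - P(E)) / (1 - P(E)). It assumes that chance agreement is estimated from each coder's own label distribution; other formulations estimate it from a pooled distribution and can yield different values. The function name `cohen_kappa` and the example labels are illustrative only and do not come from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Two-coder Kappa, assuming per-coder marginals for chance agreement."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both coders must label the same non-empty set of items")

    n = len(labels_a)

    # P(A): proportion of items on which the two coders chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # P(E): chance agreement, summed over categories, using each coder's
    # own label frequencies (a category unused by either coder contributes 0).
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")

    return (observed - expected) / (1 - expected)


# Hypothetical data: two coders labelling four dialogue turns.
coder_a = ["accept", "accept", "reject", "reject"]
coder_b = ["accept", "reject", "reject", "reject"]
kappa = cohen_kappa(coder_a, coder_b)
```

In this hypothetical example the coders agree on three of four items (P(A) = 0.75) while chance agreement is 0.5, so Kappa is 0.5, noticeably lower than the raw agreement figure.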

[1] Jean Carletta, et al. Assessing Agreement on Classification Tasks: The Kappa Statistic, 1996, CL.

[2] Johanna D. Moore, et al. An Empirical Investigation of Proposals in Collaborative Dialogues, 1998, ACL.

[3] J. Searle. What is a Speech Act, 1996.

[4] Thomas G. Dietterich. What is machine learning?, 2020, Archives of Disease in Childhood.

[5] Toni Rietveld, et al. Statistical Techniques for the Study of Language and Language Behaviour, 1993.

[6] K. Krippendorff. Content Analysis: An Introduction to its Methodology. Beverly Hills, CA: Sage, 1980.

[7] S. Siegel, et al. Nonparametric Statistics for the Behavioral Sciences, 2022, The SAGE Encyclopedia of Research Design.

[8] David R. Traum, et al. Discourse Obligations in Dialogue Processing, 1994, ACL.

[9] Janyce Wiebe, et al. Development and Use of a Gold-Standard Data Set for Subjectivity Classifications, 1999, ACL.

[10] Rebecca J. Passonneau. Applying Reliability Metrics to Co-Reference Annotation, 1997, ArXiv.

[11] N. Andreasen, et al. Reliability studies of psychiatric diagnosis. Theory and practice, 1981, Archives of General Psychiatry.

[12] Klaus Krippendorff, et al. Content Analysis: An Introduction to Its Methodology, 1980.

[13] M. Black. Philosophy in America, 1965.

[14] Jacob Cohen. A Coefficient of Agreement for Nominal Scales, 1960.

[15] Johanna D. Moore, et al. The agreement process: an empirical investigation of human-human computer-mediated collaborative dialogs, 2000, Int. J. Hum. Comput. Stud.