Squibs and Discussions: The Kappa Statistic: A Second Look

In recent years, the kappa coefficient of agreement has become the de facto standard for evaluating intercoder agreement for tagging tasks. In this squib, we highlight issues that affect κ and that the community has largely neglected. First, we discuss the assumptions underlying different computations of the expected agreement component of κ. Second, we discuss how prevalence and bias affect the measure.
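
To make the first point concrete, the following is a minimal sketch, not the authors' own code, of how two common ways of computing expected agreement P(E) lead to different κ values on the same data. It assumes two coders and nominal categories. The function name `kappa`, the `pooled` flag, and the toy labels are hypothetical, introduced only for illustration. With `pooled=False`, P(E) is built from each coder's own category distribution (the computation in Cohen's formulation); with `pooled=True`, both coders are assumed to share one distribution estimated from their combined labels (the computation in Scott's formulation).

```python
from collections import Counter


def kappa(coder_a, coder_b, pooled=False):
    """Two-coder kappa = (P(A) - P(E)) / (1 - P(E)).

    pooled=False: P(E) uses each coder's own category proportions.
    pooled=True:  P(E) uses one distribution pooled over both coders.
    (Function and parameter names are illustrative assumptions.)
    """
    n = len(coder_a)
    if n == 0 or n != len(coder_b):
        raise ValueError("both coders must label the same non-empty item list")

    # Observed agreement: share of items given the same label by both coders.
    p_a = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    count_a, count_b = Counter(coder_a), Counter(coder_b)
    categories = set(count_a) | set(count_b)

    if pooled:
        # One shared distribution: average of the two coders' proportions.
        p_e = sum(((count_a[c] + count_b[c]) / (2 * n)) ** 2 for c in categories)
    else:
        # Separate distributions: product of the coders' marginal proportions.
        p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_a - p_e) / (1 - p_e)


# Toy data (invented): coder A uses "y" more often than coder B does.
labels_a = ["y"] * 8 + ["n"] * 2
labels_b = ["y"] * 6 + ["n"] * 4

kappa_separate = kappa(labels_a, labels_b)            # per-coder marginals
kappa_pooled = kappa(labels_a, labels_b, pooled=True)  # pooled marginals
```

On this toy data the observed agreement is the same in both calls, but the two coders' marginal distributions differ, so the pooled computation yields a larger P(E) and therefore a smaller κ. This is why a reported κ is hard to interpret unless the computation of expected agreement is also stated.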
