Beyond kappa: A review of interrater agreement measures

In 1960, Cohen introduced the kappa coefficient to measure chance‐corrected nominal scale agreement between two raters. Since then, numerous extensions and generalizations of this interrater agreement measure have been proposed in the literature. This paper reviews and critiques various approaches to the study of interrater agreement, for which the relevant data comprise either nominal or ordinal categorical ratings from multiple raters. It presents a comprehensive compilation of the main statistical approaches to this problem, descriptions and characterizations of the underlying models, and discussions of related statistical methodologies for estimation and confidence‐interval construction. The emphasis is on various practical scenarios and designs that underlie the development of these measures, and the interrelationships between them.
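As a concrete point of reference for the chance-corrected agreement that the review starts from, the two-rater kappa can be sketched as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from the two raters' marginal category frequencies. The function name `cohen_kappa` and the example ratings below are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted two-rater kappa for nominal ratings (illustrative sketch)."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both raters must rate the same non-empty set of subjects")
    n = len(ratings_a)

    # Observed agreement: share of subjects given the same category by both raters.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: sum over categories of the product of marginal proportions.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: 50 subjects, 20 rated "yes" by both raters, 15 rated "no"
# by both, 5 rated yes/no, and 10 rated no/yes.
rater_a = ["yes"] * 20 + ["yes"] * 5 + ["no"] * 10 + ["no"] * 15
rater_b = ["yes"] * 20 + ["no"] * 5 + ["yes"] * 10 + ["no"] * 15

# Here p_o = 0.7 and p_e = 0.5, so kappa is approximately 0.4.
print(round(cohen_kappa(rater_a, rater_b), 3))
```

The sketch covers only the unweighted, two-rater, nominal case; the weighted, multi-rater, and model-based extensions surveyed in the review need different estimators.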
