A Comparison of Reliability Coefficients for Ordinal Rating Scales

Kappa coefficients are commonly used to quantify reliability on a categorical scale, whereas correlation coefficients are commonly applied to assess reliability on an interval scale. Both types of coefficients can be used to assess the reliability of ordinal rating scales. In this study, we compare seven reliability coefficients for ordinal rating scales: three kappa coefficients (Cohen’s kappa, linearly weighted kappa, and quadratically weighted kappa) and four correlation coefficients (the intraclass correlation ICC(3,1), Pearson’s correlation, Spearman’s rho, and Kendall’s tau-b). The primary goal is to provide a thorough understanding of these coefficients so that applied researchers can make a sensible choice for ordinal rating scales. A second aim is to find out whether the choice of coefficient matters. Using analytic methods as well as simulated and empirical data, we studied to what extent different coefficients lead to the same conclusions about inter-rater reliability, and to what extent they measure agreement in a similar way. Analytically, it is shown that differences between quadratically weighted kappa and the Pearson and intraclass correlations increase as agreement becomes larger; differences between these three coefficients are generally small if differences between rater means and variances are small. Furthermore, the simulated and empirical data show that differences between all reliability coefficients tend to increase as agreement between the raters increases. Moreover, for the data in this study, the four correlation coefficients led to the same conclusion about inter-rater reliability in virtually all cases, and quadratically weighted kappa led to a similar conclusion as any of the correlation coefficients in a great number of cases. Hence, for the data in this study, it does not really matter which of these five coefficients is used. The four correlation coefficients and quadratically weighted kappa also tend to measure agreement in a similar way: their values are very highly correlated for the data in this study.
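To make the comparison concrete, the sketch below shows one way the seven coefficients could be computed for two raters scoring the same subjects on an ordinal scale. This is a minimal illustration, not the authors' code: the example ratings are made up, the kappa and correlation functions come from scikit-learn and scipy, and the ICC(3,1) helper is a hand-rolled implementation of the Shrout and Fleiss two-way mixed, single-measure, consistency formula.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings by two raters on a 5-point ordinal scale (illustration only).
rater1 = np.array([1, 2, 2, 3, 4, 4, 5, 3, 2, 5])
rater2 = np.array([1, 2, 3, 3, 4, 5, 5, 3, 1, 4])

# Kappa coefficients: unweighted, linearly weighted, quadratically weighted.
kappa = cohen_kappa_score(rater1, rater2)
kappa_lin = cohen_kappa_score(rater1, rater2, weights="linear")
kappa_quad = cohen_kappa_score(rater1, rater2, weights="quadratic")

# Correlation coefficients: Pearson's r, Spearman's rho, Kendall's tau-b.
r, _ = pearsonr(rater1, rater2)
rho, _ = spearmanr(rater1, rater2)
tau_b, _ = kendalltau(rater1, rater2)  # scipy's default variant is tau-b

def icc_3_1(ratings):
    """ICC(3,1): two-way mixed model, single measure, consistency definition."""
    x = np.asarray(ratings, dtype=float)                    # shape (n_subjects, n_raters)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()     # between-subject sum of squares
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()     # between-rater sum of squares
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols   # residual sum of squares
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

icc = icc_3_1(np.column_stack([rater1, rater2]))

for name, value in [("Cohen's kappa", kappa), ("linear kappa", kappa_lin),
                    ("quadratic kappa", kappa_quad), ("ICC(3,1)", icc),
                    ("Pearson", r), ("Spearman", rho), ("Kendall tau-b", tau_b)]:
    print(f"{name}: {value:.3f}")
```

On data like these, the quadratically weighted kappa, ICC(3,1), and Pearson correlation typically land close together, which is the pattern the study examines in detail.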
