Assessing the inter-rater agreement for ordinal data through weighted indexes

Assessing inter-rater agreement between observers in the case of ordinal variables is an important issue in both statistical theory and biomedical applications. Typically, this problem has been addressed with Cohen’s weighted kappa, a modification of the original kappa statistic proposed for nominal variables and two observers. Fleiss (1971) put forth a generalization of kappa to the case of multiple observers, but both Cohen’s and Fleiss’ kappa can exhibit paradoxical behavior, which makes their magnitude difficult to interpret. In this paper, a modification of Fleiss’ kappa that is not affected by these paradoxes is proposed and subsequently generalized to the case of ordinal variables. Monte Carlo simulations are used both to test statistical hypotheses and to calculate percentile and bootstrap-t confidence intervals based on this statistic, and its asymptotic normal distribution is demonstrated. Our results are applied to the classical dataset of Holmquist et al. (1967) on the classification, by multiple observers, of carcinoma in situ of the uterine cervix. Finally, we generalize the use of the proposed statistic s* to a bivariate case.
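As a concrete illustration of the quantities discussed above, the sketch below computes the classical Fleiss’ kappa for multiple raters and a percentile bootstrap confidence interval obtained by resampling subjects. This is only a minimal sketch of the standard statistic and of the kind of resampling scheme mentioned in the abstract; it does not implement the paper’s proposed statistic s*, and the function names and simulated data are assumptions made for the example.

```python
# Minimal sketch (not the paper's s* statistic): Fleiss' kappa for multiple
# raters plus a percentile bootstrap confidence interval over resampled subjects.
import numpy as np

def fleiss_kappa(counts):
    """counts: (N subjects) x (k categories) matrix; counts[i, j] is the
    number of raters who assigned subject i to category j."""
    counts = np.asarray(counts, dtype=float)
    n = counts[0].sum()                                  # raters per subject (assumed constant)
    N = counts.shape[0]
    p_j = counts.sum(axis=0) / (N * n)                   # marginal category proportions
    P_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))  # per-subject agreement
    P_bar, P_e = P_i.mean(), np.sum(p_j ** 2)            # observed and chance agreement
    return (P_bar - P_e) / (1 - P_e)

def bootstrap_ci(counts, stat=fleiss_kappa, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI obtained by resampling subjects (rows)."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    reps = [stat(counts[rng.integers(0, len(counts), len(counts))]) for _ in range(B)]
    return np.quantile(reps, [alpha / 2, 1 - alpha / 2])

if __name__ == "__main__":
    # Hypothetical example: 10 subjects, 6 raters, 3 ordered categories.
    rng = np.random.default_rng(1)
    ratings = rng.multinomial(6, [0.6, 0.3, 0.1], size=10)
    print("Fleiss' kappa:", round(fleiss_kappa(ratings), 3))
    print("95% percentile bootstrap CI:", bootstrap_ci(ratings))
```

A bootstrap-t interval, also mentioned in the abstract, would additionally studentize each replicate by an estimate of its standard error before taking the quantiles.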

[1] Klaus Krippendorff, et al. Estimating the Reliability, Systematic Error and Random Error of Interval Data. 1970.

[2] R. Light. Measures of response agreement for qualitative data: Some generalizations and alternatives. 1971.

[3] W. A. Scott, et al. Reliability of Content Analysis: The Case of Nominal Scale Coding. 1955.

[4] Ying Guo, et al. Measuring Agreement of Multivariate Discrete Survival Times Using a Modified Weighted Kappa Coefficient. Biometrics, 2009.

[5] A. Feinstein, et al. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 1990.

[6] Annette J. Dobson, et al. General observer-agreement measures on individual subjects and groups of subjects. 1984.

[7] J. Slattery, et al. Interobserver agreement for the assessment of handicap in stroke patients. Stroke, 1989.

[8] Christof Schuster, et al. A Note on the Interpretation of Weighted Kappa and its Relations to Other Rater Agreement Statistics for Metric Scales. 2004.

[9] B. Efron, et al. Bootstrap confidence intervals. 1996.

[10] Jacob Cohen, et al. The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability. 1973.

[11] T. Allison, et al. A New Procedure for Assessing Reliability of Scoring EEG Sleep Recordings. 1971.

[12] B. Everitt, et al. Large sample standard errors of kappa and weighted kappa. 1969.

[13] J. R. Landis, et al. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics, 1977.

[14] Bivariate Coefficients of Agreement among any Number of Observers. 1993.

[15] Jacob Cohen. A Coefficient of Agreement for Nominal Scales. 1960.

[16] P. Mielke, et al. A Generalization of Cohen's Kappa Agreement Measure to Interval Measurement and Multiple Raters. 1988.

[17] N. D. Holmquist, et al. Variability in classification of carcinoma in situ of the uterine cervix. Archives of Pathology, 1967.

[18] E. Lehmann. Elements of Large-Sample Theory. 1998.

[19] M. Dow. Explicit inverses of Toeplitz and associated matrices. 2008.

[20] K. Gwet. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. 2014.

[21] K. Krippendorff. Bivariate Agreement Coefficients for Reliability of Data. 1970.

[22] A. Agresti. An agreement model with kappa as parameter. 1989.

[23] On avoiding paradoxes in assessing inter-rater agreement. 2010.

[24] J. Fleiss, et al. Measuring Agreement for Multinomial Data. 1982.

[25] David Hinkley, et al. Bootstrap Methods: Another Look at the Jackknife. 2008.

[26] J. C. Nelson, et al. Statistical description of interrater variability in ordinal ratings. Statistical Methods in Medical Research, 2000.

[27] L. Lin, et al. A concordance correlation coefficient to evaluate reproducibility. Biometrics, 1989.

[28] L. A. Goodman, et al. Measures of association for cross classifications. 1979.

[29] Jacob Cohen, et al. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. 1968.

[30] S. Kumanyika, et al. A weighted concordance correlation coefficient for repeated measurement designs. Biometrics, 1996.

[31] B. Everitt, et al. Statistical Methods for Rates and Proportions. 1973.

[32] J. Fleiss. Measuring nominal scale agreement among many raters. 1971.

[33] Ulf Olsson, et al. A Measure of Agreement for Interval or Nominal Multivariate Observations by Different Sets of Judges. 2004.

[34] H. Brenner, et al. Dependence of Weighted Kappa Coefficients on the Number of Categories. Epidemiology, 1996.

[35] J. R. Landis, et al. The measurement of observer agreement for categorical data. Biometrics, 1977.

[36] Adelin Albert, et al. A note on the linearly weighted kappa coefficient for ordinal scales. 2009.

[37] M. Banerjee, et al. Beyond kappa: A review of interrater agreement measures. 1999.

[38] A. S. Hedayat, et al. Statistical Tools for Measuring Agreement. 2011.

[39] W. Willett, et al. Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology, 1987.

[40] M. Lawton, et al. Assessment of Older People: Self-Maintaining and Instrumental Activities of Daily Living. 1969.

[41] Brian Everitt, et al. Moments of the Statistics Kappa and Weighted Kappa. 1968.

[42] Albert Westergren, et al. Statistical methods for assessing agreement for ordinal data. Scandinavian Journal of Caring Sciences, 2005.

[43] H. Kraemer, et al. Measurement of reliability for categorical data in medical research. Statistical Methods in Medical Research, 1992.

[44] P. Shrout. Measurement reliability and agreement in psychiatry. Statistical Methods in Medical Research, 1998.