Computing inter-rater reliability and its variance in the presence of high agreement.
[1] M. H. Quenouille. Approximate Tests of Correlation in Time-Series, 1949.
[2] W. A. Scott, et al. Reliability of Content Analysis: The Case of Nominal Scale Coding, 1955.
[3] Jacob Cohen. A Coefficient of Agreement for Nominal Scales, 1960.
[4] J. Guilford, et al. A Note on the G Index of Agreement, 1964.
[5] Jacob Cohen, et al. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit, 1968.
[6] B. Everitt, et al. Large sample standard errors of kappa and weighted kappa, 1969.
[7] R. Light. Measures of response agreement for qualitative data: Some generalizations and alternatives, 1971.
[8] J. Fleiss. Measuring nominal scale agreement among many raters, 1971.
[9] P. Holland, et al. Discrete Multivariate Analysis, 1976.
[10] J. R. Landis, et al. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers, 1977, Biometrics.
[11] H. Kraemer. Ramifications of a population model for κ as a coefficient of reliability, 1979.
[12] A. J. Conger. Integration and generalization of kappas for multiple raters, 1980.
[13] A. Feinstein, et al. High agreement but low kappa: II. Resolving the paradoxes, 1990, Journal of Clinical Epidemiology.
[14] M. Banerjee, et al. Beyond kappa: A review of interrater agreement measures, 1999.
[15] A. Winsor. Sampling techniques, 2000, Nursing Times.