Coefficient Kappa: Some Uses, Misuses, and Alternatives

This paper considers some appropriate and inappropriate uses of coefficient kappa and of alternative kappa-like statistics. Discussion is restricted to the descriptive characteristics of these statistics for measuring agreement with categorical data in studies of reliability and validity. Special consideration is given to assumptions about whether marginals are fixed a priori or free to vary. In reliability studies, coefficient kappa is found to be appropriate when marginals are fixed. When either or both sets of marginals are free to vary, however, it is suggested that the "chance" term in kappa be replaced by 1/n, where n is the number of categories. In validity studies, we suggest considering whether one wants an index of improvement beyond "chance" or beyond the best a priori strategy employing base rates. In the former case, considerations are similar to those in reliability studies, with the marginals for the criterion measure treated as fixed. In the latter case, it is suggested that the largest marginal proportion for the criterion measure be used in place of the "chance" term in kappa. Similarities and differences among these statistics are discussed and illustrated with synthetic data.
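The three chance corrections contrasted in the abstract can be made concrete with a small computation. What follows is a minimal sketch, not from the paper: it assumes a k × k contingency table with the criterion measure on the rows, and the function name `agreement_indices`, the axis convention, and the synthetic 2 × 2 table are all illustrative. Each index has the same form, (p_o − p_c)/(1 − p_c); only the chance term p_c changes.

```python
# Minimal sketch (assumptions labeled in the lead-in) of three chance-corrected
# agreement indices: Cohen's kappa, a 1/n correction for free marginals, and a
# base-rate correction using the largest criterion marginal.
import numpy as np

def agreement_indices(table):
    """table: k x k contingency table of counts; rows = criterion, cols = rater."""
    table = np.asarray(table, dtype=float)
    total = table.sum()
    p_obs = np.trace(table) / total               # observed proportion of agreement
    row_p = table.sum(axis=1) / total             # criterion marginal proportions
    col_p = table.sum(axis=0) / total             # rater marginal proportions

    chance_kappa = float(row_p @ col_p)           # Cohen's chance term (product of marginals)
    chance_uniform = 1.0 / table.shape[0]         # 1/n, when marginals are free to vary
    chance_base_rate = float(row_p.max())         # best a priori strategy using base rates

    def corrected(p_c):                           # common form: (p_o - p_c) / (1 - p_c)
        return (p_obs - p_c) / (1.0 - p_c)

    return {
        "kappa": corrected(chance_kappa),
        "kappa_1_over_n": corrected(chance_uniform),
        "kappa_base_rate": corrected(chance_base_rate),
    }

# Synthetic 2x2 example: 90 agreements out of 100 ratings.
print(agreement_indices([[80, 5], [5, 10]]))
```

For this synthetic table the three indices diverge noticeably (kappa ≈ .61, the 1/n variant = .80, the base-rate variant ≈ .33), which illustrates the abstract's point that the choice of "chance" term, and hence the assumption about the marginals, materially changes the resulting index.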
