TUTORIAL IN BIOSTATISTICS: Kappa coefficients in medical research

SUMMARY Kappa coefficients are measures of correlation between categorical variables often used as reliability or validity coefficients. We recapitulate the development and definitions of the K (categories) by M (ratings) kappas (K × M), discuss what they are well- or ill-designed to do, and summarize where kappas now stand with regard to their application in medical research. The 2 × M (M ≥ 2) intraclass kappa seems the ideal measure of binary reliability; a 2 × 2 weighted kappa is an excellent choice, though not a unique one, as a validity measure. For both the intraclass and weighted kappas, we address continuing problems with their use. There are serious problems with using the K × M intraclass kappa (K > 2) or the various K × M weighted kappas for K > 2 or M > 2 in any context, either because they convey incomplete and possibly misleading information, or because other approaches are preferable to their use. We illustrate the use of the recommended kappas with applications in medical research. Copyright © 2002 John Wiley & Sons, Ltd.
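To make the measures named in the summary concrete, the following is a minimal sketch, not taken from the paper, of the unweighted and weighted kappa coefficients computed from a K × K table of joint classification counts for two raters, using the standard formulas of Cohen [49] and Cohen [23]. The function names, the default quadratic disagreement weights, and the 2 × 2 example table are illustrative assumptions.

import numpy as np

def cohen_kappa(table):
    """Unweighted kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()                        # joint proportions
    po = np.trace(p)                       # observed agreement
    pe = p.sum(axis=1) @ p.sum(axis=0)     # chance agreement from the margins
    return (po - pe) / (1.0 - pe)

def weighted_kappa(table, weights=None):
    """Weighted kappa with disagreement weights w[i, j] (zero on the diagonal).
    Quadratic weights are used by default (an assumption for this sketch)."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    k = p.shape[0]
    if weights is None:
        i, j = np.indices((k, k))
        weights = ((i - j) / (k - 1)) ** 2             # quadratic disagreement weights
    e = np.outer(p.sum(axis=1), p.sum(axis=0))         # expected proportions under independence
    return 1.0 - (weights * p).sum() / (weights * e).sum()

# Hypothetical 2 x 2 example: two raters classify 100 subjects as positive/negative.
counts = [[40, 10],
          [5, 45]]
print(round(cohen_kappa(counts), 3), round(weighted_kappa(counts), 3))

In the 2 × 2 case the single off-diagonal weight cancels, so the weighted and unweighted kappas coincide (here both equal 0.70); with quadratic weights and more categories the weighted kappa is the version linked to the intraclass correlation coefficient in reference [6].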

[1] H. Kraemer, et al. How many raters? Toward the most reliable diagnostic consensus. Statistics in Medicine, 1992.

[2] A. Donner, et al. Sample size requirements for the comparison of two or more coefficients of inter-observer agreement. Statistics in Medicine, 1998.

[3] H. Kraemer, et al. Statistical issues in assessing comorbidity. Statistics in Medicine, 1995.

[4] S. M. May, et al. Modelling observer agreement -- an alternative to kappa. Journal of Clinical Epidemiology, 1994.

[5] K. B. Hafner, et al. On assessing interrater agreement for multiple attribute responses. Biometrics, 1989.

[6] Jacob Cohen, et al. The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability. 1973.

[7] A hierarchical approach to inferences concerning interobserver agreement for multinomial data. Statistics in Medicine, 1997.

[8] How many raters are needed for a reliable diagnosis? 2001.

[9] J. Carlin, et al. Bias, prevalence and kappa. Journal of Clinical Epidemiology, 1993.

[10] J. Fleiss. Statistical methods for rates and proportions. 1974.

[11] Jacob Cohen. The Cost of Dichotomization. 1983.

[12] A Comparison of Three Indexes of Agreement Between Observers: Proportion of Agreement, G-Index, and Kappa. 1981.

[13] Helena Chmura Kraemer, et al. Evaluating Medical Tests: Objective and Quantitative Guidelines. 1992.

[14] A. E. Maxwell, et al. Coefficients of Agreement Between Observers and Their Interpretation. British Journal of Psychiatry, 1977.

[15] C. Lantz, et al. Behavior and interpretation of the κ statistic: Resolution of the two paradoxes. 1996.

[16] J. Fleiss. Measuring nominal scale agreement among many raters. 1971.

[17] J. Fleiss. Measuring agreement between two judges on the presence or absence of a trait. Biometrics, 1975.

[18] M. Eliasziw, et al. Testing the homogeneity of kappa statistics. Biometrics, 1996.

[19] J. Koval, et al. Interval estimation for Cohen's kappa as a measure of agreement. Statistics in Medicine, 2000.

[20] Helena C. Kraemer, et al. Estimating false alarms and missed events from interobserver agreement: Comment on Kaye. 1982.

[21] W. Willett, et al. Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology, 1987.

[22] H. Kraemer. Ramifications of a population model for κ as a coefficient of reliability. 1979.

[23] Jacob Cohen. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. 1968.

[24] H. Huynh. Reliability of multiple classifications. 1978.

[25] W. A. Scott. Reliability of Content Analysis: The Case of Nominal Scale Coding. 1955.

[26] M. Banerjee, et al. Beyond kappa: A review of interrater agreement measures. 1999.

[27] B. Efron. Bootstrap Methods: Another Look at the Jackknife. 1979.

[28] James A. Hanley. Standard error of the kappa statistic. 1987.

[29] R. Light. Measures of response agreement for qualitative data: Some generalizations and alternatives. 1971.

[30] V. Flack. Confidence intervals for the interrater agreement measure kappa. 1987.

[31] C. Janes. Agreement Measurement and the Judgment Process. The Journal of Nervous and Mental Disease, 1979.

[32] J. R. Landis, et al. A one-way components of variance model for categorical data. 1977.

[33] A. Feinstein, et al. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 1990.

[34] A. R. Feinstein, et al. A bibliography of publications on observer variability. Journal of Chronic Diseases, 1985.

[35] H. Kraemer. What is the 'right' statistical measure of twin concordance (or diagnostic reliability and validity)? Archives of General Psychiatry, 1997.

[36] J. Fleiss, et al. Inference About Weighted Kappa in the Non-Null Case. 1978.

[37] J. J. Bartko, et al. On the Methods and Theory of Reliability. The Journal of Nervous and Mental Disease, 1976.

[38] H. Kraemer, et al. Measurement of reliability for categorical data in medical research. Statistical Methods in Medical Research, 1992.

[39] Kappa, Measures of Marginal Symmetry and Intraclass Correlations. 1985.

[40] J. R. Landis, et al. The measurement of observer agreement for categorical data. Biometrics, 1977.

[41] J. Fleiss, et al. Jackknifing functions of multinomial frequencies, with an application to a measure of concordance. American Journal of Epidemiology, 1982.

[42] M. Aickin. Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen's kappa. Biometrics, 1990.

[43] P. Shrout. Measurement reliability and agreement in psychiatry. Statistical Methods in Medical Research, 1998.

[44] I. Gottesman, et al. Reliability and validity in binary ratings: areas of common misunderstanding in diagnosis and symptom ratings. Archives of General Psychiatry, 1978.

[45] E. Spitznagel, et al. A proposed solution to the base rate problem in the kappa statistic. Archives of General Psychiatry, 1985.

[46] M. Eliasziw, et al. Sample size requirements for reliability studies. Statistics in Medicine, 1987.

[47] J. Darroch, et al. Category Distinguishability and Observer Agreement. 1986.

[48] R. Fisher. Statistical Methods for Research Workers. 1971.

[49] Jacob Cohen. A Coefficient of Agreement for Nominal Scales. 1960.

[50] Dale J. Prediger, et al. Coefficient Kappa: Some Uses, Misuses, and Alternatives. 1981.

[51] M. R. Novick, et al. Statistical Theories of Mental Test Scores. 1971.

[52] D. C. Ross. Testing Patterned Hypotheses in Multi-Way Contingency Tables Using Weighted Kappa and Weighted Chi Square. 1977.