The kappa statistic in reliability studies: use, interpretation, and sample size requirements.

PURPOSE: This article examines and illustrates the use and interpretation of the kappa statistic in musculoskeletal research.

SUMMARY OF KEY POINTS: The reliability of clinicians' ratings is an important consideration in areas such as diagnosis and the interpretation of examination findings. Often, these ratings lie on a nominal or an ordinal scale. For such data, the kappa coefficient is an appropriate measure of reliability. Kappa is defined, in both weighted and unweighted forms, and its use is illustrated with examples from musculoskeletal research. Factors that can influence the magnitude of kappa (prevalence, bias, and non-independent ratings) are discussed, along with ways of evaluating the magnitude of an obtained kappa. Statistical testing of kappa, including the use of confidence intervals, is also addressed, and appropriate sample sizes for reliability studies using kappa are tabulated.

CONCLUSIONS: The article concludes with recommendations for the use and interpretation of kappa.
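As a minimal sketch of the quantities the abstract refers to (not code from the article), both unweighted and weighted kappa for two raters can be computed from a k × k contingency table of rating counts. The disagreement-weight formulation used here reduces to Cohen's unweighted kappa when every off-diagonal cell gets weight 1; the 2 × 2 table of 100 patients is hypothetical, chosen only for illustration.

```python
def cohen_kappa(table, weights=None):
    """Cohen's kappa for two raters.

    table[i][j]   = count of subjects rated category i by rater A and j by rater B.
    weights[i][j] = disagreement weight (0 on the diagonal). None means
                    unweighted kappa, i.e. weight 1 for any disagreement.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row = [sum(table[i]) for i in range(k)]                        # rater A marginals
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]   # rater B marginals
    if weights is None:
        # unweighted: all disagreements count equally
        weights = [[0 if i == j else 1 for j in range(k)] for i in range(k)]
    # observed and chance-expected weighted disagreement
    d_obs = sum(weights[i][j] * table[i][j] / n
                for i in range(k) for j in range(k))
    d_exp = sum(weights[i][j] * row[i] * col[j] / n ** 2
                for i in range(k) for j in range(k))
    return 1 - d_obs / d_exp

# Hypothetical example: two clinicians classify 100 patients as
# positive/negative on an examination finding.
table = [[40, 10],
         [15, 35]]
kappa = cohen_kappa(table)   # observed agreement 0.75, chance 0.50 -> kappa = 0.50
```

For ordinal scales, passing linear or quadratic disagreement weights (e.g. `weights[i][j] = (i - j)**2`) gives weighted kappa, penalizing distant disagreements more than adjacent ones.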
