High Agreement and High Prevalence: The Paradox of Cohen’s Kappa

Background: Cohen's Kappa is the most widely used agreement statistic in the literature. Under certain conditions, however, it is affected by a paradox that returns biased estimates of the statistic itself.

Objective: The aim of this study is to give the reader the information needed to make an informed choice of agreement measure, by highlighting some optimal properties of Gwet's AC1 in comparison with Cohen's Kappa, using a real-data example.

Method: During a literature review, a panel of three evaluators judged the quality of 57 randomized controlled trials, assigning a score to each trial with the Jadad scale. Quality was evaluated along the following dimensions: adopted design, randomization unit, and type of primary endpoint. For each of these features, the agreement among the three evaluators was calculated with Cohen's Kappa statistic and with Gwet's AC1 statistic, and the resulting values were compared with the observed agreement.

Results: The values of Cohen's Kappa would suggest that the agreement levels for the variables Unit, Design, and Primary Endpoint are entirely unsatisfactory. The AC1 statistic, on the contrary, yields plausible values that are in line with the corresponding values of the observed agreement.

Conclusion: We conclude that it is always appropriate to adopt the AC1 statistic, thereby avoiding the risk of incurring the paradox and drawing wrong conclusions from the agreement analysis.
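The paradox discussed above can be reproduced with a small numerical example. The sketch below is a minimal illustration, not the study's actual analysis: it computes observed agreement, Cohen's Kappa, and Gwet's AC1 for two hypothetical raters classifying items into two categories with highly skewed prevalence. The rater data, category labels, and function names are assumptions introduced for illustration only.

```python
from collections import Counter

def observed_agreement(r1, r2):
    """Proportion of items on which the two raters assign the same category."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohen_kappa(r1, r2):
    """Cohen's Kappa for two raters: chance agreement from the product of the raters' marginals."""
    n = len(r1)
    cats = set(r1) | set(r2)
    c1, c2 = Counter(r1), Counter(r2)
    p_o = observed_agreement(r1, r2)
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in cats)
    return (p_o - p_e) / (1 - p_e)

def gwet_ac1(r1, r2):
    """Gwet's AC1 for two raters: chance agreement from the average marginal proportion of each category."""
    n = len(r1)
    cats = set(r1) | set(r2)
    c1, c2 = Counter(r1), Counter(r2)
    p_o = observed_agreement(r1, r2)
    pi = {k: (c1[k] + c2[k]) / (2 * n) for k in cats}
    p_e = sum(p * (1 - p) for p in pi.values()) / (len(cats) - 1)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical, highly skewed ratings: 45 of 50 items rated "yes" by both
# raters, plus 5 items on which the raters disagree.
rater1 = ["yes"] * 45 + ["yes", "yes", "yes", "no", "no"]
rater2 = ["yes"] * 45 + ["no", "no", "no", "yes", "yes"]

print(f"observed agreement: {observed_agreement(rater1, rater2):.2f}")  # 0.90
print(f"Cohen's kappa:      {cohen_kappa(rater1, rater2):.2f}")         # about -0.05
print(f"Gwet's AC1:         {gwet_ac1(rater1, rater2):.2f}")            # about 0.89
```

With these illustrative data, the observed agreement is 0.90, yet Cohen's Kappa drops slightly below zero because the chance agreement implied by the product of the marginals is itself about 0.90, whereas Gwet's AC1 remains close to the observed agreement, at about 0.89.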
