Some common errors of experimental design, interpretation and inference in agreement studies

We identify and discuss common methodological errors in agreement studies and in the use of kappa indices, as found in publications in the medical and behavioural sciences. Our analysis is based on a proposed statistical model in line with the models typically employed in metrology and measurement theory. A first cluster of errors relates to nonrandom sampling, which can introduce substantial bias into the estimated agreement. Second, when class prevalences are strongly nonuniform, the use of the kappa index becomes precarious: its large partial derivatives typically result in large standard errors of the estimates, and in such cases the index one-sidedly reflects the consistency of the most prevalent class, or the class prevalences themselves. A final cluster of errors concerns interpretation pitfalls, which may lead to incorrect conclusions drawn from agreement studies; these issues are clarified on the basis of the proposed statistical model. The identified errors are illustrated with actual studies published in prestigious journals. The analysis yields a number of guidelines and recommendations for agreement studies, including the recommendation to use alternatives to the kappa index in certain situations.
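To make the prevalence issue concrete, the following minimal sketch (the contingency tables are illustrative and not taken from the paper) computes Cohen's kappa for two hypothetical 2x2 rater-by-rater tables that both show 85% observed agreement. With balanced class prevalences kappa is about 0.70; with strongly skewed prevalences it drops to about 0.32, the familiar "high agreement but low kappa" behaviour.

import numpy as np

def cohens_kappa(table):
    # Cohen's kappa for a square rater-by-rater contingency table of counts.
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_observed = np.trace(table) / n          # proportion of cases on which the raters agree
    row_marg = table.sum(axis=1) / n          # rater 1 class prevalences
    col_marg = table.sum(axis=0) / n          # rater 2 class prevalences
    p_expected = np.dot(row_marg, col_marg)   # chance agreement under independence of raters
    return (p_observed - p_expected) / (1.0 - p_expected)

# Balanced prevalences: 85 of 100 cases agreed on.
balanced = [[45, 5],
            [10, 40]]

# Strongly skewed prevalences: also 85 of 100 cases agreed on,
# but almost all agreement comes from the dominant class.
skewed = [[80, 5],
          [10, 5]]

for name, tab in [("balanced", balanced), ("skewed", skewed)]:
    print(name, round(cohens_kappa(tab), 2))   # balanced 0.7, skewed 0.32

The observed agreement is identical in both tables; only the chance-agreement correction differs, which is why the estimate becomes so sensitive to the marginal prevalences.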
