Interrater agreement and interrater reliability: key concepts, approaches, and applications.

Evaluations of interrater agreement and interrater reliability can be applied in a number of different contexts and are frequently encountered in social and administrative pharmacy research. The objectives of this study were to highlight the key differences between interrater agreement and interrater reliability; to describe the key concepts and approaches to evaluating each; and to provide examples of their application to research in social and administrative pharmacy. This is a descriptive review of interrater agreement and interrater reliability indices, outlining their practical applications and interpretation in social and administrative pharmacy research. Interrater agreement indices assess the extent to which the responses of two or more independent raters are concordant. Interrater reliability indices assess the extent to which raters consistently distinguish between different responses. A number of indices exist; common examples include the kappa statistic, the Kendall coefficient of concordance, the Bland-Altman plot, and the intraclass correlation coefficient. Guidance on the selection of an appropriate index is provided. In conclusion, the choice of an appropriate index to evaluate interrater agreement or interrater reliability depends on a number of factors, including the context in which the study is undertaken, the type of variable under consideration, and the number of raters making assessments.
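
The idea of an agreement index corrected for chance can be made concrete with a small worked example. The sketch below is illustrative only and is not drawn from the review; the rater data, category labels, and function name are hypothetical. It computes Cohen's kappa for two raters assigning a nominal category to the same set of subjects, as the observed proportion of agreement corrected for the agreement expected by chance from each rater's marginal distribution.

```python
# Illustrative sketch (not from the article): Cohen's kappa for two raters
# assigning nominal categories to the same subjects.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on a nominal variable."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)

    # Observed proportion of exact agreement.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each rater's marginal proportions.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    p_expected = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)

    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical example: two pharmacists independently classifying 10
# prescriptions as having a drug-related problem ("DRP") or not ("none").
rater_1 = ["DRP", "DRP", "none", "DRP", "none", "none", "DRP", "none", "DRP", "none"]
rater_2 = ["DRP", "none", "none", "DRP", "none", "DRP", "DRP", "none", "DRP", "none"]
print(round(cohens_kappa(rater_1, rater_2), 2))  # observed agreement 0.8, kappa 0.6
```

For ordinal or continuous ratings, a weighted kappa or the intraclass correlation coefficient would typically be the more appropriate choice, in line with the guidance on index selection summarised above.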
