Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial

Many research designs require the assessment of inter-rater reliability (IRR) to demonstrate consistency among observational ratings provided by multiple coders. However, many studies use incorrect statistical procedures, fail to fully report the information necessary to interpret their results, or do not address how IRR affects the power of their subsequent analyses for hypothesis testing. This paper provides an overview of methodological issues related to the assessment of IRR, with a focus on study design, selection of appropriate statistics, and the computation, interpretation, and reporting of some commonly used IRR statistics. Computational examples include SPSS and R syntax for computing Cohen's kappa and intraclass correlations to assess IRR.
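
As a companion illustration (not taken from the paper itself), the sketch below shows how Cohen's kappa and an intraclass correlation could be computed in R, assuming the third-party irr package is installed; the ratings data frames and their column names are hypothetical, and the paper's own SPSS and R syntax may differ.

# Minimal R sketch, assuming the 'irr' package is available
# (install.packages("irr")); all ratings below are made-up examples.
library(irr)

# Hypothetical nominal ratings: two coders classify 10 subjects
nominal_ratings <- data.frame(
  coder1 = c("A", "B", "A", "A", "C", "B", "A", "C", "B", "A"),
  coder2 = c("A", "B", "A", "C", "C", "B", "A", "C", "A", "A")
)

# Cohen's (unweighted) kappa for two raters on a nominal scale
kappa2(nominal_ratings, weight = "unweighted")

# Hypothetical interval-scale ratings of the same 10 subjects
interval_ratings <- data.frame(
  coder1 = c(4, 2, 5, 3, 4, 1, 2, 5, 3, 4),
  coder2 = c(4, 3, 5, 3, 5, 1, 2, 4, 3, 4)
)

# Two-way, absolute-agreement, single-measures ICC
icc(interval_ratings, model = "twoway", type = "agreement", unit = "single")

The choice of ICC variant (one-way vs. two-way model, consistency vs. agreement, single vs. average measures) should match the study design, which is among the design decisions the paper discusses.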
