Interrater reliability and the kappa statistic: a comment on Morris et al. (2008).

Establishing the interrater reliability of instruments is an important issue in nursing research and practice. Morris et al.'s (2008) paper highlights the problem of choosing an appropriate statistical approach for the analysis of interrater reliability data, and the authors raise the important and relevant question of how to interpret kappa-like statistics such as Cohen's kappa (k) or weighted kappa (kw). It is true that the so-called 'chance-corrected' k has frequently been criticised because its value depends on the prevalence of the rated trait in the sample (the 'base rate problem'). Consequently, even if two raters nearly or exactly agree, k-coefficients are near or equal to 0 when the prevalence of the rated characteristic is very high or very low. This contradicts the natural expectation that interrater reliability must be high as well. However, this is neither a limitation nor a 'main drawback' (p. 646). In fact, it is a desired property, because k-coefficients are classical interrater reliability coefficients (Dunn, 2004; Kraemer et al., 2002; Landis and Koch, 1975). In classical test theory, reliability is defined as the ratio of the variability between subjects (or targets) to the total variability, where the total variability is the sum of the subject (target) variability and the measurement error (Dunn, 2004; Streiner and Norman, 2003). Consequently, if the variance between subjects is very small or even zero, the reliability coefficient will be near zero as well. Therefore, reliability coefficients reflect not only the degree of agreement between raters, but also the degree to which the raters are able to differentiate between the subjects in the sample.
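
To make the 'base rate problem' concrete, consider a small numerical sketch (not part of Morris et al. (2008) or the references above; the 2 x 2 agreement table and the helper function cohens_kappa are hypothetical and chosen purely for illustration). Two raters agree on 90 of 100 subjects, but because 95 of the 100 subjects are rated as showing the trait, Cohen's k is nevertheless close to zero:

def cohens_kappa(table):
    # Cohen's kappa for a square agreement table given as a list of lists.
    n = sum(sum(row) for row in table)
    p_observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Rows: rater A (trait present, trait absent); columns: rater B.
# Observed agreement is (90 + 0) / 100 = 0.90, expected agreement is
# 0.95 * 0.95 + 0.05 * 0.05 = 0.905, so k = (0.90 - 0.905) / 0.095, about -0.05.
high_prevalence_table = [[90, 5],
                         [5, 0]]
print(cohens_kappa(high_prevalence_table))

In classical test theory terms this outcome is exactly what should be expected: with 95 of 100 subjects showing the trait there is almost no between-subject variability, so the ratio of subject variability to total variability, and hence any reliability coefficient, must be small.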