In their article, Costa-Santos et al. [1] provide a valuable example of the difficulties in comparing and interpreting reliability and agreement coefficients arising from the same measurement situation. Debates and proposals about the correct coefficients for measuring agreement and reliability can be traced back to the early 1980s [2,3]. Various approaches were discussed to overcome the "limitations" and "drawbacks" of reliability measures (e.g., Refs. [4,5]), and even today, new alternatives are proposed (e.g., Refs. [6,7]). However, it seems that much of the confusion around reliability and agreement estimation was, and still is, caused by conceptual ambiguities. There are important differences between the concepts of agreement and reliability (e.g., Refs. [8,9]).

Agreement addresses the question of whether diagnoses, scores, or judgments are identical or similar, or the degree to which they differ. Here, the absolute degree of measurement error is of interest. Consequently, any variability between subjects, or the distribution of the rated trait in the population, does not matter. For instance, percent agreement for nominal data or limits of agreement for interval and ratio data are excellent measures because they provide exactly this kind of information in a simple manner.

Reliability coefficients, on the other hand, behave differently. Reliability is typically defined as the ratio of the variability between subjects to the total variability of all scores in the sample, the latter including the variability between scores of the same subjects (e.g., by different raters or at different times). Therefore, reliability coefficients (e.g., kappa, the intraclass correlation coefficient) provide information about the ability of the scores to distinguish between subjects. From this, it also follows that reliability coefficients must be low when there is little variability among the scores or diagnoses obtained from the instrument under investigation. This occurs when the range of obtained scores is restricted or when prevalence is very high or very low. For example, if all raters rate all medical students as "excellent," agreement is perfect, but the reliability of the scale is zero because there is no between-subject variance. It should also be noted that exact agreement among raters or over time does not enter into (most of) the formulas for reliability, because all that matters is that the subjects are rank ordered similarly across time or by different raters. Interestingly, in their introduction, Costa-Santos et al. [1] refer to Burdock et al. [10], stating that these authors proposed a "cutoff value of 0.75 … to signify good agreement."
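To make the distinction concrete, the following minimal sketch (the ratings are invented for illustration and are not taken from Costa-Santos et al. [1]) computes percent agreement, a pure agreement index, and Cohen's kappa, a chance-corrected, reliability-type index, for two hypothetical raters classifying 100 subjects under extreme prevalence.

```python
def percent_agreement(r1, r2):
    """Proportion of subjects on whom the two raters give identical ratings."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters and nominal categories."""
    n = len(r1)
    categories = sorted(set(r1) | set(r2))
    p_o = percent_agreement(r1, r2)
    # Chance-expected agreement from each rater's marginal category proportions.
    p_e = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")

# Invented data: 100 subjects, extreme prevalence. Both raters call 98 subjects
# "normal" (0), but they never agree on which subjects are "abnormal" (1).
rater_a = [0] * 96 + [1, 1, 0, 0]
rater_b = [0] * 96 + [0, 0, 1, 1]

print(f"percent agreement: {percent_agreement(rater_a, rater_b):.2f}")  # 0.96, nearly perfect
print(f"Cohen's kappa:     {cohens_kappa(rater_a, rater_b):.2f}")       # about -0.02, no reliability

# The extreme case from the text: every subject is rated "excellent" by both raters.
# Agreement is perfect, but there is no between-subject variance, so a reliability
# coefficient cannot be estimated (here p_e == 1 and kappa is undefined).
all_excellent = ["excellent"] * 100
print(percent_agreement(all_excellent, all_excellent))  # 1.0
print(cohens_kappa(all_excellent, all_excellent))       # nan
```

With these invented ratings the two indices diverge sharply: observed agreement is almost perfect, yet kappa is essentially zero because nearly all scores fall into one category, the familiar "high agreement but low kappa" situation.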
[1] J. Bernardes, et al. Agreement studies in obstetrics and gynaecology: inappropriateness, controversies and consequences. BJOG: An International Journal of Obstetrics and Gynaecology, 2005.
[2] D. Streiner, et al. Health Measurement Scales. 2008.
[3] M. Potter, et al. Resolving the paradoxes. 2008.
[4] J. Bernardes, et al. The limits of agreement and the intraclass correlation coefficient may be inconsistent in the interpretation of agreement. Journal of Clinical Epidemiology, 2011.
[5] A. Souto, et al. Assessment of disagreement: a new information-based approach. Annals of Epidemiology, 2010.
[6] P. Prescott, et al. Issues and approaches to estimating interrater reliability in nursing research. Research in Nursing & Health, 1981.
[7] K. Gwet. Computing inter-rater reliability and its variance in the presence of high agreement. The British Journal of Mathematical and Statistical Psychology, 2008.
[8] R. Zwick, et al. Another look at interrater agreement. Psychological Bulletin, 1988.
[9] A. House, et al. Measures of interobserver agreement: calculation formulas and distribution effects. 1981.
[10] C. Terwee, et al. When to use agreement versus reliability measures. Journal of Clinical Epidemiology, 2006.
[11] W. Vach, et al. The dependence of Cohen's kappa on the prevalence does not matter. Journal of Clinical Epidemiology, 2005.
[12] A. Feinstein, et al. High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 1990.
[13] J. L. Fleiss, et al. A new view of inter-observer agreement. 1963.