Measuring interrater reliability among multiple raters: an example of methods for nominal data.

This paper reviews and critiques approaches to measuring reliability among multiple raters when the ratings are nominal. We consider the overall reliability of a group of raters (using kappa-like statistics) as well as the reliability of individual raters with respect to the group. We introduce modifications of previously published estimators that are appropriate for stratified sampling frames, and we interpret these measures in light of standard errors computed using the jackknife. Analyses of a set of 48 anaesthesia case histories, in which 42 anaesthesiologists independently rated the appropriateness of care on a nominal scale, serve as an example.
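The group-level, chance-corrected agreement referred to here is commonly computed as Fleiss' kappa for many raters and nominal categories. The sketch below illustrates that statistic together with a leave-one-subject-out jackknife standard error for the unstratified case; the stratified modifications discussed in the paper are not reproduced. The function names, the NumPy implementation, and the toy data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss-style kappa for nominal ratings by many raters.

    counts: (N subjects x k categories) array; counts[i, j] is the number of
    raters who assigned subject i to category j. Assumes the same number of
    raters for every subject (each row sums to n).
    """
    counts = np.asarray(counts, dtype=float)
    N, k = counts.shape
    n = counts[0].sum()                                        # raters per subject
    p_j = counts.sum(axis=0) / (N * n)                         # marginal category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-subject observed agreement
    P_bar = P_i.mean()                                         # mean observed agreement
    P_e = np.square(p_j).sum()                                 # chance-expected agreement
    return (P_bar - P_e) / (1 - P_e)

def jackknife_se(counts):
    """Leave-one-subject-out jackknife standard error for the kappa estimate."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]
    loo = np.array([fleiss_kappa(np.delete(counts, i, axis=0)) for i in range(N)])
    return np.sqrt((N - 1) / N * np.square(loo - loo.mean()).sum())

# Hypothetical toy data: 5 subjects, 3 categories, 10 raters per subject.
toy = np.array([[8, 1, 1],
                [2, 6, 2],
                [0, 0, 10],
                [4, 3, 3],
                [9, 1, 0]])
print(fleiss_kappa(toy), "+/-", jackknife_se(toy))
```

The jackknife here deletes one subject (case history) at a time, mirroring the abstract's use of jackknife standard errors to interpret the agreement measures; with stratified sampling, the deletion scheme would instead respect the strata.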
