Measurement of observer agreement.

Statistical measures used in diagnostic imaging to express observer agreement on categorical data are described. These measures characterize the reliability of imaging methods and the reproducibility of disease classifications and, occasionally and with great care, serve as a surrogate for accuracy. The review concentrates on the chance-corrected indices kappa and weighted kappa. Examples from the imaging literature illustrate the method of calculation and the effects of both disease prevalence and the number of rating categories. Other, less frequently used measures of agreement, including multiple-rater kappa, are referenced and described briefly.
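The chance-corrected indices named above can be sketched in a few lines. The following is an illustrative implementation of Cohen's kappa and quadratically weighted kappa from a square contingency table of two readers' ratings; the function names and the example counts are assumptions for demonstration, not data from the review:

```python
# Sketch of Cohen's kappa and quadratically weighted kappa for two readers.
# table[i][j] = number of cases rated category i by reader 1 and j by reader 2.
# The counts used below are hypothetical.

def cohens_kappa(table):
    n = sum(sum(row) for row in table)
    k = len(table)
    p_o = sum(table[i][i] for i in range(k)) / n                   # observed agreement
    row = [sum(table[i]) for i in range(k)]                        # reader 1 marginals
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]   # reader 2 marginals
    p_e = sum(row[i] * col[i] for i in range(k)) / n ** 2          # chance agreement
    return (p_o - p_e) / (1 - p_e)

def weighted_kappa(table):
    # Quadratic agreement weights: full credit on the diagonal,
    # partial credit for near-miss ratings on an ordinal scale.
    n = sum(sum(row) for row in table)
    k = len(table)
    w = [[1 - (i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_o = sum(w[i][j] * table[i][j] for i in range(k) for j in range(k)) / n
    p_e = sum(w[i][j] * row[i] * col[j] for i in range(k) for j in range(k)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 2x2 table: 80 agreements, 20 disagreements.
print(cohens_kappa([[40, 10], [10, 40]]))  # 0.6
```

With only two categories the quadratic off-diagonal weights are zero, so the weighted and unweighted indices coincide; the prevalence effect noted in the abstract can be seen by skewing the marginals (e.g., `[[90, 5], [5, 0]]` keeps 90% raw agreement but drives kappa below zero, since nearly all agreement is expected by chance).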
