The measurement of observer agreement for categorical data.

This paper presents a general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies. The procedure essentially involves constructing functions of the observed proportions that measure the extent to which the observers agree among themselves, and constructing test statistics for hypotheses involving these functions. Tests for interobserver bias are presented in terms of first-order marginal homogeneity, and measures of interobserver agreement are developed as generalized kappa-type statistics. These procedures are illustrated with a clinical diagnosis example from the epidemiological literature.
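As a point of orientation only, the simplest kappa-type statistic is the two-observer kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from the marginal proportions. The sketch below computes that basic form; it is not the generalized methodology of the paper. The function name `cohen_kappa` and the 2x2 counts are hypothetical and chosen purely for illustration.

```python
def cohen_kappa(table):
    """Two-observer kappa from a square table of counts.

    table[i][j] = number of subjects placed in category i by observer A
    and in category j by observer B (hypothetical layout).
    """
    k = len(table)
    total = sum(sum(row) for row in table)

    # Observed agreement: proportion of subjects on the main diagonal.
    p_o = sum(table[i][i] for i in range(k)) / total

    # Marginal proportions for each observer.
    row_marg = [sum(table[i]) / total for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / total for j in range(k)]

    # Chance-expected agreement under independent marginals.
    p_e = sum(r * c for r, c in zip(row_marg, col_marg))

    return (p_o - p_e) / (1 - p_e)


# Illustrative counts only (not data from the paper).
counts = [[20, 5],
          [10, 15]]
kappa = cohen_kappa(counts)
print(round(kappa, 3))
```

For these made-up counts, observed agreement is 0.7 and chance agreement is 0.5, so kappa is about 0.4. Unequal row and column marginals, as in this table, are the kind of difference that a test of marginal homogeneity is meant to detect.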
