General estimators for the reliability of qualitative data

We study a proportional reduction in loss (PRL) measure for the reliability of categorical data and consider the general case in which each of N judges assigns a subject to one of K categories. This measure has been shown to be equivalent to a measure proposed by Perreault and Leigh in the special case of two equally competent judges and a uniform prior distribution on the correct category. We consider a general framework in which the correct category has an arbitrary prior distribution and the classification probabilities vary by correct category, judge, and category of classification. In this setting, we consider PRL reliability measures based on two estimators of the correct category: the empirical Bayes estimator and an estimator based on the judges' consensus choice. We also discuss four important special cases of the general model and study several types of lower bounds for PRL reliability.
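As a rough illustration of the two estimators of the correct category named above, the following Python sketch computes a posterior over the correct category and a consensus (modal) choice for a single subject. It is not the paper's procedure. It assumes that judges classify independently given the correct category, and that the prior and the classification probabilities are already known (an empirical Bayes analysis would estimate them from the data). The names `posterior`, `bayes_choice`, `consensus_choice`, `prior`, `class_probs`, and `ratings` are hypothetical.

```python
from collections import Counter


def posterior(prior, class_probs, ratings):
    """Posterior over the correct category for one subject.

    prior[c]             -- prior probability that c is the correct category
    class_probs[j][c][k] -- probability that judge j assigns category k
                            when the correct category is c
    ratings[j]           -- category assigned by judge j

    Assumption (not from the source): judges are conditionally
    independent given the correct category.
    """
    weights = []
    for c, p in enumerate(prior):
        w = p
        for j, k in enumerate(ratings):
            w *= class_probs[j][c][k]
        weights.append(w)
    total = sum(weights)
    if total == 0:
        raise ValueError("ratings have zero probability under the model")
    return [w / total for w in weights]


def bayes_choice(prior, class_probs, ratings):
    """Category with the largest posterior probability."""
    post = posterior(prior, class_probs, ratings)
    return max(range(len(post)), key=post.__getitem__)


def consensus_choice(ratings):
    """Category chosen by the most judges (ties go to the first seen)."""
    return Counter(ratings).most_common(1)[0][0]


# Hypothetical example: K = 2 categories, N = 3 judges, uniform prior,
# every judge correct with probability 0.8.
prior = [0.5, 0.5]
judge = [[0.8, 0.2], [0.2, 0.8]]
class_probs = [judge, judge, judge]
ratings = [0, 0, 1]

post = posterior(prior, class_probs, ratings)
print(post, bayes_choice(prior, class_probs, ratings), consensus_choice(ratings))
```

With equally competent judges and a uniform prior, as in this example, the two estimators pick the same category; they can differ once the prior or the judges' classification probabilities are unequal.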

[1] Roland T. Rust, et al. Reliability and expected loss: A unifying principle, 1994.

[2] Donald B. Rubin, et al. The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles, 1974.

[3] M. Woodroofe. On Model Selection and the Arc Sine Laws, 1982.

[4] W. D. Perreault, et al. Reliability of Nominal Data Based on Qualitative Judgments, 1989.

[5] G. Schwarz. Estimating the Dimension of a Model, 1978.

[6] H. A. David, et al. Order Statistics (2nd ed.), 1981.

[7] S. Haberman. Log-Linear Models for Frequency Tables Derived by Indirect Observation: Maximum Likelihood Equations, 1974.

[8] D. S. Sivia, et al. Data Analysis, 1996, Encyclopedia of Evolutionary Psychological Science.

[9] Jacob Cohen. A Coefficient of Agreement for Nominal Scales, 1960.

[10] H. Akaike, et al. Information Theory and an Extension of the Maximum Likelihood Principle, 1973.

[11] W. Batchelder, et al. Culture as Consensus: A Theory of Culture and Informant Accuracy, 1986.

[12] Jacob Cohen, et al. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit, 1968.

[13] J. Loevinger, et al. The technic of homogeneous tests compared with some aspects of scale analysis and factor analysis, 1948, Psychological Bulletin.

[14] Wim J. van der Linden, et al. The Internal and External Optimality of Decisions Based on Tests, 1979.

[15] B. J. Winer. Statistical Principles in Experimental Design, 1992.

[16] Marie Adele Hughes, et al. Intercoder Reliability Estimation Approaches in Marketing: A Generalizability Theory Framework for Quantitative Data, 1990.

[17] H. Schouten, et al. Measuring Pairwise Agreement Among Many Observers. II. Some Improvements and Additions, 1982.

[18] William H. Batchelder, et al. New Results in Test Theory Without an Answer Key, 1989.

[19] L. Cronbach, et al. Time-limit tests: Estimating their reliability and degree of speeding, 1951, Psychometrika.

[20] W. Batchelder, et al. Test theory without an answer key, 1988.

[21] Approximate Upper Percentage Points for Extreme Values in Multinomial Sampling, 1956.

[22] R. Rust, et al. Model selection criteria: an investigation of relative accuracy, posterior probabilities, and combinations of criteria, 1995.

[23] W. R. Dillon, et al. A Probabilistic Latent Class Model for Assessing Inter-Judge Reliability, 1984, Multivariate Behavioral Research.

[24] L. Cronbach. Coefficient alpha and the internal structure of tests, 1951.

[25] Harry Kesten, et al. A Property of the Multinomial Distribution, 1959.

[26] S. Lohr. Statistics (2nd Ed.), 1994.

[27] H. White. Maximum Likelihood Estimation of Misspecified Models, 1982.

[28] Herbert L. Costner, et al. Criteria for Measures of Association, 1965.

[29] Hubert J. A. Schouten, et al. Nominal scale agreement among observers, 1986.