Diversity of decision-making models and the measurement of interrater agreement.

Several papers have criticized the kappa coefficient for its tendency to fluctuate with sample base rates. The importance of these criticisms is difficult to evaluate because they are framed in terms of a highly specific model of diagnostic decision making. In this article, diagnostic decision making is viewed as a special case of signal detection theory. Each diagnostic process is characterized by a function that relates the probability of a case receiving a positive diagnosis to the severity or salience of its symptoms. The shape of this diagnosability curve strongly affects the value of kappa obtained in a study of interrater reliability, how that value changes in response to variation in the base rates, and how closely it corresponds to the validity of the diagnostic decisions. When criterion diagnoses are unavailable for comparison, the common practice of evaluating a diagnostic procedure on the basis of the magnitude of the kappa coefficient observed in a reliability study is therefore questionable. New methods for measuring interrater agreement are needed, and possible directions for research in this area are discussed.
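For reference, Cohen's kappa corrects the observed agreement for the agreement expected by chance from the raters' marginal rates:

\[
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = p_1 p_2 + (1 - p_1)(1 - p_2),
\]

where \(p_o\) is the observed proportion of agreement and, for two raters making binary diagnoses, \(p_1\) and \(p_2\) are their marginal rates of positive diagnosis. Because \(p_e\) is computed from the marginals, kappa is sensitive to the base rate of the diagnosed condition, which is the source of the criticisms discussed above.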
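The base-rate effect can be illustrated with a minimal simulation sketch under one simple signal detection model (the case/noncase severity distributions, perceptual noise level, and shared decision threshold below are illustrative assumptions, not values from the article): two raters independently threshold noisy readings of a latent severity, which implies a probit-shaped diagnosability curve, and the resulting kappa shifts with the base rate even though the diagnostic process itself is unchanged.

    import numpy as np

    rng = np.random.default_rng(0)

    def cohens_kappa(a, b):
        # Observed agreement, and chance agreement from marginal positive rates.
        p_o = np.mean(a == b)
        p1, p2 = a.mean(), b.mean()
        p_e = p1 * p2 + (1 - p1) * (1 - p2)
        return (p_o - p_e) / (1 - p_e)

    def simulate_kappa(base_rate, n=200_000, mu_case=2.0, noise_sd=1.0, threshold=1.0):
        # Latent symptom severity: N(0, 1) for noncases, N(mu_case, 1) for cases.
        case = rng.random(n) < base_rate
        severity = rng.normal(0.0, 1.0, n) + np.where(case, mu_case, 0.0)
        # Each rater thresholds an independently noisy reading of the severity,
        # so P(positive | severity) = Phi((severity - threshold) / noise_sd):
        # a probit-shaped diagnosability curve, fixed across base rates.
        r1 = severity + rng.normal(0.0, noise_sd, n) > threshold
        r2 = severity + rng.normal(0.0, noise_sd, n) > threshold
        return cohens_kappa(r1, r2)

    for base_rate in (0.05, 0.20, 0.50):
        print(f"base rate {base_rate:.2f}: kappa = {simulate_kappa(base_rate):.3f}")

The diagnostic process, and hence the diagnosability curve, is identical in all three conditions, yet the three printed kappas differ, illustrating the base-rate dependence that the criticisms target.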