On the evaluation of probability judgments: calibration, resolution, and monotonicity

Judgments of probability are commonly evaluated by two criteria: calibration, namely, the correspondence between stated confidence and rate of occurrence, and resolution, namely, the ability to distinguish between events that do and do not occur. Two representations of probability judgments are contrasted: the designated form, which presupposes a particular coding of outcomes (e.g., rain vs. no rain), and the inclusive form, which incorporates all events and their complements. It is shown that the indices of calibration and resolution derived from these representations measure different characteristics of judgment. Calibration is distinguished from two types of overconfidence: specific and generic. An ordinal measure of performance is proposed and compared to the standard measures in forecasts of recession and in both numerical and verbal assessments of general knowledge.

Much research on judgment under uncertainty has focused on the comparison of probability judgments with the corresponding relative frequency of occurrence. In a typical study, judges are presented with a series of prediction or knowledge problems and asked to assess the probability of the events in question. Judgments of probability or confidence are used both in research (Lichtenstein, Fischhoff, & Phillips, 1982; Wallsten & Budescu, 1983) and in practice. For example, weather forecasters often report the probability of rain (Murphy & Daan, 1985), and economists are sometimes called upon to estimate the chances of recession (Zarnowitz & Lambros, 1987). The two main criteria used to evaluate such judgments are calibration and resolution. A judge is said to be calibrated if his or her probability judgments match the corresponding relative frequency of occurrence. More specifically, consider all events to which the judge assigns a probability p; the judge is calibrated if the proportion of events in that class that actually occur equals p.
Calibration is a desirable property, especially for communication, but it does not ensure informativeness. A judge can be properly calibrated and entirely noninformative if, for example, he or she predicts the sex of each newborn with probability .50, which merely matches the overall base rate.
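The distinction between calibration and resolution can be sketched numerically with Murphy's decomposition of the Brier score (entries [16], [21], [24] below), in which reliability measures miscalibration and resolution measures discrimination. The function name and data below are illustrative, not from the paper:

```python
# Sketch of Murphy's decomposition of the Brier score for binary events.
# Judgments sharing the same stated probability form a bin; reliability
# (miscalibration) is 0 when each bin's hit rate equals its stated
# probability, and resolution is 0 when every bin's hit rate equals the
# overall base rate.
from collections import defaultdict

def brier_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty) for 0/1 outcomes."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        bins[p].append(o)
    reliability = sum(len(os) * (p - sum(os) / len(os)) ** 2
                      for p, os in bins.items()) / n
    resolution = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2
                     for p, os in bins.items()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

# A calibrated but noninformative judge: every forecast is the base rate.
rel, res, unc = brier_decomposition([0.5] * 10, [1, 0] * 5)
# rel = 0.0 and res = 0.0: perfectly calibrated, yet zero resolution.
```

The Brier score itself equals reliability − resolution + uncertainty, so the noninformative judge above scores 0.25, no better than the outcome variance alone.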

[1] F. Sanders, On Subjective Probability Forecasting, 1963.

[2] Linton C. Freeman, et al., Order-based statistics and monotonicity: A family of ordinal measures of association, 1986.

[3] L. A. Goodman, et al., Measures of Association for Cross Classifications. II: Further Discussion and References, 1959.

[4] David V. Budescu, et al., Encoding subjective probabilities: A psychological and psychometric review, 1983.

[5] B. Fischhoff, et al., Calibration of probabilities: The state of the art to 1980, 1982.

[6] Rami Zwick, et al., Comparing the calibration and coherence of numerical and verbal probability judgments, 1993.

[7] A. H. Murphy, et al., Diagnostic verification of probability forecasts, 1992.

[8] A. Tversky, et al., Judgment under Uncertainty: Heuristics and Biases, Science, 1974.

[9] Hubert M. Blalock, et al., Measurement in the Social Sciences, 1974.

[10] T. O. Nelson, et al., A comparison of current measures of the accuracy of feeling-of-knowing predictions, 1984.

[11] S. Oskamp, Overconfidence in case-study judgments, Journal of Consulting Psychology, 1965.

[12] F. Mosteller, et al., Quantifying Probabilistic Expressions, 1990.

[13] L. A. Goodman, et al., Measures of association for cross classifications, 1979.

[14] Robert H. Somers, A new asymmetric measure of association for ordinal variables, 1962.

[15] Robert Fildes, Journal of Business and Economic Statistics 5: Garcia-Ferrer, A., et al., Macroeconomic forecasting using pooled international data (1987), 53-67, 1988.

[16] A. H. Murphy, et al., Scalar and Vector Partitions of the Probability Score: Part I. Two-State Situation, 1972.

[17] Ilan Yaniv, et al., Measures of Discrimination Skill in Probabilistic Judgment, 1991.

[18] A. H. Murphy, et al., Probability, Statistics, and Decision Making in the Atmospheric Sciences, 1985.

[19] Jae-On Kim, et al., Predictive Measures of Ordinal Association, American Journal of Sociology, 1971.

[20] Ilan Yaniv, et al., A case study of expert judgment: Economists' probabilities versus base-rate model forecasts, 1992.

[21] G. Brier, Verification of Forecasts Expressed in Terms of Probability, 1950.

[22] G. Brier, et al., External correspondence: Decompositions of the mean probability score, 1982.

[23] T. P. Wilson, Measures of Association for Bivariate Ordinal Hypotheses, 1974.

[24] A. H. Murphy, A New Vector Partition of the Probability Score, 1973.

[25] R. Dawes, et al., Heuristics and Biases: Clinical versus Actuarial Judgment, 2002.

[26] Victor Zarnowitz, Rational Expectations and Macroeconomic Forecasts, 1985.