Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation

The areas beneath the relative (or receiver) operating characteristics (ROC) and relative operating levels (ROL) curves can be used as summary measures of forecast quality, but statistical significance tests for these areas are conducted infrequently in the atmospheric sciences. A development of signal‐detection theory, the ROC curve has been widely applied in medicine and psychology, where significance tests and relationships to other common statistical methods have been established and described. This valuable literature appears to be largely unknown in the atmospheric sciences, where applications of ROC and related techniques are becoming more common. This paper presents a survey of that literature with a focus on the interpretation of the ROC area in the field of forecast verification. We extend these foundations to demonstrate that similar principles can be applied to the interpretation and significance testing of the ROL area. It is shown that the ROC area is equivalent to the Mann–Whitney U‐statistic that tests whether forecast event probabilities for cases where events occurred differ from those for cases where events did not occur. A similar derivation shows that the ROL area is equivalent to the Mann–Whitney U‐statistic comparing the magnitude of events between cases where an event has been forecast and cases where it has not. Because the Mann–Whitney U‐statistic follows a known probability distribution, it can be used, under certain assumptions, to assess the statistical significance of ROC and ROL areas and to compare the areas of competing forecasts. For large samples the significance of either measure can be accurately assessed using a normal‐distribution approximation. Copyright © 2002 Royal Meteorological Society
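The equivalence between the ROC area and the Mann–Whitney U‐statistic described above can be illustrated with a minimal sketch. It is not taken from the paper: the function names and the sample probabilities are hypothetical, ties are counted as one half, and the normal approximation omits any tie correction. The sample is kept tiny for readability, whereas the approximation is intended for large samples.

```python
import math

def roc_area_from_u(event_probs, nonevent_probs):
    """Return (U, ROC area), where U counts event/non-event pairs in which
    the event case received the higher forecast probability (ties count 0.5)
    and the ROC area is U / (n1 * n0)."""
    n1, n0 = len(event_probs), len(nonevent_probs)
    u = 0.0
    for p1 in event_probs:
        for p0 in nonevent_probs:
            if p1 > p0:
                u += 1.0
            elif p1 == p0:
                u += 0.5
    return u, u / (n1 * n0)

def normal_approx_p_value(u, n1, n0):
    """One-sided p-value for U under the no-skill null hypothesis, using the
    normal approximation without a tie correction (an assumption here)."""
    mean = n1 * n0 / 2.0
    sd = math.sqrt(n1 * n0 * (n1 + n0 + 1) / 12.0)
    z = (u - mean) / sd
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical forecast probabilities, split by whether the event occurred.
event_probs = [0.9, 0.8, 0.6, 0.4]
nonevent_probs = [0.7, 0.5, 0.3, 0.2, 0.1]

u, area = roc_area_from_u(event_probs, nonevent_probs)
p = normal_approx_p_value(u, len(event_probs), len(nonevent_probs))
print(f"U = {u}, ROC area = {area:.2f}, approximate one-sided p = {p:.3f}")
```

In this example 17 of the 20 event/non-event pairs are ordered correctly, so the ROC area is 0.85. The same pair-counting could in principle be applied to the ROL area by grouping observed magnitudes according to whether an event was forecast.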
