Measuring and comparing the accuracy of species distribution models with presence–absence data

Species distribution models have been widely used to predict species distributions for various purposes, including conservation planning and climate change impact assessment. The success of these applications relies heavily on the accuracy of the models. Various measures have been proposed to assess model accuracy, and rigorous statistical analysis should be incorporated in accuracy assessment. However, because relevant information about the statistical properties of accuracy measures is scattered across various disciplines, ecologists find it difficult to select the most appropriate ones for their research. In this paper, we review accuracy measures that are currently used in species distribution modelling (SDM) and introduce additional metrics that have potential applications in SDM. For the commonly used measures (which have also been studied intensively by statisticians), including overall accuracy, sensitivity, specificity, kappa, and the area and partial area under the ROC curve, we give promising methods for constructing confidence intervals and for statistically comparing the accuracy of two models. For other accuracy measures, we give methods for estimating standard errors, which can be used to construct approximate confidence intervals. We also suggest that, as general tools, computer-intensive methods, especially bootstrap and randomization methods, can be used to construct confidence intervals and statistical tests when suitable analytic methods cannot be found. These computer-intensive methods usually provide robust results.
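
As an illustration of the threshold-dependent measures and the bootstrap approach mentioned above, the following is a minimal sketch rather than the procedure given in the paper. It assumes presence–absence observations and thresholded predictions coded as 1 and 0. The function names (`confusion_counts`, `accuracy_measures`, `bootstrap_ci`) are hypothetical, and the percentile interval is only one of several bootstrap variants that could be used.

```python
import random


def confusion_counts(observed, predicted):
    """Return (a, b, c, d): true positives, false positives,
    false negatives and true negatives for 0/1 coded data."""
    pairs = list(zip(observed, predicted))
    a = sum(1 for o, p in pairs if o == 1 and p == 1)
    b = sum(1 for o, p in pairs if o == 0 and p == 1)
    c = sum(1 for o, p in pairs if o == 1 and p == 0)
    d = sum(1 for o, p in pairs if o == 0 and p == 0)
    return a, b, c, d


def accuracy_measures(observed, predicted):
    """Overall accuracy, sensitivity, specificity and Cohen's kappa."""
    a, b, c, d = confusion_counts(observed, predicted)
    n = a + b + c + d
    nan = float("nan")
    overall = (a + d) / n
    sensitivity = a / (a + c) if (a + c) else nan
    specificity = d / (b + d) if (b + d) else nan
    # Agreement expected by chance, computed from the marginal totals.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    kappa = (overall - expected) / (1 - expected) if expected != 1 else nan
    return {
        "overall": overall,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


def bootstrap_ci(observed, predicted, measure="kappa",
                 n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (an assumed variant): resample
    sites with replacement and recompute the chosen measure."""
    rng = random.Random(seed)
    n = len(observed)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = accuracy_measures(
            [observed[i] for i in idx],
            [predicted[i] for i in idx],
        )[measure]
        if value == value:  # skip undefined (NaN) resamples
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


if __name__ == "__main__":
    # Made-up data: 40 presences and 60 absences at 100 sites.
    observed = [1] * 40 + [0] * 60
    predicted = [1] * 30 + [0] * 10 + [1] * 12 + [0] * 48
    print(accuracy_measures(observed, predicted))
    print(bootstrap_ci(observed, predicted, measure="kappa"))
```

Resampling whole sites, rather than observations and predictions separately, keeps each observed value paired with its prediction. The same resampled indices could in principle be applied to two models evaluated at the same sites to obtain an interval for the difference in a measure between them.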
