Comparison of statistical methods commonly used in predictive modelling

Abstract: Logistic Multiple Regression (LMR), Principal Components Regression (PCR) and Classification and Regression Tree Analysis (CART), all commonly used in GIS-based ecological modelling, are compared with a relatively new statistical technique, Multivariate Adaptive Regression Splines (MARS), to test their accuracy, reliability, implementation within GIS and ease of use. All four methods were applied to the same two data sets, which cover a wide range of conditions common in predictive modelling, namely geographical range, scale, nature of the predictors and sampling method. We ran two series of analyses to determine whether model validation against an independent data set was required or whether cross-validation on the learning data set sufficed. The results show that validation with independent data sets is needed. Model accuracy was evaluated using the area under the Receiver Operating Characteristics curve (AUC), because this measure summarizes performance across all possible thresholds and is independent of the balance between classes. MARS and Regression Tree Analysis achieved the best prediction success, although the CART model was difficult to use for cartographic purposes because of its high complexity.

Abbreviations: AUC = Area under the ROC curve; CART = Classification and Regression Trees; FN = False negative; FP = False positive; GAM = Generalized Additive Model; GIS = Geographic Information System; GLM = Generalized Linear Model; LMR = Logistic Multiple Regression; MARS = Multivariate Adaptive Regression Splines; NDVI = Normalized Difference Vegetation Index; PCR = Principal Components Regression; ROC = Receiver Operating Characteristics.
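The two properties of the AUC mentioned in the abstract (no decision threshold is needed, and the value does not depend on class balance) can be illustrated with a minimal sketch. The abstract does not say how the AUC was computed, so the pairwise (Mann-Whitney) formulation below is an assumption chosen for clarity, and the scores and presence/absence labels are hypothetical values invented for the illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (presence, absence) pairs in which the presence site receives
    the higher score; ties count as half a pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one presence and one absence")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical model outputs (predicted suitability) and observations
# (1 = presence, 0 = absence); not data from the study.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

balanced_auc = auc(scores, labels)

# Triple every absence record: the class ratio changes from 1:1 to 1:3,
# but the proportion of correctly ordered pairs stays the same.
extra_scores = [s for s, y in zip(scores, labels) if y == 0] * 2
unbalanced_auc = auc(scores + extra_scores, labels + [0] * len(extra_scores))

print(balanced_auc, unbalanced_auc)
```

No cut-off value appears anywhere in the calculation, which is why the measure summarizes all thresholds at once. Replicating the absence records leaves the result unchanged, which is the sense in which the AUC is independent of the balance between classes.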
