A Critical Review of Spatial Predictive Modeling Process in Environmental Sciences with Reproducible Examples in R

Spatial predictive methods are increasingly used to generate predictions across disciplines in the environmental sciences. The accuracy of these predictions is critical because they form the basis for environmental management and conservation. Improving accuracy by selecting an appropriate method and then developing the most accurate predictive model(s) is therefore essential. However, selecting an appropriate method and finding the most accurate predictive model for a given dataset is challenging because the modeling process involves many aspects and factors. Many previous studies considered only some of these aspects and factors, often leading to sub-optimal or even misleading predictive models. This study evaluates the spatial predictive modeling process and identifies nine major components of it. Each of the nine components is then reviewed, and guidelines are provided for selecting and applying relevant components and for developing accurate predictive models. Finally, reproducible examples using the R package spm demonstrate how to select and develop predictive models, based on predictive accuracy, using machine learning, geostatistics, and their hybrid methods; further reproducible examples show how to generate and visualize spatial predictions in environmental sciences.
