Predicting into unknown space? Estimating the area of applicability of spatial prediction models

Predictive modelling using machine learning has become very popular for spatial mapping of the environment. Models are often applied to make predictions far beyond the sampling locations, where new geographic locations may differ considerably from the training data in their environmental properties. However, areas in the predictor space without support of training data are problematic: since the model has no knowledge about these environments, predictions there have to be considered uncertain. It is therefore necessary to estimate the area to which a prediction model can be reliably applied. Here, we suggest a methodology that delineates the "area of applicability" (AOA), which we define as the area for which the cross-validation error of the model applies. We first propose a "dissimilarity index" (DI) that is based on the minimum distance to the training data in the predictor space, with predictors weighted by their respective importance in the model. The AOA is then derived by applying a threshold based on the DI of the training data, where the DI of each training point is calculated with respect to the cross-validation strategy used for model training. We test for the ideal threshold using simulated data, compare the prediction error within the AOA with the cross-validation error of the model, and illustrate the approach using a simulated case study. Our simulation study suggests setting the threshold that defines the AOA at the 0.95 quantile of the DI in the training data. Using this threshold, the prediction error within the AOA is comparable to the cross-validation RMSE of the model, while the cross-validation error does not apply outside the AOA. This holds for models trained with randomly distributed training data as well as for models trained with data that are clustered in space and evaluated with spatial cross-validation. We suggest reporting the AOA alongside predictions, complementary to validation measures.
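The DI and AOA steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract only states that the DI is a minimum distance in importance-weighted predictor space and that the threshold is the 0.95 quantile of the training DI; the z-score standardization, the Euclidean metric, and the normalization by the mean pairwise training distance are assumptions made here for illustration, and the function names, weights, and data are hypothetical.

```python
import numpy as np


def _pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def dissimilarity_index(train, new, weights, folds=None):
    """Return (DI of training points, DI of new points).

    Assumptions (not stated in the abstract): predictors are z-scored
    with the training mean/sd, distances are Euclidean, and distances
    are divided by the mean pairwise distance among training points.
    """
    train = np.asarray(train, dtype=float)
    new = np.asarray(new, dtype=float)
    w = np.asarray(weights, dtype=float)

    # Standardize with training statistics, then weight by importance.
    mu = train.mean(axis=0)
    sd = train.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant predictors
    train_s = (train - mu) / sd * w
    new_s = (new - mu) / sd * w

    d_train = _pairwise_dist(train_s, train_s)
    n = train_s.shape[0]
    mean_dist = d_train[np.triu_indices(n, k=1)].mean()

    # Training DI: nearest training point that is NOT in the same
    # cross-validation fold (each point is its own fold by default).
    folds = np.arange(n) if folds is None else np.asarray(folds)
    same_fold = folds[:, None] == folds[None, :]
    di_train = np.where(same_fold, np.inf, d_train).min(axis=1) / mean_dist

    # New-data DI: nearest training point overall.
    di_new = _pairwise_dist(new_s, train_s).min(axis=1) / mean_dist
    return di_train, di_new


def area_of_applicability(di_train, di_new, q=0.95):
    """Flag new points whose DI does not exceed the q-quantile of the training DI."""
    threshold = np.quantile(di_train, q)
    return di_new <= threshold, threshold


# Usage with made-up data: 50 training points, two predictors.
rng = np.random.default_rng(0)
train = rng.uniform(0.0, 1.0, size=(50, 2))
new = np.array([train[0], [10.0, 10.0]])  # one known point, one far away
importance = [0.7, 0.3]                   # hypothetical variable importance
fold_ids = np.arange(50) % 5              # hypothetical 5-fold assignment

di_train, di_new = dissimilarity_index(train, new, importance, fold_ids)
inside_aoa, threshold = area_of_applicability(di_train, di_new)
```

In this sketch a new point that coincides with a training point has a DI of zero and falls inside the AOA, while a point far outside the sampled predictor range exceeds the threshold. Passing spatial fold identifiers as `folds` makes the training DI, and hence the threshold, reflect the cross-validation strategy, as the abstract requires.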
