Importance of spatial predictor variable selection in machine learning applications - Moving from data reproduction to spatial prediction

Machine learning algorithms are frequently applied to the spatial prediction of biotic and abiotic environmental variables. However, the characteristics of spatial data, especially spatial autocorrelation, are widely ignored. We hypothesize that this is problematic and results in models that can reproduce training data but are unable to make spatial predictions beyond the locations of the training samples. We assume that not only spatial validation strategies but also spatial variable selection is essential for reliable spatial predictions. We introduce two case studies that use remote sensing to predict land cover and the leaf area index for the "Marburg Open Forest", an open research and education site of Marburg University, Germany. We use the machine learning algorithm Random Forests to train models with non-spatial and spatial cross-validation strategies and to examine how spatial variable selection affects the predictions. Our findings confirm that spatial cross-validation is essential for preventing overoptimistic estimates of model performance. We further show that highly autocorrelated predictors (such as geolocation variables, e.g. latitude and longitude) can lead to considerable overfitting and result in models that reproduce the training data but fail to make spatial predictions. The problem becomes apparent in a visual assessment of the spatial predictions, which show clear artefacts that can be traced back to the algorithm's misinterpretation of the spatially autocorrelated predictors. Spatial variable selection could automatically detect and remove the variables that lead to overfitting, resulting in reliable spatial prediction patterns and improved statistical spatial model performance. We conclude that, in addition to spatial validation, spatial variable selection must be considered in spatial predictions of ecological data to produce reliable predictions.
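
The two ideas named in the abstract, spatial cross-validation and spatial variable selection, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the square-grid blocking, the round-robin assignment of blocks to folds, the function names (`spatial_block_folds`, `train_test_splits`, `forward_select`) and the `min_gain` stopping rule are all assumptions made for illustration. The selection routine is a simplified greedy variant that expects a user-supplied `score` callable, which is assumed to return a spatially cross-validated performance value where higher is better.

```python
import math


def spatial_block_folds(coords, block_size, k):
    """Assign samples to k folds so that all samples inside the same
    square spatial block share a fold (illustrative blocking scheme).

    coords: list of (x, y) tuples in projected units (e.g. metres).
    block_size: edge length of a square block, in the same units.
    Returns a list with one fold index per sample.
    """
    # Map each sample to the grid block that contains it.
    block_of = [(math.floor(x / block_size), math.floor(y / block_size))
                for x, y in coords]
    # Distribute the distinct blocks over the folds in round-robin order.
    blocks = sorted(set(block_of))
    fold_of_block = {b: i % k for i, b in enumerate(blocks)}
    return [fold_of_block[b] for b in block_of]


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for f in sorted(set(folds)):
        test = [i for i, g in enumerate(folds) if g == f]
        train = [i for i, g in enumerate(folds) if g != f]
        yield train, test


def forward_select(candidates, score, min_gain=0.0):
    """Greedy forward selection driven by a cross-validated score.

    Adds the predictor that most improves `score` and stops as soon as
    no remaining candidate improves it by more than `min_gain`.
    """
    selected, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        trials = [(score(selected + [c]), c) for c in remaining]
        top_score, top = max(trials)
        if top_score - best <= min_gain:
            break  # No candidate helps under spatial validation.
        selected.append(top)
        remaining.remove(top)
        best = top_score
    return selected, best
```

In practice, `score` would train a model (for example a Random Forest) on each training split from `train_test_splits` and average the performance on the held-out blocks. A predictor such as latitude, which only helps the model reproduce nearby training samples, would then be expected to lower the held-out score and so would not be selected.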
