A note on knowledge discovery and machine learning in digital soil mapping

In digital soil mapping, machine learning (ML) techniques are being used to infer a relationship between a soil property and the covariates. The information derived from this process is often translated into pedological knowledge. This mechanism is referred to as knowledge discovery. This study shows that knowledge discovery based on ML must be treated with caution. We show how pseudo-covariates can be used to accurately predict soil organic carbon in a hypothetical case study. We demonstrate that ML methods can find relevant patterns even when the covariates are meaningless and not related to soil-forming factors and processes. We argue that pattern recognition for prediction should not be equated with knowledge discovery. Knowledge discovery requires more than the recognition of patterns and successful prediction. It requires the pre-selection and preprocessing of pedologically relevant environmental covariates and the posterior interpretation and evaluation of the recognized patterns. We argue that important ML covariates could serve the purpose of providing elements to postulate hypotheses about soil processes that, once validated through experiments, could result in new pedological knowledge.

Highlights:
We discuss the rationale of knowledge discovery based on the most important machine learning covariates.
We use pseudo-covariates to predict topsoil organic carbon with random forest.
Soil organic carbon was accurately predicted in a hypothetical case study.
Pattern recognition by random forest should not be equated to knowledge discovery.
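The central claim, that a learner can predict a spatially structured soil property from covariates with no pedological meaning, can be illustrated with a small synthetic sketch. Everything in it is an assumption made for illustration and not the study's data or method: the "soil organic carbon" surface is an invented smooth function of location, the pseudo-covariates are distances to three arbitrary points, and a k-nearest-neighbour regressor stands in for random forest so that the example needs only the Python standard library. A set of pure-noise covariates is included for contrast.

```python
import math
import random


def r_squared(observed, predicted):
    """Coefficient of determination on held-out data."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def knn_predict(train_x, train_y, query, k=5):
    """Mean target of the k training rows closest to `query` in covariate space.

    Stand-in learner (assumption): the paper uses random forest, not k-NN.
    """
    nearest = sorted((math.dist(row, query), y) for row, y in zip(train_x, train_y))
    return sum(y for _, y in nearest[:k]) / k


def run_experiment(seed=42, n_side=20):
    """Return hold-out R^2 for meaningless-but-spatial and pure-noise covariates."""
    rng = random.Random(seed)

    # Regular grid of locations on the unit square.
    coords = [(i / (n_side - 1), j / (n_side - 1))
              for i in range(n_side) for j in range(n_side)]

    # Hypothetical SOC surface: smooth in space plus small noise (invented).
    soc = [2.0 + math.sin(3 * x) + math.cos(2 * y) + rng.gauss(0, 0.05)
           for x, y in coords]

    # Pseudo-covariates: distances to three arbitrary, non-collinear points.
    # They carry no soil-forming information, only spatial position.
    anchors = [(0.1, 0.9), (0.8, 0.2), (0.3, 0.1)]
    pseudo = [tuple(math.dist(c, a) for a in anchors) for c in coords]

    # Contrast: covariates with no spatial structure at all.
    noise = [tuple(rng.random() for _ in anchors) for _ in coords]

    # Random 70/30 split into calibration and validation points.
    idx = list(range(len(coords)))
    rng.shuffle(idx)
    cut = int(0.7 * len(idx))
    train, test = idx[:cut], idx[cut:]

    results = {}
    for name, cov in (("pseudo", pseudo), ("noise", noise)):
        train_x = [cov[i] for i in train]
        train_y = [soc[i] for i in train]
        predicted = [knn_predict(train_x, train_y, cov[i]) for i in test]
        results[name] = r_squared([soc[i] for i in test], predicted)
    return results


if __name__ == "__main__":
    for name, score in run_experiment().items():
        print(f"{name} covariates: hold-out R^2 = {score:.2f}")
```

In this toy setup the distance-based pseudo-covariates validate well only because they encode location and the target is spatially smooth, while the noise covariates do not. A good validation score here says nothing about soil-forming processes, which is the distinction between pattern recognition and knowledge discovery drawn above.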
