Comparison of three supervised learning methods for digital soil mapping: Application to a complex terrain in the Ecuadorian Andes

A digital soil mapping approach is applied to a complex, mountainous terrain in the Ecuadorian Andes. Relief features are derived from a digital elevation model and used as predictors for the topsoil texture classes sand, silt, and clay. The performance of three statistical learning methods is compared: linear regression, random forest, and stochastic gradient boosting of regression trees. In linear regression, a stepwise backward variable selection procedure is applied and overfitting is controlled by minimizing Mallows' Cp. For random forest and boosting, the effect of predictor selection and tuning procedures is assessed. One hundred repetitions of a 5-fold cross-validation of the selected modelling procedures are employed for validation, uncertainty assessment, and method comparison. Absolute model performance is assessed by comparing the prediction error of the selected method with that obtained when the mean is used as the prediction. Boosting performs best, providing predictions that are reliably better than the mean. The median reduction of the root mean square error is around 5%. Elevation is the most important predictor. All models clearly distinguish ridges and slopes. The predicted texture patterns are interpreted as the result of catena sequences (eluviation of fine particles on slope shoulders) and landslides (mixing of mineral soil horizons on slopes).
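The validation scheme described in the abstract (repeated 5-fold cross-validation, with the mean as a reference prediction) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the data are synthetic, the single predictor `elevation` and the response `sand` are hypothetical values, and a one-predictor least-squares fit stands in for the linear regression, random forest, and boosting models compared in the paper. The helper names `rmse`, `fit_simple_linear`, and `repeated_cv_rmse_reduction` are likewise invented here.

```python
import math
import random
import statistics


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_simple_linear(x, y):
    """Ordinary least squares with one predictor (stand-in for the paper's models)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda xi: intercept + slope * xi


def repeated_cv_rmse_reduction(x, y, n_repeats=100, n_folds=5, seed=0):
    """Return one relative RMSE reduction (model vs. training mean) per repetition."""
    rng = random.Random(seed)
    n = len(x)
    reductions = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # new random fold assignment in every repetition
        folds = [idx[k::n_folds] for k in range(n_folds)]
        model_pred = [0.0] * n
        mean_pred = [0.0] * n
        for test in folds:
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            model = fit_simple_linear([x[i] for i in train], [y[i] for i in train])
            train_mean = statistics.fmean(y[i] for i in train)
            for i in test:
                model_pred[i] = model(x[i])   # prediction from the fitted model
                mean_pred[i] = train_mean     # reference: mean of the training folds
        reductions.append(1.0 - rmse(y, model_pred) / rmse(y, mean_pred))
    return reductions


# Synthetic example data (hypothetical values, not taken from the study).
data_rng = random.Random(42)
elevation = [data_rng.uniform(1900.0, 3100.0) for _ in range(100)]
sand = [20.0 + 0.01 * e + data_rng.gauss(0.0, 5.0) for e in elevation]

reductions = repeated_cv_rmse_reduction(elevation, sand)
print("median relative RMSE reduction:", statistics.median(reductions))
```

A positive median indicates that the model predicts held-out points better than the training mean does; the spread of the per-repetition values gives a rough picture of how reliable that improvement is.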
