Machine learning for predicting soil classes in three semi-arid landscapes

Abstract

Mapping the spatial distribution of soil taxonomic classes is important for informing soil use and management decisions. Digital soil mapping (DSM) can quantitatively predict the spatial distribution of soil taxonomic classes. Key components of DSM are the method and the set of environmental covariates used to predict soil classes. Machine learning is a general term for a broad set of statistical modeling techniques. Many different machine learning models have been applied in the literature, and there are different approaches for selecting covariates for DSM. However, there is little guidance as to which, if any, machine learning model and covariate set might be optimal for predicting soil classes across different landscapes. Our objective was to compare multiple machine learning models and covariate sets for predicting soil taxonomic classes at three geographically distinct areas in the semi-arid western United States of America (southern New Mexico, southwestern Utah, and northeastern Wyoming). All three areas were the focus of digital soil mapping studies. Sampling sites at each study area were selected using conditioned Latin hypercube sampling (cLHS). We compared models that had been used in other DSM studies, including clustering algorithms, discriminant analysis, multinomial logistic regression, neural networks, tree-based methods, and support vector machine classifiers. The tested machine learning models were divided into three groups based on model complexity: simple, moderate, and complex. We also compared environmental covariates derived from digital elevation models and Landsat imagery, divided into three sets: 1) covariates selected a priori by soil scientists familiar with each area and used as input into cLHS, 2) the covariates in set 1 plus 113 additional covariates, and 3) covariates selected using recursive feature elimination. Overall, complex models were consistently more accurate than simple or moderately complex models.
Random forest (RF) models using covariates selected via recursive feature elimination were consistently the most accurate classifiers, or among the most accurate, across study areas and across covariate sets within each study area. For soil taxonomic class prediction, we recommend using complex models with covariates selected by recursive feature elimination. Overall classification accuracy in each study area was largely dependent upon the number of soil taxonomic classes and the frequency distribution of pedon observations among taxonomic classes. Individual subgroup class accuracy was generally dependent upon the number of soil pedon observations in each taxonomic class. The number of soil classes is related to the inherent variability of a given area. The imbalance of soil pedon observations among classes is likely related to cLHS. Imbalanced frequency distributions of soil pedon observations among classes must be addressed to improve model accuracy. Solutions include increasing the number of soil pedon observations in classes with few observations or decreasing the number of classes. Spatial predictions using the most accurate models generally agree with expected soil–landscape relationships. Spatial prediction uncertainty was lowest in areas of relatively low relief for each study area.
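
The dependence of accuracy on class frequencies can be illustrated with a small sketch. It is not the study's evaluation code: the class labels, the counts, and the helper `accuracy_summary` are hypothetical and chosen only to show how overall accuracy can look acceptable while classes with few pedon observations are never predicted correctly.

```python
from collections import Counter


def accuracy_summary(observed, predicted):
    """Return (overall accuracy, per-class accuracy) for paired label lists.

    Per-class accuracy here is the fraction of observations of each
    observed class that were predicted correctly.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    totals = Counter(observed)  # number of observations per class
    correct = Counter(o for o, p in zip(observed, predicted) if o == p)
    overall = sum(correct.values()) / len(observed)
    per_class = {cls: correct[cls] / n for cls, n in totals.items()}
    return overall, per_class


# Hypothetical, imbalanced set of pedon observations:
# one dominant class and two classes with a single observation each.
observed = ["class_A"] * 8 + ["class_B"] + ["class_C"]

# A hypothetical classifier that always predicts the dominant class.
predicted = ["class_A"] * 10

overall, per_class = accuracy_summary(observed, predicted)
print(f"overall accuracy: {overall:.2f}")
for cls, acc in sorted(per_class.items()):
    print(f"{cls}: {acc:.2f}")
```

In this made-up case, overall accuracy is driven entirely by the dominant class, which is why reporting per-class accuracy alongside overall accuracy matters when observations are unevenly distributed among classes.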
