More Data or a Better Model? Figuring Out What Matters Most for the Spatial Prediction of Soil Carbon

Modeling techniques used in digital soil carbon mapping encompass a variety of algorithms that address spatial prediction problems such as spatial non-stationarity, nonlinearity, and multicollinearity. A given study site can exhibit one or more of these problems, necessitating a combination of statistical learning algorithms to improve the accuracy of predictions. The training sample size may also affect the accuracy of model predictions, but the effect of varying sample size on model accuracy has not been widely studied in pedometrics. To help fill this gap, we examined the behavior of multiple linear regression (MLR), geographically weighted regression (GWR), linear mixed models (LMMs), Cubist regression trees, quantile regression forests (QRFs), and extreme learning machine regression (ELMR) under varying sample sizes. The results showed that, for the study site in the Hunter Valley, Australia, the accuracy of spatial prediction of soil carbon is more sensitive to training sample size than to the type of model used. Prediction accuracy initially increases exponentially with sample size and eventually reaches a plateau. Different models reach their maximum predictive potential at different sample sizes. Furthermore, the uncertainty of model predictions decreases as the training sample size increases.
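The sample-size experiment described above can be illustrated with a minimal learning-curve sketch. It is not the study's implementation: the data are synthetic (one hypothetical covariate with Gaussian noise, not the Hunter Valley observations), the model is a single-covariate least-squares line standing in for MLR, and the function names, sample sizes, and repeat count are assumptions chosen for illustration. The model is refit on repeated random training sets of increasing size and scored on a fixed validation set. The mean RMSE is reported as a measure of accuracy, and its spread across repeats as a rough proxy for how variable the result is at each size.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model, xs, ys):
    """Root mean squared error of a fitted line on (xs, ys)."""
    slope, intercept = model
    squared = [(intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5


def learning_curve(sizes, repeats=200, seed=0):
    """Mean and spread of validation RMSE for each training sample size."""
    rng = random.Random(seed)

    def draw(n):
        # Synthetic stand-in for covariate/soil-carbon pairs (assumption).
        xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
        ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
        return xs, ys

    val_x, val_y = draw(1000)  # fixed validation set shared by all sizes
    curve = {}
    for n in sizes:
        errors = []
        for _ in range(repeats):
            train_x, train_y = draw(n)
            errors.append(rmse(fit_line(train_x, train_y), val_x, val_y))
        curve[n] = (statistics.fmean(errors), statistics.stdev(errors))
    return curve


curve = learning_curve([5, 10, 20, 50, 100, 200, 400])
for n, (mean_err, spread) in curve.items():
    print(f"n={n:4d}  mean RMSE={mean_err:.3f}  spread={spread:.3f}")
```

In this toy setting the mean RMSE falls quickly at small sample sizes and then levels off near the noise floor, while the spread across repeats shrinks. Comparing model types would mean running the same loop with each model in place of `fit_line`.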
