Exploring the novel support points-based split method on a soil dataset

Abstract: Data splitting is an integral step in machine learning that helps ensure good model generalization. The novel support points-based split method has been evaluated on several datasets (e.g. the Iris dataset) and has been shown to be more promising than conventional methods (e.g. the random data split). However, this method has never been applied in soil-based research. Therefore, the current study compared the root mean square error (RMSE) of soil organic carbon (SOC) predictions obtained with the conventional random split and with the novel support points-based split. For both methods, the data were partitioned into train and test sets at four train/test percentage ratios: 60/40, 70/30, 75/25 and 80/20. In general, the test RMSE results were comparable across the two split methods and across the four ratios. Nonetheless, the novel method is more reliable and robust because it performs the split iteratively and uses control points to establish an optimal data partition.
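The comparison protocol described in the abstract can be illustrated with a minimal sketch of its random-split baseline only. Everything here is an assumption made for illustration: the data are synthetic placeholders rather than the study's soil measurements, the model is a one-predictor least-squares line rather than whatever learner the study used, and the helper names (`random_split`, `fit_line`, `evaluate_ratios`) are hypothetical. The support points-based split is not implemented here.

```python
import math
import random


def random_split(n_samples, train_fraction, seed):
    """Shuffle row indices and cut them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x (stand-in model, an assumption)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def evaluate_ratios(xs, ys, train_fractions=(0.60, 0.70, 0.75, 0.80), seed=0):
    """Return the test RMSE for each train fraction under a random split."""
    results = {}
    for fraction in train_fractions:
        train_idx, test_idx = random_split(len(xs), fraction, seed)
        intercept, slope = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        predictions = [intercept + slope * xs[i] for i in test_idx]
        results[fraction] = rmse([ys[i] for i in test_idx], predictions)
    return results


# Synthetic placeholder data (NOT the study's soil dataset).
rng = random.Random(42)
x = [rng.uniform(0.0, 1.0) for _ in range(200)]
y = [1.5 + 2.0 * xi + rng.gauss(0.0, 0.2) for xi in x]

test_rmse = evaluate_ratios(x, y)
for fraction, value in test_rmse.items():
    train_pct = round(fraction * 100)
    print(f"{train_pct}/{100 - train_pct} split: test RMSE = {value:.3f}")
```

Swapping `random_split` for a function that returns support points-based train and test indices would give the other arm of the comparison; repeating the random split over several seeds would also show how much its test RMSE varies from one partition to the next.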
