Addressing the issue of digital mapping of soil classes with imbalanced class observations

Abstract: Because soil classes are unevenly distributed across a landscape, imbalanced class observations are an important modeling issue in soil class mapping. When the number of observations differs strongly among soil classes, predictive models tend to underestimate or lose the minority classes and overestimate the majority classes. As a result, areas represented by comparatively few soil profile observations can go unmapped in digital soil maps. To address this problem, this paper investigated the usefulness of two data pretreatment techniques, over- and under-sampling, applied to three predictive models: decision trees (DT), random forest (RF), and multinomial logistic regression (MNLR). The study area is in northwestern Iran, with 452 soil profile observations on a regular grid covering about 12,000 ha. The area contains 8 USDA soil great groups with an imbalanced frequency distribution. Modeling with the imbalanced class observations produced uncertain maps with relatively poor accuracies, in which minority classes were lost. After data treatment with over- and under-sampling, all models showed significant improvement in maintaining the minority classes in both calibration and validation evaluations. Balancing the classes notably decreased the uncertainty of all three models by lowering the confusion index and raising the probability of occurrence of the soil classes in the final maps. Of the three models, DT showed the largest calibration and validation accuracies both with and without data treatment, while RF overestimated some of the majority classes. Data resampling can therefore be a useful way of dealing with imbalanced class observations to produce more certain digital soil maps.
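The balancing step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the resampling algorithm, so this example assumes plain random resampling: classes with more observations than a common target are randomly under-sampled, and classes with fewer are over-sampled by duplicating randomly chosen observations. The function name `resample_balanced`, the `target` parameter, the default target, and the toy class labels are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def resample_balanced(samples, labels, target=None, seed=0):
    """Randomly over- or under-sample every class to `target` observations."""
    rng = random.Random(seed)

    # Group the observations by their class label.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    if target is None:
        # Assumption: default to the middle (upper median) class size.
        sizes = sorted(len(xs) for xs in by_class.values())
        target = sizes[len(sizes) // 2]

    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target:
            # Under-sampling: keep a random subset, without replacement.
            chosen = rng.sample(xs, target)
        else:
            # Over-sampling: keep all originals and add random duplicates.
            chosen = xs + rng.choices(xs, k=target - len(xs))
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y


# Hypothetical toy data: observation ids labelled with made-up class names.
labels = ["A"] * 50 + ["B"] * 12 + ["C"] * 3
samples = list(range(len(labels)))

bal_x, bal_y = resample_balanced(samples, labels, target=12)
print(Counter(bal_y))  # class counts after balancing
```

Resampling of this kind is usually restricted to the calibration data, so that validation still reflects the original class frequencies; whether the study followed that practice is not stated in the abstract.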
