Efficiency of Extreme Gradient Boosting for Imbalanced Land Cover Classification Using an Extended Margin and Disagreement Performance

Imbalanced learning is a methodological challenge in the remote sensing community, especially in complex areas where land covers are spectrally similar. Obtaining high-confidence classification results under class imbalance is important in practice. In this paper, extreme gradient boosting (XGB), a novel tree-based ensemble system, is employed to classify land cover types in very-high-resolution (VHR) images with imbalanced training data. We introduce an extended margin criterion and disagreement performance to evaluate the efficiency of XGB in imbalanced learning situations and examine the effect of minority class spectral separability on model performance. The results suggest that the uncertainty of XGB associated with correct classification is stable. The average probability-based margin of correct classification provided by XGB is 0.82, about 46.30% higher than that of the random forest (RF) method (0.56). Moreover, the performance uncertainty of XGB is insensitive to spectral separability once the sample imbalance reaches a certain level (minority:majority > 10:100). The impact of sample imbalance on the minority class is also related to its spectral separability, and XGB performs better than RF in terms of user accuracy for a minority class with imperfect separability. The disagreement components of XGB are better and more stable than those of RF with imbalanced samples, especially for complex areas with more land cover types. In addition, an appropriate degree of sample imbalance helps to improve the trade-off between the recognition accuracy of XGB and the sample cost. According to our analysis, the margin-based uncertainty assessment and disagreement performance can help users identify the confidence level and error components when classification performance (overall, producer, and user accuracies) is similar.
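The two evaluation ideas named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition of the extended margin, so the code below assumes the common supervised probability margin (the predicted probability of the true class minus the largest probability among the other classes), and it assumes the disagreement components are the quantity and allocation disagreement computed from a confusion matrix with predicted classes in rows and reference classes in columns. The function names `probability_margin` and `disagreement_components` are hypothetical and chosen only for illustration.

```python
import numpy as np


def probability_margin(proba, y_true):
    """Supervised probability margin per sample (an assumed definition).

    proba  : (n_samples, n_classes) class probabilities from any classifier.
    y_true : (n_samples,) integer class indices of the reference labels.
    Returns values in [-1, 1]; a positive value means a correct prediction.
    """
    proba = np.asarray(proba, dtype=float)
    y_true = np.asarray(y_true)
    idx = np.arange(len(y_true))
    p_true = proba[idx, y_true]          # probability given to the true class
    masked = proba.copy()
    masked[idx, y_true] = -np.inf        # hide the true class
    p_other = masked.max(axis=1)         # strongest competing class
    return p_true - p_other


def disagreement_components(confusion):
    """Quantity and allocation disagreement from a square confusion matrix.

    Assumes rows are predicted classes and columns are reference classes.
    The two components sum to 1 minus the overall accuracy.
    """
    p = np.asarray(confusion, dtype=float)
    p = p / p.sum()                      # convert counts to proportions
    row = p.sum(axis=1)                  # predicted class proportions
    col = p.sum(axis=0)                  # reference class proportions
    diag = np.diag(p)                    # agreement per class
    quantity = 0.5 * np.abs(row - col).sum()
    allocation = 0.5 * (2.0 * np.minimum(row - diag, col - diag)).sum()
    return quantity, allocation
```

Under these assumptions, the average margin of correct classification would be the mean of the margins that are positive, for example `m = probability_margin(proba, y_true); m[m > 0].mean()`, where `proba` could come from any classifier that outputs class probabilities.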
