LR-SMOTE - An improved unbalanced data set oversampling based on K-means and SVM

Abstract: Machine learning classification algorithms are now widely used, and one of the main problems they face is unbalanced data sets. Most classification algorithms are not sensitive to the minority class in an unbalanced data set, so such data sets are difficult to classify accurately. The same class imbalance arises in loose particle detection for sealed electronic components: the signals generated by internal components always outnumber the signals generated by loose particles, which easily leads to misclassification. To classify unbalanced data sets more accurately, this paper proposes the LR-SMOTE algorithm, which builds on the traditional SMOTE oversampling algorithm. LR-SMOTE places newly generated samples close to the sample center and avoids generating outlier samples or changing the distribution of the data set. Experiments were carried out on four UCI public data sets and six self-built data sets. The unmodified data sets were balanced with the LR-SMOTE and SMOTE algorithms and then classified with the random forest and support vector machine algorithms. The experimental results show that LR-SMOTE outperforms SMOTE in terms of G-means, F-measure, and AUC.
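For readers unfamiliar with the baseline, the following is a minimal sketch of classic SMOTE interpolation together with the G-mean metric, assuming NumPy is available. It is not the LR-SMOTE procedure itself, whose details the abstract does not give. The function names `smote` and `g_mean`, the parameters `n_new`, `k`, and `seed`, and the choice of G-mean as the square root of sensitivity times specificity are illustrative assumptions.

```python
import numpy as np


def smote(minority, n_new, k=5, seed=0):
    """Sketch of classic SMOTE (not LR-SMOTE): each synthetic point lies on the
    segment between a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    n = len(X)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances; a sample is never its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base sample and a random neighbour for every new point.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate with a random gap in [0, 1).
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[chosen] - X[base])


def g_mean(y_true, y_pred, positive=1):
    """G-mean assumed here as sqrt(sensitivity * specificity) for a binary problem."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive
    sensitivity = np.mean(y_pred[pos] == positive)
    specificity = np.mean(y_pred[~pos] != positive)
    return float(np.sqrt(sensitivity * specificity))
```

Because every synthetic point is a convex combination of two existing minority samples, plain SMOTE can still place points near outliers or class boundaries, which is the behaviour the abstract says LR-SMOTE is intended to avoid.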
