Classification of Imbalanced Data Represented as Binary Features

Typically, classification is conducted on datasets consisting of numerical features and target classes. For instance, a grayscale image, usually represented as a matrix of integers ranging from 0 to 255, allows various classification algorithms to be applied directly to image classification tasks. In contrast, datasets represented as binary features cannot make optimal use of many standard machine learning algorithms, yet such datasets are far from rare. Meanwhile, oversampling algorithms such as the synthetic minority oversampling technique (SMOTE) and its variants are often used when the dataset for classification is imbalanced. However, since SMOTE and its variants synthesize new minority samples by interpolating between original samples, the diversity of samples synthesized from binary features is highly limited because the original features are poorly representative. To solve this problem, a preprocessing approach is studied. By converting binary features into numerical ones with feature extraction methods, subsequent oversampling methods can realize their full potential in improving classifier performance. Comprehensive experiments on benchmark datasets and real medical datasets showed that a converted dataset consisting of numerical features is better suited to oversampling methods (maximum improvements of accuracy and F1-score were 35.11% and 42.17%, respectively). In addition, it was confirmed that feature extraction and oversampling contribute synergistically to the improvement of classification performance.
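The following sketch illustrates the general idea of this preprocessing pipeline; it is not the authors' exact implementation. It assumes scikit-learn and imbalanced-learn, uses PCA as one possible feature extraction method and SMOTE as the oversampling method, and generates a placeholder imbalanced binary-feature dataset for demonstration.

```python
# Illustrative sketch (assumed pipeline, not the paper's exact code):
# binary features -> feature extraction (PCA) -> SMOTE oversampling -> classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)

# Placeholder imbalanced dataset of binary features (900 majority / 100 minority).
X = rng.integers(0, 2, size=(1000, 50)).astype(float)
y = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

# Step 1: feature extraction maps binary features to continuous ones.
pca = PCA(n_components=10, random_state=0).fit(X_train)
X_train_num = pca.transform(X_train)
X_test_num = pca.transform(X_test)

# Step 2: oversample the minority class in the continuous space, where
# SMOTE's interpolation can produce more diverse synthetic samples than
# it would on the raw binary features.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train_num, y_train)

# Step 3: train and evaluate a standard classifier on the balanced data.
clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print("F1-score:", f1_score(y_test, clf.predict(X_test_num)))
```

Other feature extraction methods (e.g., ICA, t-SNE, or UMAP) and SMOTE variants (e.g., Borderline-SMOTE or ADASYN) can be substituted for PCA and SMOTE in the same pipeline.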
