Oversampling technique based on fuzzy representativeness difference for classifying imbalanced data

The class imbalance problem poses a difficulty for learning algorithms in pattern classification. Oversampling is one of the most widely used approaches to this problem, but most oversampling techniques use the sample size ratio as the imbalance criterion. This paper proposes FRDOAC, a fuzzy representativeness difference-based oversampling technique that uses affinity propagation and the chromosome theory of inheritance. The fuzzy representativeness difference (FRD) is adopted as a new imbalance metric, which focuses on the importance of samples rather than their number. FRDOAC first finds the representative samples of each class using affinity propagation. Second, the fuzzy representativeness of every sample is calculated using the Mahalanobis distance. Finally, synthetic positive samples are generated according to the chromosome theory of inheritance until the fuzzy representativeness difference between the two classes is small. A thorough experimental study on 16 benchmark datasets shows that our method outperforms other advanced imbalanced classification algorithms on various evaluation metrics.
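The three steps can be illustrated with a rough Python sketch. It is not the authors' implementation: the abstract gives no formulas, so several pieces here are assumptions. The membership function exp(-d), the per-feature crossover used as a stand-in for chromosome inheritance, the stopping tolerance `tol`, and the cap `max_new` are all hypothetical, and the class means stand in for the representatives that affinity propagation would supply.

```python
import numpy as np

def mahalanobis_to_nearest(X, reps, cov_inv):
    """Mahalanobis distance from each row of X to its nearest representative."""
    diff = X[:, None, :] - reps[None, :, :]              # shape (n, r, d)
    d2 = np.einsum("nrd,de,nre->nr", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0)).min(axis=1)

def fuzzy_representativeness(X, reps, cov_inv):
    """ASSUMED membership: exp(-distance), so values lie in (0, 1]."""
    return np.exp(-mahalanobis_to_nearest(X, reps, cov_inv))

def crossover(parent_a, parent_b, rng):
    """ASSUMED inheritance model: each feature is copied from one of two parents."""
    mask = rng.random(parent_a.shape[0]) < 0.5
    return np.where(mask, parent_a, parent_b)

def oversample(X_pos, X_neg, reps_pos, reps_neg, tol=0.5, max_new=1000, seed=0):
    """Add synthetic positives until the representativeness gap is at most tol."""
    rng = np.random.default_rng(seed)
    inv_pos = np.linalg.pinv(np.cov(X_pos, rowvar=False))
    inv_neg = np.linalg.pinv(np.cov(X_neg, rowvar=False))
    total_neg = fuzzy_representativeness(X_neg, reps_neg, inv_neg).sum()
    total_pos = fuzzy_representativeness(X_pos, reps_pos, inv_pos).sum()
    synthetic = []
    # max_new is a hypothetical safety cap that guarantees termination.
    while total_neg - total_pos > tol and len(synthetic) < max_new:
        i, j = rng.choice(len(X_pos), size=2, replace=False)
        child = crossover(X_pos[i], X_pos[j], rng)
        synthetic.append(child)
        total_pos += fuzzy_representativeness(child[None, :], reps_pos, inv_pos)[0]
    return np.array(synthetic).reshape(-1, X_pos.shape[1])

# Toy data: 100 negative and 10 positive samples in two dimensions.
rng = np.random.default_rng(42)
X_neg = rng.normal(0.0, 1.0, size=(100, 2))
X_pos = rng.normal(3.0, 1.0, size=(10, 2))

# Class means stand in for affinity propagation exemplars (an assumption).
new_pos = oversample(
    X_pos, X_neg,
    X_pos.mean(axis=0, keepdims=True),
    X_neg.mean(axis=0, keepdims=True),
)
```

In this sketch the loop stops on the summed representativeness of each class rather than on the sample counts, which is the point the abstract makes about FRD: a class whose samples lie close to its representatives needs fewer synthetic samples than the size ratio alone would suggest.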
