Local distribution-based adaptive minority oversampling for imbalanced data classification

Abstract Imbalanced data classification is a challenging task that has drawn significant interest in numerous scientific areas. One popular strategy for balancing the number of instances between two classes is oversampling by generating synthetic instances. However, this strategy still faces two key questions: where synthetic instances should be generated, and how many. In this paper, we propose a Local distribution-based Adaptive Minority Oversampling method (LAMO) to deal with the imbalanced classification problem. LAMO first identifies the informative borderline minority instances as sampling seeds according to their neighbors and the corresponding class distribution. Then, LAMO captures the local distribution of each seed according to its Euclidean distances from the nearest majority instance and the nearest minority instance. Finally, LAMO generates synthetic instances around the seeds via a Gaussian Mixture Model (GMM). For each component of the GMM, the mixing coefficient and bandwidth are set adaptively with the aid of the seeds’ local distribution. Extensive experiments have been conducted on both simulated and real data sets under varying imbalance ratios and data sizes. Compared with state-of-the-art oversampling methods, the proposed LAMO obtains promising results in terms of several widely used evaluation metrics.
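The three steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formulas, so the function name `lamo_sketch`, the neighborhood size `k`, the seed rule (a minority instance with some but not all majority neighbors), the mixing weight (proportional to the inverse distance to the nearest majority instance), and the bandwidth (one third of the smaller of the two nearest-neighbor distances) are all assumptions chosen only for illustration.

```python
import numpy as np

def lamo_sketch(X_min, X_maj, n_new, k=5, rng=None):
    """Illustrative LAMO-style oversampler; every formula here is an assumption."""
    rng = np.random.default_rng(rng)
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.concatenate([np.zeros(len(X_min), bool), np.ones(len(X_maj), bool)])

    # Euclidean distances from every minority instance to every instance.
    d = np.linalg.norm(X_min[:, None, :] - X_all[None, :, :], axis=2)
    idx = np.arange(len(X_min))
    d[idx, idx] = np.inf  # a point is not its own neighbor

    # Step 1 (assumed rule): seeds are minority instances whose k nearest
    # neighbors contain some, but not only, majority instances.
    nn = np.argsort(d, axis=1)[:, :k]
    maj_frac = is_maj[nn].mean(axis=1)
    seed_mask = (maj_frac > 0) & (maj_frac < 1)
    if not seed_mask.any():
        seed_mask[:] = True  # fallback: use every minority instance
    seeds = X_min[seed_mask]

    # Step 2: local distribution of each seed, i.e. its distance to the
    # nearest majority instance and to the nearest other minority instance.
    d_seed = d[seed_mask]
    d_maj = d_seed[:, is_maj].min(axis=1)
    d_min = d_seed[:, ~is_maj].min(axis=1)

    # Step 3 (assumed formulas): one Gaussian component per seed.
    w = 1.0 / (d_maj + 1e-12)               # mixing coefficients
    w /= w.sum()
    sigma = np.minimum(d_maj, d_min) / 3.0  # isotropic bandwidths

    # Sample from the mixture: pick a component, then add Gaussian noise.
    comp = rng.choice(len(seeds), size=n_new, p=w)
    noise = rng.standard_normal((n_new, X_min.shape[1]))
    return seeds[comp] + sigma[comp, None] * noise

# Usage on toy data: 100 majority and 15 minority points in 2-D.
data_rng = np.random.default_rng(0)
X_maj = data_rng.normal(0.0, 1.0, size=(100, 2))
X_min = data_rng.normal(1.5, 0.5, size=(15, 2))
synthetic = lamo_sketch(X_min, X_maj, n_new=85, rng=0)
```

In this sketch, seeds that lie closer to the majority class receive a larger share of the synthetic instances, while a small bandwidth keeps the samples from spreading far into majority territory; the method in the paper may weight these quantities differently.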
