Genetic Algorithms Based Resampling for the Classification of Unbalanced Datasets

This paper proposes a resampling approach for the classification of unbalanced datasets. The method uses genetic algorithms to combine undersampling and oversampling according to a set of criteria and to determine the optimal unbalance rate. It has been tested on industrial datasets and on datasets from the literature. The results show an appreciable increase in the detection rate of rare patterns and an improvement in classification performance.
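The general idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the criteria, the encoding, or the classifier, so everything below is an assumption made for illustration. The genome is a hypothetical pair (fraction of majority samples kept, duplication factor for minority samples), oversampling is plain random duplication, the classifier is a toy one-dimensional threshold, and the fitness is balanced accuracy on a validation split.

```python
import random

def make_data(n_major, n_minor, rng):
    """Toy 1-D data (assumption): majority ~ N(0, 1), minority ~ N(2, 1)."""
    majority = [rng.gauss(0.0, 1.0) for _ in range(n_major)]
    minority = [rng.gauss(2.0, 1.0) for _ in range(n_minor)]
    return majority, minority

def resample(majority, minority, keep_frac, dup_factor, rng):
    """Randomly undersample the majority class and duplicate minority samples."""
    n_keep = max(1, round(keep_frac * len(majority)))
    kept = rng.sample(majority, n_keep)
    n_extra = round((dup_factor - 1.0) * len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return kept, minority + extra

def fit_threshold(majority, minority):
    """Toy classifier: threshold t with fewest training errors (minority if x > t)."""
    labeled = sorted([(x, 0) for x in majority] + [(x, 1) for x in minority])
    err = len(majority)  # t = -inf: every majority sample is a false alarm
    best_t, best_err = float("-inf"), err
    for i, (x, is_minority) in enumerate(labeled):
        err += 1 if is_minority else -1  # raising t to x labels x as majority
        last_of_ties = i + 1 == len(labeled) or labeled[i + 1][0] != x
        if last_of_ties and err < best_err:
            best_t, best_err = x, err
    return best_t

def balanced_accuracy(t, majority, minority):
    """Mean of the per-class recalls, so the rare class counts as much as the common one."""
    tnr = sum(x <= t for x in majority) / len(majority)
    tpr = sum(x > t for x in minority) / len(minority)
    return 0.5 * (tnr + tpr)

def fitness(genome, train, valid, seed):
    """Resample the training set as the genome says, fit, and score on validation data."""
    keep_frac, dup_factor = genome
    rng = random.Random(seed)  # fresh generator: fitness is deterministic per genome
    maj, mino = resample(train[0], train[1], keep_frac, dup_factor, rng)
    return balanced_accuracy(fit_threshold(maj, mino), valid[0], valid[1])

def evolve(train, valid, pop_size=12, generations=15, seed=0):
    """Small elitist GA over (keep_frac, dup_factor); returns the best genome and its score."""
    rng = random.Random(seed)
    pop = [(1.0, 1.0)]  # include the no-resampling baseline in the initial population
    pop += [(rng.uniform(0.05, 1.0), rng.uniform(1.0, 10.0)) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda g: fitness(g, train, valid, seed), reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection, parents survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            keep = rng.choice([a[0], b[0]]) + rng.gauss(0.0, 0.05)  # crossover + mutation
            dup = rng.choice([a[1], b[1]]) + rng.gauss(0.0, 0.5)
            children.append((min(1.0, max(0.05, keep)), min(10.0, max(1.0, dup))))
        pop = parents + children
    best = max(pop, key=lambda g: fitness(g, train, valid, seed))
    return best, fitness(best, train, valid, seed)

data_rng = random.Random(0)
train_set = make_data(300, 15, data_rng)  # roughly 5% rare patterns
valid_set = make_data(300, 15, data_rng)
best_genome, best_score = evolve(train_set, valid_set)
n_maj = max(1, round(best_genome[0] * 300))
n_min = 15 + round((best_genome[1] - 1.0) * 15)
print("best (keep_frac, dup_factor):", best_genome)
print("validation balanced accuracy:", best_score)
print("resulting minority share:", n_min / (n_maj + n_min))
```

The last printed value is the unbalance rate implied by the best genome, which is how a search of this kind selects the rate rather than fixing it in advance. Because the parents survive each generation and the untouched training set is part of the initial population, the returned score is never lower than the score obtained without resampling on the same validation split.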
