A hybrid sampling method for imbalanced data

As applications diversify and new, challenging tasks emerge, notably in computer vision, classical machine learning systems often perform poorly when confronted with two common problems: negative training examples that outnumber the positive ones, and large intra-class variations. Both problems degrade system performance. In this work, we propose to improve classification accuracy on imbalanced training data by balancing the training set with a hybrid approach: over-sampling the minority class using a “SMOTE star topology”, and under-sampling the majority class by removing the instances considered less relevant. Feature vectors are deleted with respect to intra-class variations, based on a distribution criterion. Experimental results on biometric data show that the proposed approach significantly improves overall performance, measured in terms of true-positive rate.
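The hybrid balancing idea can be illustrated with a minimal sketch. The abstract does not define the “SMOTE star topology” or the distribution criterion, so both are assumptions here: the star topology is taken to mean interpolating between each minority sample and the minority-class centroid, and the majority class is reduced by discarding the samples farthest from its own centroid. All function names (`star_oversample`, `centroid_undersample`, `hybrid_balance`) are hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def star_oversample(X_min, n_new, rng):
    """Create n_new synthetic minority samples.

    Assumption: "star topology" means each synthetic point lies on the
    segment between a randomly chosen minority sample and the class centroid.
    """
    centroid = X_min.mean(axis=0)
    idx = rng.integers(0, len(X_min), size=n_new)  # random base samples
    gaps = rng.random((n_new, 1))                  # interpolation factors in [0, 1)
    return X_min[idx] + gaps * (centroid - X_min[idx])


def centroid_undersample(X_maj, n_keep):
    """Keep the n_keep majority samples closest to the majority centroid.

    Assumption: distance to the class centroid stands in for the
    unspecified "distribution criterion" of the abstract.
    """
    centroid = X_maj.mean(axis=0)
    dist = np.linalg.norm(X_maj - centroid, axis=1)
    keep = np.argsort(dist)[:n_keep]
    return X_maj[keep]


def hybrid_balance(X_min, X_maj, seed=0):
    """Balance both classes to the midpoint of their original sizes."""
    rng = np.random.default_rng(seed)
    target = (len(X_min) + len(X_maj)) // 2
    n_new = max(target - len(X_min), 0)
    X_min_bal = np.vstack([X_min, star_oversample(X_min, n_new, rng)])
    X_maj_bal = centroid_undersample(X_maj, min(target, len(X_maj)))
    return X_min_bal, X_maj_bal


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X_min = rng.normal(loc=2.0, size=(10, 2))  # toy minority class
    X_maj = rng.normal(loc=0.0, size=(90, 2))  # toy majority class
    X_min_bal, X_maj_bal = hybrid_balance(X_min, X_maj)
    print(X_min_bal.shape, X_maj_bal.shape)
```

Balancing to the midpoint of the two class sizes is one possible choice of target; the abstract only states that the classes end up equally balanced.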
