New Oversampling Approaches Based on Polynomial Fitting for Imbalanced Data Sets

In classification tasks, the class-modular strategy has been widely used, and in many applications it has outperformed the classical strategy for pattern classification. However, in some modular architectures, such as one-against-all support vector machine classifiers, the training data can become heavily imbalanced, with one class far outnumbering the other. In this challenging situation, the trained classifier accurately classifies the majority class but marginalizes the minority class. As a result, the true negative rate (TNr) is very high while the true positive rate (TPr) is low. The main goal of this work is to improve TPr without sacrificing much TNr. In this paper, we propose oversampling the minority class using polynomial fitting functions. Four new approaches are proposed: star topology, bus topology, polynomial curve topology, and mesh topology. The star and mesh topology approaches yielded the best performance.
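To make the two evaluation rates and the general idea of polynomial-fitting oversampling concrete, the following is a minimal Python sketch. It is not the procedure of any of the four proposed topologies, whose details are not given here. The function names `tp_tn_rates` and `polyfit_oversample`, the restriction to two-dimensional features, and the choice to sample new points uniformly along the fitted curve are all assumptions made for illustration only.

```python
import numpy as np

def tp_tn_rates(y_true, y_pred, positive=1):
    """Return (TPr, TNr): the fraction of positives and of negatives
    that are classified correctly. Assumes both classes are present."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive
    neg = ~pos
    tpr = np.mean(y_pred[pos] == positive)   # TP / (TP + FN)
    tnr = np.mean(y_pred[neg] != positive)   # TN / (TN + FP)
    return float(tpr), float(tnr)

def polyfit_oversample(X_min, n_new, degree=2, seed=0):
    """Hypothetical oversampler (illustration only): fit a polynomial
    x2 = p(x1) to 2-D minority samples, then draw synthetic points
    that lie on the fitted curve within the observed x1 range."""
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    coeffs = np.polyfit(X_min[:, 0], X_min[:, 1], degree)
    x_new = rng.uniform(X_min[:, 0].min(), X_min[:, 0].max(), size=n_new)
    y_new = np.polyval(coeffs, x_new)
    return np.column_stack([x_new, y_new])

# Example: a classifier that misses half of the minority (positive) class.
rates = tp_tn_rates([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0])

# Example: five synthetic minority points along a fitted quadratic.
minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [3.0, 9.0]]
synthetic = polyfit_oversample(minority, n_new=5, degree=2)
```

In this sketch the synthetic points would be appended to the minority class before training, and TPr and TNr would then be compared with and without oversampling.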
