A new over-sampling technique based on SVM for imbalanced diseases data

Many real-world disease data sets are imbalanced: most records belong to normal subjects and only a minority belong to abnormal ones. Many researchers have ignored this imbalance, so their learning models are usually biased toward the majority (normal) class. To deal with this problem, a new over-sampling technique is proposed that over-samples the minority class to balance the data and improve the performance of the Support Vector Machine (SVM) on imbalanced disease data sets. First, a K-Nearest Neighbor (KNN) graph is built for the minority class. Second, a Minimum Spanning Tree (MST) is extracted from that graph. Third, synthetic samples are generated by applying SMOTE along the direct path in the tree. The proposed technique, combined with SVM, is evaluated on several disease data sets taken from the UCI Machine Learning Repository, and the experiments show that it can improve the Sensitivity and G-Mean values.
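The three steps can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`knn_graph`, `minimum_spanning_forest`, `oversample`) and the parameters `k`, `n_new`, and `seed` are hypothetical, and "SMOTE along the direct path in the tree" is read here, as an assumption, as SMOTE-style linear interpolation between the two endpoints of a randomly chosen tree edge. If the KNN graph is disconnected, Kruskal's algorithm yields a spanning forest rather than a single tree.

```python
import math
import random


def knn_graph(points, k):
    """Undirected edges (distance, i, j) joining each point to its k nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            edges.add((d, min(i, j), max(i, j)))
    return sorted(edges)  # ascending by distance, as Kruskal's algorithm needs


def minimum_spanning_forest(n, edges):
    """Kruskal's algorithm with union-find over edges sorted by distance."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree


def oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples on edges of the spanning forest."""
    rng = random.Random(seed)
    tree = minimum_spanning_forest(len(minority), knn_graph(minority, k))
    if not tree:  # fewer than two minority samples: nothing to interpolate
        return []
    synthetic = []
    for _ in range(n_new):
        i, j = rng.choice(tree)
        t = rng.random()  # interpolation weight in [0, 1), as in SMOTE
        a, b = minority[i], minority[j]
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
new_samples = oversample(minority, n_new=10, k=2, seed=1)
```

In this sketch the synthetic samples would be appended to the minority class before training the SVM. Restricting interpolation to tree edges, rather than to any KNN pair as in plain SMOTE, keeps each new sample on a short connection between minority points.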
