A learning method for the class imbalance problem with medical data sets

In medical data sets, "normal" samples predominate and only a small percentage are "abnormal", which gives rise to the so-called class imbalance problem. When all the data are fed into a classifier to build the learning model, the model is usually biased toward the majority class. To deal with this, this paper uses a strategy that over-samples the minority class and under-samples the majority class to balance the data sets. For the majority class, we build a Gaussian-type fuzzy membership function and apply an alpha-cut to reduce the data size; for the minority class, we use the mega-trend diffusion membership function to generate virtual samples. After balancing the class sizes, the paper further extends the data attributes into a higher-dimensional space using classification-related information to improve classification accuracy. Two medical data sets, Pima Indians' diabetes and the BUPA liver disorders, are used to illustrate the approach. The results indicate that the proposed method has better classification performance than SVM, the C4.5 decision tree, and the methods of two other studies.
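The under-sampling step for the majority class can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not say how the membership function is parameterised or how memberships are combined across attributes, so the sketch assumes a per-attribute Gaussian centred on the column mean with the column standard deviation as its width, takes the minimum membership across attributes, and uses a hypothetical threshold `alpha=0.5`. The function names `gaussian_membership` and `alpha_cut_undersample` are illustrative only. The mega-trend diffusion over-sampling step is not covered here.

```python
import numpy as np


def gaussian_membership(X):
    """Per-attribute Gaussian membership of each row of X.

    Assumption: the membership function is centred on the column mean
    and uses the column standard deviation as its width.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return np.exp(-((X - mean) ** 2) / (2.0 * std ** 2))


def alpha_cut_undersample(X, alpha=0.5):
    """Keep only the rows whose membership is at least alpha.

    Assumption: a row's overall membership is the minimum of its
    per-attribute memberships (one common fuzzy aggregation choice).
    Returns the retained rows and the boolean mask that selected them.
    """
    membership = gaussian_membership(X).min(axis=1)
    keep = membership >= alpha
    return X[keep], keep


# Synthetic stand-in for a majority class; not data from the paper.
rng = np.random.default_rng(0)
majority = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
reduced, mask = alpha_cut_undersample(majority, alpha=0.5)
print(len(majority), "->", len(reduced))
```

Raising `alpha` keeps only samples closer to the centre of the majority class and so shrinks it further; `alpha=0` keeps every sample.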
