SVM Learning from Imbalanced Data by GA Sampling for Protein Domain Prediction

Support vector machines (SVMs) have been studied extensively and have shown remarkable success in many applications, yet their performance drops significantly on imbalanced datasets. Some researchers have pointed out that this drop is difficult to avoid when one tries to improve the effectiveness of SVM on imbalanced datasets by modifying the algorithm alone. Sampling, applied as a data preprocessing step, is therefore a popular strategy for handling the class imbalance problem because it rebalances the dataset directly. In this paper, we propose a novel sampling method based on genetic algorithms (GA) to rebalance the imbalanced training dataset for SVM. To evaluate the final classifiers more impartially, the AUC (area under the ROC curve) is used as the fitness function in the GA. The experimental results show that the GA-based sampling strategy outperforms random sampling, and that our method is superior to an individual SVM for imbalanced protein domain boundary prediction. The prediction accuracy is about 70%, with an AUC value of 0.905.
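The idea of GA-driven sampling with AUC as the fitness can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the abstract does not say how a candidate sample is encoded, so the sketch assumes a bit mask over the majority class (bit i = 1 keeps example i), tournament selection, one-point crossover, bit-flip mutation, and elitism. The names `auc`, `ga_select_majority`, and `centroid_fitness` are hypothetical, and the 1-D nearest-centroid scorer only stands in for the SVM that would be trained on each candidate sample.

```python
import random


def auc(scores, labels):
    """AUC as the Wilcoxon-Mann-Whitney statistic: the fraction of
    (positive, negative) pairs ranked correctly, with ties counting half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum((1.0 if p > n else 0.5 if p == n else 0.0)
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ga_select_majority(n_majority, fitness, pop_size=20, generations=30,
                       p_mut=0.02, seed=0):
    """Evolve a bit mask over the majority class (assumed encoding).

    fitness(mask) must return a number to maximise, e.g. validation AUC.
    Requires n_majority >= 2 so that a crossover point exists.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_majority)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        scored = [(fitness(mask), mask) for mask in pop]

        def pick():
            # Binary tournament: the fitter of two random individuals wins.
            a, b = rng.sample(scored, 2)
            return a[1] if a[0] >= b[0] else b[1]

        children = [best[:]]  # elitism: carry the best mask over unchanged
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, n_majority)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit
                     for bit in child]              # bit-flip mutation
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best


def centroid_fitness(mask, majority, minority, val_x, val_y):
    """Stand-in fitness on 1-D data: score validation points by how much
    closer they lie to the minority centroid than to the centroid of the
    kept majority examples, then return the AUC of those scores."""
    kept = [x for x, bit in zip(majority, mask) if bit]
    if not kept:
        return 0.0  # an empty majority sample cannot define a classifier
    c_neg = sum(kept) / len(kept)
    c_pos = sum(minority) / len(minority)
    scores = [abs(x - c_neg) - abs(x - c_pos) for x in val_x]
    return auc(scores, val_y)


if __name__ == "__main__":
    majority = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 2.9, 3.1]  # toy negatives
    minority = [2.0, 2.2]                                # toy positives
    val_x, val_y = [0.3, 0.5, 2.1, 2.4, 3.0], [0, 0, 1, 1, 0]
    mask = ga_select_majority(
        len(majority),
        lambda m: centroid_fitness(m, majority, minority, val_x, val_y),
    )
    print(mask, centroid_fitness(mask, majority, minority, val_x, val_y))
```

In a setting closer to the one described, the fitness callable would train an SVM on the minority examples plus the kept majority examples and return its AUC on held-out data; the GA loop itself would not need to change.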
