Prediction for Essential Proteins with the Support Vector Machine ∗

Essential proteins affect the cellular life deeply, but it is hard to identify them. Protein-protein interaction is one of the ways to disclose whether a protein is essential or not. We notice that many researchers use the feature set composed of topology properties from protein-protein interaction to predict the essential proteins. However, the functionality of a protein is also a clue to determine its essentiality. The goal of this paper is to build SVM models for predicting the essential proteins. In our experiments, we download Scere20070107, which contains 4873 proteins and 17166 interactions, from DIP database. The ratio of essential proteins to nonessential proteins is nearly 1:4, so it is imbalanced. In the imbalanced dataset, the best values of F-measure, MCC, AIC and BIC of our models are 0.5197, 0.4671, 0.2428 and 0.2543, respectively. We build another balanced dataset with ratio 1:1. For balanced dataset, the best values of F-measure, MCC, AIC and BIC of our models are 0.7742, 0.5484, 0.3603 and 0.3828, respectively. Our results are superior to all previous results with various measurements.

[1]  Chih-Ying Lin,et al.  Disulfide bonding state prediction with SVM based on protein types , 2010, 2010 IEEE Fifth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA).

[2]  A. Barabasi,et al.  Network biology: understanding the cell's functional organization , 2004, Nature Reviews Genetics.

[3]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[4]  Chung-Yen Lin,et al.  Hubba: hub objects analyzer—a framework of interactome hubs identification for network biology , 2008, Nucleic Acids Res..

[5]  Kevin Barraclough,et al.  I and i , 2001, BMJ : British Medical Journal.

[6]  P. Stadler,et al.  Centers of complex networks. , 2003, Journal of theoretical biology.

[7]  Manoj Pratim Samanta,et al.  Global snapshot of a protein interaction network - a percolation based approach , 2003, Bioinform..

[8]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[9]  Hsuan-Cheng Huang,et al.  Predicting essential genes based on network and sequence analysis. , 2009, Molecular bioSystems.

[10]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[11]  Ney Lemke,et al.  Towards the prediction of essential genes by integration of network topology, cellular localization and biological process information , 2009, BMC Bioinformatics.

[12]  Mark Gerstein,et al.  The Importance of Bottlenecks in Protein Networks: Correlation with Gene Essentiality and Expression Dynamics , 2007, PLoS Comput. Biol..

[13]  Adam J. Smith,et al.  The Database of Interacting Proteins: 2004 update , 2004, Nucleic Acids Res..

[14]  Igor Jurisica,et al.  Functional topology in a network of protein interactions , 2004, Bioinform..

[15]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[16]  R. Ozawa,et al.  A comprehensive two-hybrid analysis to explore the yeast protein interactome , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[17]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[18]  A. Barabasi,et al.  Lethality and centrality in protein networks , 2001, Nature.

[19]  G. Arndt,et al.  Genome‐wide screening for gene function using RNAi in mammalian cells , 2005, Immunology and cell biology.

[20]  Stanley Wasserman,et al.  Social Network Analysis: Methods and Applications , 1994, Structural analysis in the social sciences.

[21]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[22]  Chih-Jen Lin,et al.  A Practical Guide to Support Vector Classication , 2008 .

[23]  Chia-Hao Chin,et al.  從蛋白質交互作用網絡中偵測必要性蛋白質與蛋白質功能模組 ; Prediction of Essential Proteins and Functional Modules from Protein-Protein Interaction Networks , 2010 .

[24]  W. Gehring,et al.  Functional redundancy: the respective roles of the two sloppy paired genes in Drosophila segmentation. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[25]  D. Ruppert The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2004 .