Prediction of Protein Essentiality by the Support Vector Machine with Statistical Tests

Essential proteins deeply affect cellular life, but identifying them experimentally is extremely time-consuming and labor-intensive. The goal of this paper is to identify the features that are crucial for discriminating protein essentiality and to build learning machines for prediction. We first collect features from a variety of sources. We then adopt a backward feature selection method and use the selected features to build SVM predictors. Cross-validation is conducted on the original imbalanced data set as well as on a balanced data set obtained by down-sampling. The performance of the selected feature subsets is then subjected to statistical tests to confirm its significance. For the imbalanced data set, our best F-measure and MCC are 0.549 and 0.495, respectively. For the balanced data set, our best F-measure and MCC are 0.770 and 0.545, respectively. These results are superior to all previously reported results under various performance measures.
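
The two reported measures and the down-sampling step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the F-measure is the balanced F1 score, that labels are coded 1 (essential) and 0 (non-essential), and that down-sampling keeps every minority-class sample plus an equally sized random draw from the majority class. The function names and the confusion-matrix counts in the usage lines are hypothetical.

```python
import math
import random


def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall (assumed beta = 1)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denom


def down_sample(labels, seed=0):
    """Return sorted indices of a balanced subset of the samples.

    Keeps all minority-class indices and a random, equally sized
    selection of majority-class indices (an assumed scheme).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return sorted(minority + rng.sample(majority, len(minority)))


# Hypothetical counts for an imbalanced test fold (not taken from the paper).
f1_example = f_measure(tp=8, fp=2, fn=2)
mcc_example = mcc(tp=8, tn=88, fp=2, fn=2)
balanced_idx = down_sample([1, 0, 0, 0, 1, 0])
```

Unlike accuracy, MCC uses all four confusion-matrix cells, which is why it is often reported alongside the F-measure on imbalanced data.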
