Prediction of Protein Essentiality by the Support Vector Machine with Statistical Tests

Essential proteins include the minimum set of proteins required to support cell life. Identifying essential proteins is important for understanding the cellular processes of an organism. However, identifying them experimentally is extremely time-consuming and labor-intensive, so alternative computational methods are needed. This study had two goals: identifying the important features and building learning machines for discriminating essential proteins. Data for Saccharomyces cerevisiae and Escherichia coli were used. We first collected information from a variety of sources. We next proposed a modified backward feature selection method and built support vector machine (SVM) predictors based on the selected features. To evaluate the performance, we conducted cross-validation on the original imbalanced data set and on a balanced data set obtained by down-sampling. Statistical tests were applied to the performance associated with the obtained feature subsets to confirm their significance. In the first data set, our best values of F-measure and Matthews correlation coefficient (MCC) were 0.549 and 0.495, respectively, in the imbalanced experiments. For the balanced experiment, the best values of F-measure and MCC were 0.770 and 0.545, respectively. In the second data set, our best values of F-measure and MCC were 0.421 and 0.407, respectively, in the imbalanced experiments. For the balanced experiment, the best values of F-measure and MCC were 0.718 and 0.448, respectively. The experimental results show that our selected features are compact and that the performance improved. Prediction can also be conducted by users at the following internet address: http://bio2.cse.nsysu.edu.tw/esspredict.aspx.
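
For readers unfamiliar with the two reported measures, the following is a minimal sketch of how F-measure and MCC can be computed from confusion-matrix counts, together with one plausible way to build a balanced set by down-sampling. The abstract does not specify the down-sampling procedure, so random under-sampling of the majority class is an assumption here, and the function names (`f_measure`, `mcc`, `down_sample`) are illustrative only and not taken from the paper.

```python
import math
import random


def f_measure(tp, fp, fn):
    """F-measure (F1): harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when any marginal count is zero
    return (tp * tn - fp * fn) / denom


def down_sample(labels, seed=0):
    """Return sorted indices of a balanced subset.

    Assumption: keep every minority-class example and draw an equally
    sized random sample (without replacement) from the majority class.
    Labels are expected to be 0 or 1.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))
    return sorted(minority + kept)
```

Unlike F-measure, MCC also uses the true-negative count. That difference may partly explain why the two measures respond differently when the class ratio changes between the imbalanced and balanced experiments.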
