On Position-Specific Scoring Matrix for Protein Function Prediction

While genome sequencing projects have generated tremendous amounts of protein sequence data for a vast number of genomes, substantial portions of most genomes are still unannotated. Despite the success of experimental methods for identifying protein functions, they are often lab intensive and time consuming. Thus, it is only practical to use in silico methods for the genome-wide functional annotations. In this paper, we propose new features extracted from protein sequence only and machine learning-based methods for computational function prediction. These features are derived from a position-specific scoring matrix, which has shown great potential in other bininformatics problems. We evaluate these features using four different classifiers and yeast protein data. Our experimental results show that features derived from the position-specific scoring matrix are appropriate for automatic function annotation.

[1]  Michael I. Jordan,et al.  Protein Molecular Function Prediction by Bayesian Phylogenomics , 2005, PLoS Comput. Biol..

[2]  S F Altschul,et al.  Iterated profile searches with PSI-BLAST--a tool for discovery in protein databases. , 1998, Trends in biochemical sciences.

[3]  Shoshana J. Wodak,et al.  CYGD: the Comprehensive Yeast Genome Database , 2004, Nucleic Acids Res..

[4]  Mei Liu,et al.  Protein Function Assignment through Mining Cross-Species Protein-Protein Interactions , 2008, PloS one.

[5]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[6]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[7]  Gary D Bader,et al.  BIND--The Biomolecular Interaction Network Database. , 2001, Nucleic acids research.

[8]  E. Marcotte,et al.  Predicting functional linkages from gene fusions with confidence. , 2002, Applied bioinformatics.

[9]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[10]  Søren Brunak,et al.  Prediction of human protein function according to Gene Ontology categories , 2003, Bioinform..

[11]  Pedro M. Domingos,et al.  On the Optimality of the Simple Bayesian Classifier under Zero-One Loss , 1997, Machine Learning.

[12]  Heiko Schoof,et al.  Protein function prediction and annotation in an integrated environment powered by web services (AFAWE) , 2008, Bioinform..

[13]  Alessandro Vespignani,et al.  Global protein function prediction from protein-protein interaction networks , 2003, Nature Biotechnology.

[14]  A. E. Hirsh,et al.  Coevolution of gene expression among interacting proteins , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[15]  X. Chen,et al.  SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence , 2003, Nucleic Acids Res..

[16]  D. Eisenberg,et al.  Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach , 2003, Genome Biology.

[17]  Terri K. Attwood,et al.  PRINTS and its automatic supplement, prePRINTS , 2003, Nucleic Acids Res..

[18]  D. Eisenberg,et al.  A combined algorithm for genome-wide prediction of protein function , 1999, Nature.

[19]  M. O. Dayhoff A model of evolutionary change in protein , 1978 .

[20]  C. DeLisi,et al.  Genes linked by fusion events are generally of the same functional category: A systematic analysis of 30 microbial genomes , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[21]  Michael J. E. Sternberg,et al.  ConFunc - functional annotation in the twilight zone , 2008, Bioinform..

[22]  Christine A. Orengo,et al.  FFPred: an integrated feature-based function prediction server for vertebrate proteomes , 2008, Nucleic Acids Res..

[23]  Gerrit T. S. Beemster,et al.  Gene expression trends and protein features effectively complement each other in gene function prediction , 2009, Bioinform..

[24]  D. Eisenberg,et al.  Protein function in the post-genomic era , 2000, Nature.

[25]  David T. Jones,et al.  Prediction of disordered regions in proteins from position specific score matrices , 2003, Proteins.

[26]  Sean R. Eddy,et al.  RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs , 2002, BMC Bioinformatics.

[27]  M. Sternberg,et al.  Automated prediction of protein function and detection of functional sites from structure. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[28]  L. Brieman,et al.  Classification and Regression Trees , 2020, Biostatistics with R.

[29]  C. Sander,et al.  Protein structure comparison by alignment of distance matrices. , 1993, Journal of molecular biology.

[30]  Vladimir Vapnik,et al.  An overview of statistical learning theory , 1999, IEEE Trans. Neural Networks.

[31]  Mei Liu,et al.  Knowledge-guided inference of domain–domain interactions from incomplete protein–protein interaction networks , 2009, Bioinform..

[32]  Mei Liu,et al.  Prediction of protein-protein interactions using random decision forest framework , 2005, Bioinform..

[33]  S. Kasif,et al.  Whole-genome annotation by using evidence integration in functional-linkage networks. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[34]  A. D. McLachlan,et al.  Profile analysis: detection of distantly related proteins. , 1987, Proceedings of the National Academy of Sciences of the United States of America.

[35]  Annabel E. Todd,et al.  Evolution of function in protein superfamilies, from a structural perspective. , 2001, Journal of molecular biology.

[36]  Edward M Marcotte,et al.  Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages , 2003, Nature Biotechnology.

[37]  Xue-wen Chen,et al.  Sequence-based prediction of protein interaction sites with an integrative method , 2009, Bioinform..

[38]  Philip E. Bourne,et al.  A database and tools for 3-D protein structure comparison and alignment using the Combinatorial Extension (CE) algorithm , 2001, Nucleic Acids Res..

[39]  P. Kersey,et al.  In Silico Characterization of Proteins: UniProt, InterPro and Integr8 , 2008, Molecular biotechnology.

[40]  C. Chothia,et al.  Understanding protein structure: using scop for fold interpretation. , 1996, Methods in enzymology.

[41]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[42]  Amos Bairoch,et al.  The PROSITE database , 2005, Nucleic Acids Res..

[43]  D. Eisenberg,et al.  Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[44]  D. Eisenberg,et al.  Detecting protein function and protein-protein interactions from genome sequences. , 1999, Science.