ESG: extended similarity group method for automated protein function prediction

MOTIVATION The importance of accurate automated protein function prediction keeps increasing as large numbers of newly sequenced genomes and proteomics datasets await biological interpretation. Conventional methods have focused on annotation transfer based on high sequence similarity, which relies on the concept of homology. However, many cases have been reported in which simply transferring function from the top hits of a homology search produces erroneous annotations. New methods are needed that handle sequence similarity more robustly, combining signals from strongly and weakly similar proteins to predict the function of unknown proteins with high reliability. RESULTS We present the extended similarity group (ESG) method, which performs iterative sequence database searches and annotates a query sequence with Gene Ontology terms. Each annotation is assigned a probability based on its relative similarity score with the multiple-level neighbors in the protein similarity graph. We show how the statistical framework of ESG improves prediction accuracy by iteratively taking into account the neighborhood of the query protein in sequence similarity space. ESG outperforms conventional PSI-BLAST and the protein function prediction (PFP) algorithm. The iterative search is found to be effective in capturing multiple domains in a query protein, enabling accurate prediction of several functions that originate from different domains. AVAILABILITY The ESG web server is available for automated protein function prediction at http://dragon.bio.purdue.edu/ESG/.
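
The idea of scoring annotations over multiple levels of neighbors can be illustrated with a minimal Python sketch. It is not the published ESG formulation: the function name `term_probabilities`, the `search` callback (standing in for a sequence database search that returns hits with positive similarity scores), the `mix` parameter, and the toy data are all hypothetical. The sketch assumes only that each hit is weighted by its score relative to the other hits at the same level, and that a hit's own Gene Ontology terms are blended with the terms propagated from that hit's own search.

```python
from collections import defaultdict


def term_probabilities(protein, search, annotations, levels=2, mix=0.5):
    """Score GO terms for `protein` from its multi-level neighbors.

    search(protein) -> {hit: positive similarity score}  (hypothetical callback)
    annotations     -> {protein: set of GO terms}
    levels          -> how many rounds of searching to follow
    mix             -> assumed weight of a hit's own terms versus terms
                       propagated from that hit's neighbors
    """
    hits = search(protein)
    total = sum(hits.values())
    if total <= 0:
        return {}

    probs = defaultdict(float)
    for hit, score in hits.items():
        weight = score / total  # relative similarity within this level
        direct = {term: 1.0 for term in annotations.get(hit, ())}
        deeper = {}
        if levels > 1:
            deeper = term_probabilities(hit, search, annotations, levels - 1, mix)
        if deeper:
            # Blend the hit's own annotation with its neighborhood's signal.
            for term in set(direct) | set(deeper):
                blended = mix * direct.get(term, 0.0) + (1 - mix) * deeper.get(term, 0.0)
                probs[term] += weight * blended
        else:
            # Last level, or a hit with no neighbors: use its own terms only.
            for term, value in direct.items():
                probs[term] += weight * value
    return dict(probs)


# Toy similarity graph and annotations (made-up identifiers and scores).
toy_graph = {
    "Q": {"A": 3.0, "B": 1.0},
    "A": {"C": 1.0},
    "B": {"C": 1.0, "D": 1.0},
}
toy_annotations = {
    "A": {"GO:1"},
    "B": {"GO:2"},
    "C": {"GO:1", "GO:3"},
    "D": {"GO:2"},
}


def toy_search(protein):
    return toy_graph.get(protein, {})


one_level = term_probabilities("Q", toy_search, toy_annotations, levels=1)
two_level = term_probabilities("Q", toy_search, toy_annotations, levels=2)
print(sorted(one_level.items()))
print(sorted(two_level.items()))
```

In this toy example, the term `GO:3` is absent from the direct hits of `Q` and only receives a score once the second level of neighbors is included, which is the kind of weak signal a single top-hit transfer would miss. A real pipeline would also need to filter the query itself out of each hit list and choose a score transformation for E-values, neither of which is shown here.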
