BLANNOTATOR: enhanced homology-based function prediction of bacterial proteins

Background

Automated function prediction has played a central role in determining the biological functions of bacterial proteins. Typically, protein function annotation relies on homology, and function is inferred from other proteins with similar sequences. This approach has become popular in bacterial genomics because it is one of the few methods that is practical for large datasets and because it does not require additional functional genomics experiments. However, the existing solutions produce erroneous predictions in many cases, especially when query sequences have low levels of identity with the annotated source protein. This problem has created a pressing need for improvements in homology-based annotation.

Results

We present an automated method for the functional annotation of bacterial protein sequences. Based on sequence similarity searches, BLANNOTATOR accurately annotates query sequences with one-line summary descriptions of protein function. It groups sequences identified by BLAST into subsets according to their annotation and bases its prediction on a set of sequences with consistent functional information. We show the results of BLANNOTATOR's performance in sets of bacterial proteins with known functions. We simulated the annotation process for 3090 SWISS-PROT proteins using a database in its state preceding the functional characterisation of the query protein. For this dataset, our method outperformed the five others that we tested, and the improved performance was maintained even in the absence of highly related sequence hits. We further demonstrate the value of our tool by analysing the putative proteome of Lactobacillus crispatus strain ST1.

Conclusions

BLANNOTATOR is an accurate method for bacterial protein function prediction. It is practical for genome-scale data and does not require pre-existing sequence clustering; thus, this method suits the needs of bacterial genome and metagenome researchers.
The method and a web server are available at http://ekhidna.biocenter.helsinki.fi/poxo/blannotator/.
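
The grouping idea described in the Results can be illustrated with a minimal sketch. This is not BLANNOTATOR's actual algorithm: the abstract does not specify how descriptions are compared or how groups are scored, so the normalisation step, the summed bit-score ranking, the function names, and the example hits below are all assumptions made purely for illustration.

```python
import re
from collections import defaultdict


def normalize(description):
    """Lowercase and strip punctuation so near-identical descriptions group together.

    Assumption: the real tool may compare descriptions quite differently.
    """
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", description.lower()).split())


def predict_description(hits):
    """Pick a one-line description from (description, bit_score) pairs.

    Hits are grouped by normalised description; the group with the highest
    summed bit score wins (a hypothetical scoring rule), and the description
    of its best-scoring member is returned.
    """
    groups = defaultdict(list)
    for description, bit_score in hits:
        groups[normalize(description)].append((description, bit_score))
    if not groups:
        return None
    best_group = max(groups.values(), key=lambda g: sum(score for _, score in g))
    return max(best_group, key=lambda hit: hit[1])[0]


# Invented hits for illustration only; not real BLAST output.
hits = [
    ("Glucokinase", 210.0),
    ("glucokinase", 190.0),
    ("Hypothetical protein", 250.0),
    ("Glucokinase.", 180.0),
]
print(predict_description(hits))
```

In this toy input the single best hit carries an uninformative description, while three lower-scoring hits agree with each other. Ranking groups rather than individual hits lets the consistent annotation win, which is the general behaviour the abstract describes.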
