Protein function prediction by massive integration of evolutionary analyses and multiple data sources

Background: Accurate protein function annotation is a severe bottleneck when utilizing the deluge of high-throughput, next generation sequencing data. Keeping database annotations up-to-date has become a major scientific challenge that requires the development of reliable automatic predictors of protein function. The CAFA experiment provided a unique opportunity to undertake comprehensive 'blind testing' of many diverse approaches for automated function prediction. We report on the methodology we used for this challenge and on the lessons we learnt.

Methods: Our method integrates into a single framework a wide variety of biological information sources, encompassing sequence, gene expression and protein-protein interaction data, as well as annotations in UniProt entries. The methodology transfers functional categories based on the results from complementary homology-based and feature-based analyses. We generated the final molecular function and biological process assignments by combining the initial predictions in a probabilistic manner, which takes into account the Gene Ontology hierarchical structure.

Results: We propose a novel scoring function called COmbined Graph-Information Content similarity (COGIC) score for the comparison of predicted functional categories and benchmark data. We demonstrate that our integrative approach provides increased scope and accuracy over both the component methods and the naïve predictors. In line with previous studies, we find that molecular function predictions are more accurate than biological process assignments.

Conclusions: Overall, the results indicate that there is considerable room for improvement in the field. It still remains for the community to invest a great deal of effort to make automated function prediction a useful and routine component in the toolbox of life scientists. As already witnessed in other areas, community-wide blind testing experiments will be pivotal in establishing standards for the evaluation of prediction accuracy, in fostering advancements and new ideas, and ultimately in recording progress.
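The abstract does not spell out how the initial predictions are combined, so the following is only an illustrative sketch of one common way to merge per-term scores from several component methods while respecting the Gene Ontology hierarchy. It assumes a noisy-OR combination (treating the methods as independent) followed by upward propagation, so that an ancestor term is never scored lower than any of its descendants. The function names, the toy term identifiers and the choice of noisy-OR are all assumptions made for illustration, not the authors' published procedure.

```python
from typing import Dict, List


def noisy_or(probabilities: List[float]) -> float:
    """Probability that at least one source is right, assuming independence."""
    remaining = 1.0
    for p in probabilities:
        remaining *= 1.0 - p
    return 1.0 - remaining


def combine_predictions(
    per_method: List[Dict[str, float]],
    parents: Dict[str, List[str]],
) -> Dict[str, float]:
    """Merge per-term scores from several methods, then propagate upward.

    per_method: one {term: probability} mapping per component method.
    parents:    {term: [direct parent terms]} describing the ontology DAG.
    """
    # Collect every score assigned to each term across the methods.
    collected: Dict[str, List[float]] = {}
    for scores in per_method:
        for term, p in scores.items():
            collected.setdefault(term, []).append(p)

    # Combine the scores for each term (assumed noisy-OR rule).
    combined = {term: noisy_or(ps) for term, ps in collected.items()}

    # An annotation with a term implies all of its ancestors, so raise
    # each ancestor to at least the score of its best-scoring descendant.
    def propagate(term: str, score: float) -> None:
        for parent in parents.get(term, []):
            if combined.get(parent, 0.0) < score:
                combined[parent] = score
                propagate(parent, score)

    for term, score in list(combined.items()):
        propagate(term, score)
    return combined


# Toy ontology with hypothetical term names: child -> parent -> root.
toy_parents = {"child": ["parent"], "parent": ["root"]}
toy_scores = [{"child": 0.5}, {"child": 0.5, "parent": 0.2}]
result = combine_predictions(toy_scores, toy_parents)
```

In this toy case the two 0.5 scores for `child` combine to 0.75, and that value is carried up to `parent` and `root`. Other consistent schemes, such as weighting methods by their estimated reliability, would fit the same interface.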
