A Statistical Model of Protein Sequence Similarity and Function Similarity Reveals Overly-Specific Function Predictions

Background: Predicting protein function from primary sequence is an important open problem in modern biology. Not only are there many thousands of proteins of unknown function, but current approaches for predicting function also need improvement. One problem in particular is overly specific function prediction, which we address here with a new statistical model of the relationship between protein sequence similarity and protein function similarity.

Methodology: Our statistical model is based on sets of proteins with experimentally validated functions and on numeric measures of function specificity and function similarity derived from the Gene Ontology. The model predicts the similarity of function between two proteins given their amino acid sequence similarity, as measured by statistics from the BLAST sequence alignment algorithm. A novel aspect of our model is that it predicts the degree of function similarity shared between two proteins over a continuous range of sequence similarity, facilitating prediction of function with an appropriate level of specificity.

Significance: Our model shows nearly exact function similarity for proteins with high sequence similarity (bit score >244.7, e-value <1e−62, non-redundant NCBI protein database (NRDB)) and only a small likelihood of a specific function match for proteins with low sequence similarity (bit score <54.6, e-value >1e−05, NRDB). For sequence similarity between these two ranges, our model shows an increasing relationship between function similarity and sequence similarity, but with considerable variability. We applied the model to a large set of proteins of unknown function and predicted functions for thousands of these proteins, ranging from general to very specific. We also applied the model to a data set of proteins with previously assigned, specific functions that had been inferred electronically. We show that, on average, these prior function predictions are more specific (quite possibly overly specific) than predictions from our model, which is based on proteins with experimentally determined function.
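The three sequence-similarity regimes described above can be sketched in a few lines of Python. This is only an illustration, not the authors' model: the two bit-score thresholds come from the abstract, while the function names, the treatment of scores that fall exactly on a threshold, and the search-space size are assumptions made for the example. The E-value helper uses the standard BLAST relation between bit score and E-value, E = (search space) × 2^(−bit score), which is why higher bit scores correspond to smaller E-values.

```python
# Bit-score thresholds quoted in the abstract (searches against NRDB).
HIGH_BIT_SCORE = 244.7
LOW_BIT_SCORE = 54.6


def similarity_regime(bit_score: float) -> str:
    """Place a BLAST bit score into one of the three regimes from the abstract.

    Scores exactly on a threshold are treated as intermediate (an assumption).
    """
    if bit_score > HIGH_BIT_SCORE:
        return "high"          # nearly exact function similarity reported
    if bit_score < LOW_BIT_SCORE:
        return "low"           # specific function match reported as unlikely
    return "intermediate"      # increasing but variable relationship


def expected_evalue(bit_score: float, search_space: float) -> float:
    """Standard BLAST relation: E = search_space * 2 ** (-bit_score).

    search_space (query length times database length) is an assumed input;
    the abstract does not state the value used for NRDB.
    """
    return search_space * 2.0 ** (-bit_score)


if __name__ == "__main__":
    assumed_space = 3.0e11  # hypothetical search-space size, illustration only
    for score in (30.0, 54.6, 120.0, 244.7, 400.0):
        e_value = expected_evalue(score, assumed_space)
        print(f"bit score {score:6.1f}  regime {similarity_regime(score):12s}  E ~ {e_value:.1e}")
```

A graded prediction, as the model provides, would replace the hard regime labels with a function-similarity estimate that varies continuously with the bit score; the labels here only mark where the abstract reports the two extremes.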

[26]  Miguel A. Andrade-Navarro,et al.  Evaluation of annotation strategies using an entire genome sequence , 2003, Bioinform..

[27]  Carole A. Goble,et al.  Investigating Semantic Similarity Measures Across the Gene Ontology: The Relationship Between Sequence and Annotation , 2003, Bioinform..

[28]  Eric R. Ziegel,et al.  The Elements of Statistical Learning , 2003, Technometrics.

[29]  Michael Y. Galperin,et al.  In Silico Metabolic Model and Protein Expression of Haemophilus influenzae Strain Rd KW20 in Rich Medium. , 2004, Omics : a journal of integrative biology.

[30]  Emily Dimmer,et al.  The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology , 2004, Nucleic Acids Res..

[31]  Geoffrey J. Barton,et al.  GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes , 2004, BMC Bioinformatics.

[32]  Thomas L. Madden,et al.  BLAST: at the core of a powerful and diverse set of sequence analysis tools , 2004, Nucleic Acids Res..

[33]  Richard J Roberts,et al.  Identifying Protein Function—A Call for Community Action , 2004, PLoS biology.

[34]  Michael Y. Galperin,et al.  'Conserved hypothetical' proteins: prioritization of targets for experimental study. , 2004, Nucleic acids research.

[35]  Michael Y. Galperin,et al.  Identification and functional analysis of ‘hypothetical’ genes expressed in Haemophilus influenzae , 2004 .

[36]  Gordon A Anderson,et al.  Global profiling of Shewanella oneidensis MR-1: expression of hypothetical genes and improved functional annotations. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[37]  A. Valencia Automatic annotation of protein function. , 2005, Current opinion in structural biology.

[38]  Michael S. Waterman,et al.  Computational Genome Analysis: An Introduction , 2007 .

[39]  Ute Baumann,et al.  Estimating the annotation error rate of curated GO database sequence annotations , 2007, BMC Bioinformatics.

[40]  R. Schmid Computational Genome Analysis: An Introduction. R. C. Deonier, S. Tavaré & M. S. Waterman. Springer. 2005. 515 pages. ISBN 0 387 98785 1. Price $79.95. (hardback) , 2006 .

[41]  Michael Y. Galperin,et al.  New metrics for comparative genomics. , 2006, Current opinion in biotechnology.

[42]  Trupti Joshi,et al.  Quantitative assessment of relationship between sequence similarity and function similarity , 2007, BMC Genomics.

[43]  Jian Ye,et al.  BLAST: improvements for better sequence analysis , 2006, Nucleic Acids Res..

[44]  Arthur M. Lesk,et al.  Quantitative sequence-function relationships in proteins based on gene ontology , 2007, BMC Bioinformatics.

[45]  Dmitrij Frishman,et al.  Protein annotation at genomic scale: the current status. , 2007, Chemical reviews.

[46]  Tatiana A. Tatusova,et al.  Entrez Gene: gene-centered information at NCBI , 2004, Nucleic Acids Res..

[47]  Tatiana Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[48]  Peter D. Karp,et al.  Annotation-based inference of transporter function , 2008, ISMB.

[49]  Alfonso Valencia,et al.  Modern Genome Annotation: The Biosapiens Network , 2008 .

[50]  Peter Tarczy-Hornoch,et al.  Validating annotations for uncharacterized proteins in Shewanella oneidensis. , 2008, Omics : a journal of integrative biology.

[51]  A. Valencia,et al.  Introduction BIOSAPIENS: A European Network of Excellence to develop genome annotation resources , 2008 .

[52]  Catia Pesquita,et al.  Metrics for GO based protein semantic similarity: a systematic evaluation , 2008, BMC Bioinformatics.

[53]  Robert D. Finn,et al.  InterPro: the integrative protein signature database , 2008, Nucleic Acids Res..

[54]  Narmada Thanki,et al.  CDD: specific functional annotation with the Conserved Domain Database , 2008, Nucleic Acids Res..

[55]  David A. Lee,et al.  Domain-based and family-specific sequence identity thresholds increase the levels of reliable protein function transfer. , 2009, Journal of molecular biology.