Modeling sequence and function similarity between proteins for protein functional annotation

A common task in biological research is to predict function for proteins by comparing sequences between proteins of known and unknown function. This is often done using pair-wise sequence alignment algorithms (e.g. BLAST). A problem with this approach is the assumption of a simple equivalence between a minimum sequence similarity threshold and the function similarity between proteins. This assumption is based on the binary concept of homology in that proteins are or not homologous. The relationship between sequence and function however is more complex as well as pertinent for predicting protein function, e.g. evaluating BLAST alignments or developing training sets for profile models based on functional rather than homologous groupings. Our motivation for this study was to model sequence and function similarity between proteins to gain insights into the "sequence-function similarity relationship between proteins for predicting function. Using our model we found that function similarity generally increases with sequence similarity but with a high degree of variability. This result has implications for pair-wise approaches in that it appears sequence similarity must be very high to ensure high function similarity. Profile models which enable higher sensitivity are a potential solution. However, multiple sequences alignments (a necessary prerequisite) are a problem in that current algorithms have difficulty aligning sequences with very low sequence similarity, which is common in our data set, or are intractable for high numbers of sequences. Given the importance of predicting protein function and the need for multiple sequence alignments, algorithms for accomplishing this task should be further refined and developed.

[1]  Alexey I Nesvizhskii,et al.  Initial Proteome Analysis of Model Microorganism Haemophilus influenzae Strain Rd KW20 , 2003, Journal of bacteriology.

[2]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[3]  P. Bork Powers and pitfalls in sequence analysis: the 70% hurdle. , 2000, Genome research.

[4]  Michael Y. Galperin,et al.  Identification and functional analysis of ‘hypothetical’ genes expressed in Haemophilus influenzae , 2004 .

[5]  Peter Tarczy-Hornoch,et al.  Validating annotations for uncharacterized proteins in Shewanella oneidensis. , 2008, Omics : a journal of integrative biology.

[6]  Catia Pesquita,et al.  Metrics for GO based protein semantic similarity: a systematic evaluation , 2008, BMC Bioinformatics.

[7]  A. Valencia Automatic annotation of protein function. , 2005, Current opinion in structural biology.

[8]  Ute Baumann,et al.  Estimating the annotation error rate of curated GO database sequence annotations , 2007, BMC Bioinformatics.

[9]  Carole A. Goble,et al.  Investigating Semantic Similarity Measures Across the Gene Ontology: The Relationship Between Sequence and Annotation , 2003, Bioinform..

[10]  B. Rost Twilight zone of protein sequence alignments. , 1999, Protein engineering.

[11]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[12]  S. Brenner Errors in genome annotation. , 1999, Trends in genetics : TIG.

[13]  C. Chothia,et al.  Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. , 2001, Journal of molecular biology.

[14]  M. A. McClure,et al.  Comparative analysis of multiple protein-sequence alignment methods. , 1994, Molecular biology and evolution.

[15]  T. Tatusova,et al.  Entrez Gene: gene-centered information at NCBI , 2010, Nucleic Acids Res..

[16]  Desmond G. Higgins,et al.  Analysis and Comparison of Benchmarks for Multiple Sequence Alignment , 2006, Silico Biol..

[17]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[18]  E. Kolker,et al.  A Statistical Model of Protein Sequence Similarity and Function Similarity Reveals Overly-Specific Function Predictions , 2009, PloS one.

[19]  Eugene Kolker,et al.  Quantifying Protein Function Specificity in the Gene Ontology , 2010, Standards in genomic sciences.

[20]  A. Valencia,et al.  Practical limits of function prediction , 2000, Proteins.

[21]  Gordon A Anderson,et al.  Global profiling of Shewanella oneidensis MR-1: expression of hypothetical genes and improved functional annotations. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[22]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.