Quantitative assessment of relationship between sequence similarity and function similarity

Background

Comparative sequence analysis is generally the first step in annotating new proteins during genome annotation. However, sequence comparison may create and propagate function assignment errors. It is therefore important to analyze the quality of sequence-based function assignment thoroughly and systematically, using large-scale data.

Results

We present an analysis of the relationship between sequence similarity and function similarity for the proteins in four model organisms: Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster. Using a measure of functional similarity based on the three categories of Gene Ontology (GO) classification (biological process, molecular function, and cellular component), we quantified the correlation between functional similarity and sequence similarity, where sequence similarity was measured by either sequence identity or the statistical significance of the alignment. We then compared this correlation against that obtained for randomly chosen protein pairs.

Conclusion

Various sequence-function relationships were identified for the three GO categories from four comparisons: BLAST versus PSI-BLAST, sequence identity versus Expectation Value, GO indices versus semantic similarity approaches, and within-genome versus between-genome comparisons. Our study provides a benchmark for estimating the confidence of function assignments based purely on sequence similarity.
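
As a rough illustration of the kind of measurement described in the abstract, the sketch below scores functional similarity as the number of leading levels shared by two hierarchical GO-style indices and averages that score within sequence-identity bins. It is a minimal sketch under stated assumptions, not the authors' implementation: the dotted index format, the function names, the bin width, and all protein identifiers and values are hypothetical.

```python
from statistics import mean


def shared_levels(index_a: str, index_b: str) -> int:
    """Count the leading levels shared by two dotted hierarchical indices.

    The dotted format (e.g. "1.2.3.4") is an assumption made for this sketch.
    """
    count = 0
    for a, b in zip(index_a.split("."), index_b.split(".")):
        if a != b:
            break
        count += 1
    return count


def functional_similarity(indices_a, indices_b) -> int:
    """Best shared-level count over all index pairs of two proteins."""
    return max(shared_levels(a, b) for a in indices_a for b in indices_b)


def mean_similarity_by_bin(pairs, annotations, bin_width=20):
    """Average functional similarity within sequence-identity bins.

    pairs: iterable of (protein_a, protein_b, percent_identity)
    annotations: dict mapping a protein to its list of indices
    """
    bins = {}
    for prot_a, prot_b, identity in pairs:
        lower = int(identity // bin_width) * bin_width  # lower edge of the bin
        score = functional_similarity(annotations[prot_a], annotations[prot_b])
        bins.setdefault(lower, []).append(score)
    return {lower: mean(scores) for lower, scores in sorted(bins.items())}


# Hypothetical toy data, invented purely for illustration.
annotations = {
    "P1": ["1.2.3.4"],
    "P2": ["1.2.3.5"],
    "P3": ["1.2.7"],
    "P4": ["1.8.1"],
}
pairs = [
    ("P1", "P2", 85.0),
    ("P1", "P3", 45.0),
    ("P1", "P4", 22.0),
    ("P2", "P3", 48.0),
]

result = mean_similarity_by_bin(pairs, annotations)
print(result)
```

Running the same function on randomly chosen protein pairs would give a baseline to compare against, in the spirit of the random-pair comparison the abstract mentions.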
