Investigating Correlation between Protein Sequence Similarity and Semantic Similarity Using Gene Ontology Annotations

Sequence similarity is a commonly used measure for comparing proteins. With the increasing use of ontologies, semantic (functional) similarity is gaining importance. The correlation between these two measures has been used to evaluate new semantic similarity methods and to predict protein function. In this research, we investigate the relationship between the two similarity measures. The results suggest that there is no strong correlation between sequence similarity and semantic similarity: a large number of proteins have low sequence similarity but high semantic similarity. We observe that Pearson's correlation coefficient is not sufficient to explain the nature of this relationship. Interestingly, term semantic similarity values above 0 and below 1 do not seem to play a role in improving the correlation. That is, the correlation coefficient depends only on the number of GO terms shared by the proteins under comparison, and the semantic similarity measurement method does not influence it. Semantic similarity and sequence similarity thus behave distinctly. These findings have significant implications for future work on protein comparison and will help to better understand the semantic similarity between proteins.
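To make the comparison described above concrete, the following is a minimal sketch, not the procedure used in this study. It computes Pearson's correlation coefficient between sequence similarity and a semantic score based only on shared terms (term similarity is 1 for identical terms and 0 otherwise). The function names, the term labels (`T1`, `T2`, ...), and all numeric values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def common_term_score(terms_a, terms_b):
    """Share of common annotation terms (Jaccard index).

    Assumption for this sketch: two terms count as similar only if they
    are identical, so graded term similarities between 0 and 1 are ignored.
    """
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical protein pairs: (sequence similarity, terms of A, terms of B).
# The last two pairs have low sequence similarity but many shared terms.
pairs = [
    (0.95, {"T1", "T2"}, {"T1", "T2"}),
    (0.80, {"T1", "T2", "T3"}, {"T1", "T2"}),
    (0.40, {"T1", "T4"}, {"T5", "T6"}),
    (0.15, {"T1", "T2"}, {"T1", "T2"}),
    (0.10, {"T3", "T4", "T5"}, {"T3", "T4"}),
]

seq_sim = [p[0] for p in pairs]
sem_sim = [common_term_score(p[1], p[2]) for p in pairs]
r = pearson(seq_sim, sem_sim)
print(f"Pearson r = {r:.3f}")
```

In a sketch like this, a single coefficient summarizes the whole set of pairs, so inspecting the scatter of `seq_sim` against `sem_sim` alongside `r` gives a fuller picture of pairs with low sequence similarity and high semantic similarity.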
