Metrics for GO based protein semantic similarity: a systematic evaluation

BackgroundSeveral semantic similarity measures have been applied to gene products annotated with Gene Ontology terms, providing a basis for their functional comparison. However, it is still unclear which is the best approach to semantic similarity in this context, since there is no conclusive evaluation of the various measures. Another issue, is whether electronic annotations should or not be used in semantic similarity calculations.ResultsWe conducted a systematic evaluation of GO-based semantic similarity measures using the relationship with sequence similarity as a means to quantify their performance, and assessed the influence of electronic annotations by testing the measures in the presence and absence of these annotations. We verified that the relationship between semantic and sequence similarity is not linear, but can be well approximated by a rescaled Normal cumulative distribution function. Given that the majority of the semantic similarity measures capture an identical behaviour, but differ in resolution, we used the latter as the main criterion of evaluation.ConclusionsThis work has provided a basis for the comparison of several semantic similarity measures, and can aid researchers in choosing the most adequate measure for their work. We have found that the hybrid simGIC was the measure with the best overall performance, followed by Resnik's measure using a best-match average combination approach. We have also found that the average and maximum combination approaches are problematic since both are inherently influenced by the number of terms being combined. We suspect that there may be a direct influence of data circularity in the behaviour of the results including electronic annotations, as a result of functional inference from sequence similarity.

[1]  Olivier Bodenreider,et al.  Ontology-driven similarity approaches to supporting gene func- tional assessment , 2005 .

[2]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[3]  A. Valencia,et al.  Intrinsic errors in genome annotation. , 2001, Trends in genetics : TIG.

[4]  Mário J. Silva,et al.  Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors , 2005, CIKM '05.

[5]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[6]  Trupti Joshi,et al.  Quantitative assessment of relationship between sequence similarity and function similarity , 2007, BMC Genomics.

[7]  Carol Friedman,et al.  Information theory applied to the sparse gene ontology annotation network to predict novel gene function , 2007, ISMB/ECCB.

[8]  Catia Pesquita,et al.  Evaluating GO-based Semantic Similarity Measures , 2007 .

[9]  Gene Ontology Consortium The Gene Ontology (GO) database and informatics resource , 2003 .

[10]  Rolf Apweiler,et al.  GOAnnotator: linking protein GO annotations to evidence text , 2006, Journal of biomedical discovery and collaboration.

[11]  A. Valencia Automatic annotation of protein function. , 2005, Current opinion in structural biology.

[12]  Lei Qin,et al.  Semantic search among heterogeneous biological databases based on gene ontology. , 2004, Acta biochimica et biophysica Sinica.

[13]  Carole A. Goble,et al.  Semantic Similarity Measures as Tools for Exploring the Gene Ontology , 2002, Pacific Symposium on Biocomputing.

[14]  Angel Rubio,et al.  Correlation between gene expression and GO semantic similarity , 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[15]  Philip Resnik,et al.  Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language , 1999, J. Artif. Intell. Res..

[16]  Emily Dimmer,et al.  The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology , 2004, Nucleic Acids Res..

[17]  Xiaomei Wu,et al.  Prediction of yeast protein–protein interaction network: insights from the Gene Ontology and annotations , 2006, Nucleic acids research.

[18]  Thomas Lengauer,et al.  A new measure for functional similarity of gene products based on Gene Ontology , 2006, BMC Bioinformatics.

[19]  Christian Posse,et al.  XOA: Web-Enabled Cross-Ontological Analytics , 2007, 2007 IEEE Congress on Services (Services 2007).

[20]  Doheon Lee,et al.  Modularized learning of genetic interaction networks from biological annotations and mRNA expression data , 2005, Bioinform..

[21]  Hai Hu,et al.  Assessing semantic similarity measures for the characterization of human regulatory pathways , 2006, Bioinform..

[22]  Carole A. Goble,et al.  Investigating Semantic Similarity Measures Across the Gene Ontology: The Relationship Between Sequence and Annotation , 2003, Bioinform..

[23]  Mário J. Silva,et al.  Measuring semantic similarity between Gene Ontology terms , 2007, Data Knowl. Eng..

[24]  Anita Burgun-Parenthoine,et al.  A transversal approach to predict gene product networks from ontology-based similarity , 2007, BMC Bioinformatics.

[25]  Tero Aittokallio,et al.  Improving missing value estimation in microarray data with gene ontology , 2006, Bioinform..

[26]  Emily Dimmer,et al.  An evaluation of GO annotation retrieval for BioCreAtIvE and GOA , 2005, BMC Bioinformatics.

[27]  Lincoln Stein,et al.  Genome annotation: from sequence to biology , 2001, Nature Reviews Genetics.

[28]  Fatima Al-Shahrour,et al.  Ontology-driven approaches to analyzing data in functional genomics. , 2006, Methods in molecular biology.

[29]  B Marshall,et al.  Gene Ontology Consortium: The Gene Ontology (GO) database and informatics resource , 2004, Nucleic Acids Res..

[30]  Cathy H. Wu,et al.  The Universal Protein Resource (UniProt): an expanding universe of protein information , 2005, Nucleic Acids Res..

[31]  A. Valencia,et al.  Practical limits of function prediction , 2000, Proteins.

[32]  Lothar Reichel,et al.  The relationship between protein sequences and their gene ontology functions , 2006, First International Multi-Symposiums on Computer and Computational Sciences (IMSCCS'06).

[33]  Yang Dai,et al.  Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction , 2006, BMC Bioinformatics.

[34]  Catia Pesquita,et al.  ProteInOn: A Web Tool for Protein Semantic Similarity , 2007 .

[35]  Osamu Yoshie,et al.  Extensive expansion and diversification of the chemokine gene family in zebrafish: Identification of a novel chemokine subfamily CX , 2008, BMC Genomics.

[36]  Safaai Deris,et al.  A genetic similarity algorithm for searching the Gene Ontology terms and annotating anonymous protein sequences , 2008, J. Biomed. Informatics.