Evaluating the effect of annotation size on measures of semantic similarity

Background: Ontologies are widely used as metadata in biological and biomedical datasets. Measures of semantic similarity use ontologies to determine how similar two entities are, given the ontology classes with which they are annotated. Semantic similarity is increasingly applied in areas ranging from the diagnosis of disease to the investigation of gene networks and the functions of gene products.

Results: Here, we analyze a large number of semantic similarity measures and the sensitivity of similarity values to the number of annotations of entities, to the difference in annotation size, and to the depth or specificity of annotation classes. We find that most similarity measures are sensitive to all three factors. Well-studied and richly annotated entities will usually show higher similarity than entities with only a few annotations, even in the absence of any biological relation.

Conclusions: Our findings may have a significant impact on the interpretation of results that rely on measures of semantic similarity, and we demonstrate how the sensitivity to annotation size can lead to a bias when using semantic similarity to predict protein-protein interactions.
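The annotation-size effect described above can be illustrated with a small simulation. The sketch below is not the experimental setup of this study. It assumes a toy taxonomy (a complete binary tree), entities annotated with classes drawn uniformly at random, and two simple set-based measures (term overlap and Jaccard similarity over ancestor-closed annotation sets). All names and parameters are hypothetical. Because the annotations are random, any similarity between two entities is unrelated to biology, yet both measures tend to increase with the number of annotations.

```python
import random

# Assumption: a toy taxonomy shaped as a complete binary tree.
# Classes are numbered 1..N_CLASSES heap-style, so the parent of n is n // 2
# and class 1 is the root.
DEPTH = 10
N_CLASSES = 2 ** DEPTH - 1


def ancestors_closure(annotations):
    """Return the annotated classes together with all of their ancestors."""
    closed = set()
    for node in annotations:
        # Walk up to the root; stop early if this path was already added.
        while node >= 1 and node not in closed:
            closed.add(node)
            node //= 2
    return closed


def random_entity(size, rng):
    """An entity annotated with `size` distinct classes chosen at random."""
    return ancestors_closure(rng.sample(range(1, N_CLASSES + 1), size))


def jaccard(a, b):
    """Jaccard similarity of two ancestor-closed annotation sets."""
    return len(a & b) / len(a | b)


def mean_similarity(size, trials=500, seed=0):
    """Mean term overlap and mean Jaccard for random entity pairs."""
    rng = random.Random(seed)
    total_overlap = 0
    total_jaccard = 0.0
    for _ in range(trials):
        a = random_entity(size, rng)
        b = random_entity(size, rng)
        total_overlap += len(a & b)
        total_jaccard += jaccard(a, b)
    return total_overlap / trials, total_jaccard / trials


if __name__ == "__main__":
    for size in (2, 10, 50):
        overlap, jac = mean_similarity(size)
        print(f"annotations={size:3d}  mean overlap={overlap:7.2f}  "
              f"mean Jaccard={jac:.3f}")
```

In this toy setting, richly annotated entities share most of the upper levels of the taxonomy purely by chance, which raises both measures. Measures based on information content or best-match averaging may behave differently and would need to be checked separately.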
