An integrative measure of graph- and vector-based semantic similarity using information content distance

Gene Ontology (GO) and its annotation data have been widely used for genomic and proteomic analysis. In the past few years, various GO-based semantic similarity measures have been proposed to quantify the functional similarity between two proteins and to assess the validity of protein-protein interactions (PPIs). These measures are categorized as pairwise or groupwise approaches according to the strategy used to derive protein-to-protein functional similarity. We propose a novel semantic similarity measure, called simVICD, which is a graph- and vector-based groupwise approach. This method computes the magnitude of a common induced subgraph as the semantic similarity between the two sets of terms that annotate two proteins. The magnitude of the common induced subgraph is represented as the Euclidean norm of a vector whose components are the information content distances of all possible directed shortest paths in the induced subgraph. Our experimental results show that the proposed groupwise approach, simVICD, and a previous integrative pairwise approach, simICND, outperform the other existing semantic similarity methods in predicting protein complexes and identifying essential proteins.
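
The subgraph magnitude described above can be illustrated with a minimal sketch. It is not the authors' implementation. The toy ontology, the term names, and the information content (IC) values are hypothetical, and the function names are invented for this example. The sketch also assumes that IC never decreases from an ancestor to a descendant, so the IC distance along any directed path equals the IC difference between its two endpoints. Any normalization the full method may apply is not shown.

```python
import math

# Hypothetical toy ontology (not real GO terms): each term maps to its parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
    "E": ["C"],
}

# Hypothetical information content values; assumed non-decreasing from
# ancestor to descendant.
IC = {"root": 0.0, "A": 1.0, "B": 1.5, "C": 2.0, "D": 2.5, "E": 3.0}


def ancestors_inclusive(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def induced_terms(terms):
    """Terms of the subgraph induced by an annotation set (ancestor closure)."""
    result = set()
    for term in terms:
        result |= ancestors_inclusive(term)
    return result


def common_subgraph_magnitude(terms1, terms2):
    """Euclidean norm of the IC distances over all ancestor-descendant pairs
    in the common induced subgraph (unnormalized; an illustrative assumption)."""
    common = induced_terms(terms1) & induced_terms(terms2)
    components = []
    for descendant in common:
        # The common set is ancestor-closed, so every ancestor is in it too.
        for ancestor in ancestors_inclusive(descendant) - {descendant}:
            components.append(IC[descendant] - IC[ancestor])
    return math.sqrt(sum(c * c for c in components))


if __name__ == "__main__":
    # Two hypothetical proteins annotated with {"E"} and {"C"}.
    print(common_subgraph_magnitude({"E"}, {"C"}))
```

In this toy setup, two annotation sets that share only the root yield an empty vector and a magnitude of zero, while sets sharing deeper, more specific terms yield larger magnitudes.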
