Supervised Biomedical Semantic Similarity

Semantic similarity between concepts in knowledge graphs is essential for several bioinformatics applications, including the prediction of protein-protein interactions and the discovery of associations between diseases and genes. Although knowledge graphs describe entities from several perspectives (or semantic aspects), state-of-the-art semantic similarity measures are general-purpose. This is a challenge because different use cases may need different similarity perspectives, and they ultimately depend on expert knowledge for manual fine-tuning. We present a new approach that uses supervised machine learning to tailor aspect-oriented semantic similarity measures to a particular view of biological similarity or relatedness. We implement and evaluate it using different combinations of representative semantic similarity measures and machine learning methods with four biological similarity views: protein-protein interaction, protein function similarity, protein sequence similarity, and phenotype-based gene similarity. The results show that our approach outperforms unsupervised methods, producing semantic similarity models that fit different biological perspectives significantly better than the commonly used manual combinations of semantic aspects.
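The core idea, learning how to combine per-aspect similarity scores so that they fit a chosen similarity view, can be illustrated with a minimal sketch. This is not the authors' implementation: the aspect names, the synthetic data, the target weights (0.7, 0.2, 0.1), and the choice of ordinary least-squares regression are all assumptions made for illustration, and the abstract does not say which learning methods or measures were used. The sketch compares a learned combination against a fixed average of the aspects, one example of a manual combination.

```python
import random

# Hypothetical semantic aspects; the abstract does not name specific ones.
ASPECTS = ["biological_process", "molecular_function", "cellular_component"]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_weights(features, targets):
    """Least-squares fit; returns one weight per aspect plus an intercept."""
    rows = [f + [1.0] for f in features]  # append a bias column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, scores):
    """Weighted sum of aspect scores plus the intercept (last weight)."""
    return sum(w * s for w, s in zip(weights, scores)) + weights[-1]


def mse(predict_fn, pairs):
    """Mean squared error of a similarity function against the proxy."""
    return sum((predict_fn(s) - t) ** 2 for s, t in pairs) / len(pairs)


# Synthetic entity pairs: per-aspect similarity scores and a proxy similarity
# (standing in for a view such as sequence similarity). Assumed, not real data.
rng = random.Random(0)


def make_pair():
    scores = [rng.random() for _ in ASPECTS]
    proxy = 0.7 * scores[0] + 0.2 * scores[1] + 0.1 * scores[2]
    return scores, proxy + rng.gauss(0, 0.02)


data = [make_pair() for _ in range(200)]
train, held_out = data[:150], data[150:]

weights = fit_weights([s for s, _ in train], [t for _, t in train])

# Compare the learned combination with a fixed average on held-out pairs.
mse_learned = mse(lambda s: predict(weights, s), held_out)
mse_average = mse(lambda s: sum(s) / len(s), held_out)

print("learned aspect weights:", [round(w, 3) for w in weights[:-1]])
print("held-out MSE, learned combination:", mse_learned)
print("held-out MSE, average of aspects:", mse_average)
```

In this toy setting the learned weights approximately recover the relative importance of each aspect for the chosen view, which a fixed average cannot express. Any other supervised regressor could take the place of the least-squares fit.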
