OntoSem: an Ontology Semantic Representation Methodology for Biomedical Domain

Ontologies are essential description tools for biomedical concepts and entities, supporting biomedical fundamental research such as semantic similarity analysis, protein-protein interaction prediction and so on. An increasing amount of ontology-like domain knowledge is published in scientific publications, meanwhile, advanced natural language processing (NLP) techniques have been widespread to extract information from text resources automatically, both of which facilitate the exploration of the semantic representation of biomedical ontologies. We propose a novel distributional semantic representation methodology based on the combination of two pre-trained and domain-specific word embedding tools, the non-contextualized Word2Vec and the context-dependent NCBI-blueBERT, to enhance the encoding ability for biomedical ontologies. Furthermore, we utilize a randomly initialized bidirectional LSTM to project the obtained word vector sequence to a fixed-length sentence vector, facilitating a flexible and uniform way for the computation of downstream tasks. We evaluate our method in two categories of tasks: the similarity access of ontology terms, and the ontology annotationbased protein-protein interaction classification. Experimental results demonstrate that our method provides encouraging results compared to the baselines in all tests. Our approach offers promising opportunities for representing ontologies semantics and in turn characterizing entities including proteins in biomedical research.

[1]  Zhiyong Lu,et al.  Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets , 2019, BioNLP@ACL.

[2]  Xin Gao,et al.  Onto2Vec: joint vector-based representation of biological entities and their ontology-based annotations , 2018, Bioinform..

[3]  Behnam Neyshabur,et al.  Predicting protein‐protein interactions through sequence‐based deep learning , 2018, Bioinform..

[4]  The Gene Ontology Consortium,et al.  The Gene Ontology Resource: 20 years and still GOing strong , 2018, Nucleic Acids Res..

[5]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[6]  Jaewoo Kang,et al.  BioBERT: a pre-trained biomedical language representation model for biomedical text mining , 2019, Bioinform..

[7]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[8]  Olivier Bodenreider,et al.  Non-Lexical Approaches to Identifying Associative Relations in the Gene Ontology , 2004, Pacific Symposium on Biocomputing.

[9]  Feichen Shen,et al.  HPO2Vec+: Leveraging heterogeneous knowledge resources to enrich node embeddings for the Human Phenotype Ontology , 2019, J. Biomed. Informatics.

[10]  Thomas Lengauer,et al.  A new measure for functional similarity of gene products based on Gene Ontology , 2006, BMC Bioinformatics.

[11]  Yuh-Jyh Hu,et al.  Protein-protein interaction prediction using a hybrid feature representation and a stacked generalization scheme , 2019, BMC Bioinformatics.

[12]  Yadong Wang,et al.  Identifying cross-category relations in gene ontology and constructing genome-specific term association networks , 2013, BMC Bioinformatics.

[13]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[14]  Mark A. Ragan,et al.  Gene Ontology-driven inference of protein-protein interactions using inducers , 2011 .

[15]  Xin Gao,et al.  OPA2Vec: combining formal and informal content of biomedical ontologies to improve similarity-based prediction , 2018, Bioinform..

[16]  Xiaomei Wu,et al.  Prediction of yeast protein–protein interaction network: insights from the Gene Ontology and annotations , 2006, Nucleic acids research.

[17]  Iz Beltagy,et al.  SciBERT: A Pretrained Language Model for Scientific Text , 2019, EMNLP.

[18]  Yadong Wang,et al.  Identifying term relations cross different gene ontology categories , 2017, BMC Bioinformatics.

[19]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[20]  Douwe Kiela,et al.  No Training Required: Exploring Random Encoders for Sentence Classification , 2019, ICLR.

[21]  Carlo Zaniolo,et al.  Multifaceted protein–protein interaction prediction based on Siamese residual RCNN , 2019, Bioinform..

[22]  Barry Smith,et al.  Dependence Relationships between Gene Ontology Terms based on TIGR Gene Product Annotations , 2004 .