Combining entity co-occurrence with specialized word embeddings to measure entity relation in Alzheimer’s disease

Extracting useful information from biomedical literature plays an important role in the development of modern medicine. In natural language processing, there have been rigorous attempts to find meaningful relationships between entities automatically using co-occurrence-based methods. It has become increasingly important to determine whether a relationship exists between any two entities extracted from a large body of text and, if so, how strong it is. One of the defining approaches is to measure the semantic similarity and relatedness between two entities. We propose a hybrid ranking method that measures the relatedness of two entities by combining a co-occurrence approach, which considers both direct and indirect entity-pair relationships, with specialized word embeddings. We compare the proposed ranking method with other well-known methods, namely co-occurrence, Word2Vec, COALS (Correlated Occurrence Analog to Lexical Semantics), and random indexing, by calculating the top-ranked entities related to Alzheimer’s disease. In addition, we analyze gene, pathway, and gene–phenotype relationships. Overall, the proposed method tends to find more hidden relationships than the other methods: it selects related entities that not only co-occur frequently with the target entity but also have more indirect relations to it. In pathway analysis, the proposed method shows superior performance at identifying (functional) cross clustering and higher-level pathways. In phenotype analysis, the proposed method has an advantage in identifying the common genotypes related to phenotypes in the biological literature.
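As a rough illustration of the idea of combining direct co-occurrence, indirect co-occurrence, and embedding similarity into one ranking, the following minimal Python sketch may help. It is not the scoring formula used in the paper, which the abstract does not state. The weighted sum, the weights `alpha`, `beta`, and `gamma`, the shared-neighbor definition of the indirect score, the function names, and the toy documents and two-dimensional vectors are all assumptions made up for this example.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cooccurrence_counts(docs):
    """Count how often each unordered entity pair appears in the same document."""
    counts = Counter()
    for entities in docs:
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts


def direct_score(counts, a, b):
    """Direct relation: raw co-occurrence count of the pair (0 if never seen)."""
    return counts[tuple(sorted((a, b)))]


def indirect_score(counts, vocab, a, b):
    """Indirect relation (assumed definition): strength of shared neighbors c,
    summed as min(co(a, c), co(c, b)) over all other entities c."""
    return sum(
        min(direct_score(counts, a, c), direct_score(counts, c, b))
        for c in vocab
        if c not in (a, b)
    )


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_related(target, docs, embeddings, alpha=0.4, beta=0.3, gamma=0.3):
    """Rank all other entities by a weighted sum of max-normalized direct and
    indirect co-occurrence plus embedding cosine similarity (hypothetical weights)."""
    counts = cooccurrence_counts(docs)
    vocab = {e for entities in docs for e in entities}
    candidates = sorted(vocab - {target})
    direct = {c: direct_score(counts, target, c) for c in candidates}
    indirect = {c: indirect_score(counts, vocab, target, c) for c in candidates}
    max_direct = max(direct.values(), default=0) or 1
    max_indirect = max(indirect.values(), default=0) or 1
    scores = {
        c: alpha * direct[c] / max_direct
        + beta * indirect[c] / max_indirect
        + gamma * cosine(embeddings[target], embeddings[c])
        for c in candidates
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only: each document is a list of entities.
docs = [
    ["Alzheimer's disease", "APOE", "tau"],
    ["Alzheimer's disease", "APOE"],
    ["APOE", "cholesterol"],
    ["tau", "cholesterol"],
]
# Made-up 2-d vectors standing in for specialized word embeddings.
embeddings = {
    "Alzheimer's disease": [1.0, 0.0],
    "APOE": [0.9, 0.1],
    "tau": [0.8, 0.3],
    "cholesterol": [0.2, 0.9],
}

ranking = rank_related("Alzheimer's disease", docs, embeddings)
for entity, score in ranking:
    print(f"{entity}\t{score:.3f}")
```

In this toy setup, an entity that never appears in the same document as the target can still receive a nonzero score through its shared neighbors and its embedding similarity, which is the kind of hidden relationship a purely direct co-occurrence ranking would miss.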
