BioWordVec, improving biomedical word embeddings with subword information and MeSH

Distributed word representations have become an essential foundation for biomedical natural language processing (BioNLP), text mining and information retrieval. Word embeddings are traditionally computed at the word level from a large corpus of unlabeled text, ignoring both the information present in the internal structure of words and any information available in domain-specific structured resources such as ontologies. However, such information holds potential for greatly improving the quality of the word representation, as suggested by some recent studies in the general domain. Here we present BioWordVec: an open set of biomedical word vectors/embeddings that combines subword information from unlabeled biomedical text with a widely used biomedical controlled vocabulary called Medical Subject Headings (MeSH). We assess both the validity and utility of our generated word embeddings over multiple NLP tasks in the biomedical domain. Our benchmarking results demonstrate that our word embeddings can result in significantly improved performance over the previous state of the art in those challenging tasks.

Design Type(s): data transformation objective • data integration objective • text processing and analysis objective
Measurement Type(s): word representation
Technology Type(s): Text_Mining
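
As a rough illustration of what "subword information" means in this setting, the sketch below splits a word into character n-grams in the style of the general-domain subword embedding model, where a word's vector is built from the vectors of its n-grams. The function name `char_ngrams`, the boundary markers, and the n-gram range are illustrative assumptions, not the configuration used to train BioWordVec.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word` (hypothetical helper).

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of the word.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Using a narrow range (3 to 4) to keep the output short.
print(char_ngrams("gene", 3, 4))
# → ['<ge', 'gen', 'ene', 'ne>', '<gen', 'gene', 'ene>']
```

Because rare or unseen biomedical terms often share n-grams with known terms, a representation composed from n-grams can still be computed for words that never appeared as whole tokens in the training corpus.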
