BACKGROUND: The volume of biomedical literature is growing rapidly, and it is becoming increasingly difficult to keep manually curated knowledge bases and ontologies up to date. In this study we applied the word2vec deep learning toolkit to medical corpora to test its potential for identifying relationships from unstructured text. We evaluated the effectiveness of word2vec in identifying properties of pharmaceuticals based on mid-sized, unstructured medical text corpora available on the web. Properties included relationships to diseases ('may treat') or physiological processes ('has physiological effect'). We compared the relationships identified by word2vec with manually curated information from the National Drug File - Reference Terminology (NDF-RT) ontology as a gold standard.

RESULTS: Our results revealed a maximum accuracy of 49.28%, which suggests that word2vec has a limited ability to capture linguistic regularities on the collected medical corpora compared with other published results. We were able to document the influence of different parameter settings on result accuracy and found an unexpected trade-off between ranking quality and accuracy. Pre-processing corpora to reduce syntactic variability proved to be a good strategy for increasing the utility of the trained vector models.

CONCLUSIONS: Word2vec is a very efficient implementation for computing vector representations and can identify relationships in textual data without any prior domain knowledge. We found that the ranking and retrieved results generated by word2vec were not of sufficient quality for automatic population of knowledge bases and ontologies, but could serve as a starting point for further manual curation.
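
The comparison against a gold standard described above can be illustrated with a minimal sketch. It is not the study's actual pipeline: the three-dimensional vectors, the tiny 'may treat' table, and the function names (`cosine`, `rank_candidates`, `top_k_accuracy`) are all hypothetical stand-ins, and cosine-similarity ranking is assumed as the retrieval step. In practice the vectors would come from a trained word2vec model and the gold pairs from NDF-RT.

```python
import math

# Toy word vectors (assumption: hand-picked values, not trained embeddings).
TOY_VECTORS = {
    "metformin": [0.9, 0.1, 0.0],
    "ibuprofen": [0.1, 0.9, 0.1],
    "diabetes": [0.8, 0.2, 0.1],
    "pain": [0.2, 0.8, 0.0],
    "asthma": [0.0, 0.1, 0.9],
}

# Toy gold standard for the 'may treat' relation (assumption: stands in for NDF-RT).
TOY_GOLD = {"metformin": {"diabetes"}, "ibuprofen": {"pain"}}

# Candidate terms that a drug could be related to.
DISEASES = ["diabetes", "pain", "asthma"]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(vectors, query, candidates):
    """Return the candidates sorted by decreasing similarity to the query term."""
    q = vectors[query]
    scored = [
        (c, cosine(q, vectors[c]))
        for c in candidates
        if c in vectors and c != query
    ]
    return [c for c, _ in sorted(scored, key=lambda item: item[1], reverse=True)]


def top_k_accuracy(vectors, gold, candidates, k=1):
    """Fraction of drugs with at least one gold-standard term in their top k."""
    hits = 0
    total = 0
    for drug, expected in gold.items():
        if drug not in vectors:
            continue  # drugs missing from the vocabulary cannot be evaluated
        total += 1
        top = rank_candidates(vectors, drug, candidates)[:k]
        if any(term in expected for term in top):
            hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    print(rank_candidates(TOY_VECTORS, "metformin", DISEASES))
    print(top_k_accuracy(TOY_VECTORS, TOY_GOLD, DISEASES, k=1))
```

Raising `k` makes a hit more likely but says less about how well the correct term is ranked, which is one simple way to see how accuracy and ranking quality can pull in different directions.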