Retrofitting Word Embeddings with the UMLS Metathesaurus for Clinical Information Extraction

Deep learning has surged in popularity and proven to be effective for various artificial intelligence applications including information extraction from cancer pathology reports. Since word representation is a core unit that enables deep learning algorithms to understand words and be able to perform NLP, this representation must include as much information as possible to help these algorithms achieve high classification performance. Therefore, in this work in addition to the distributional information of words in large sized corpora, we use UMLS vocabulary resources to enrich the vector space representation of words with the semantic relations between words. These resources provide many terminologies pertaining to cancer. The refined word embeddings are used with a convolutional neural (CNN) model to extract four data elements from cancer pathology reports; ICD-O-3 tumor topography codes, tumor laterality, behavior, and histological grade. We observed that using UMLS vocabulary resources to enrich word embeddings of CNN models consistently outperformed CNN models without pre-training word embeddings and even with pre-trained word embeddings on a domain specific corpus across all four tasks. The results show marginal improvement on the laterality task, but a significant improvement on the other tasks, especially for the macro-f score. Specifically, the improvements are 3%, 13%, and 15% for tumor site, histological grade, and behavior tasks, respectively. This approach is encouraging to enrich word embeddings with more clinical data resources to be used for information abstraction tasks from clinical pathology reports.

[1]  Manuel R. Vargas,et al.  Deep learning for stock market prediction from financial news articles , 2017, 2017 IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA).

[2]  Pietro Liò,et al.  Learning Rare Word Representations using Semantic Bridging , 2017, ArXiv.

[3]  Hong-Jun Yoon,et al.  Multi-task Deep Neural Networks for Automated Extraction of Primary Site and Laterality Information from Cancer Pathology Reports , 2016, INNS Conference on Big Data.

[4]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[5]  Peter Szolovits,et al.  Automatic lymphoma classification with sentence subgraph mining from pathology reports. , 2014, Journal of the American Medical Informatics Association : JAMIA.

[6]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[7]  Mark Dredze,et al.  Improving Lexical Embeddings with Semantic Knowledge , 2014, ACL.

[8]  Goran Nenadic,et al.  Text mining of cancer-related information: Review of current status and future directions , 2014, Int. J. Medical Informatics.

[9]  Ye Zhang,et al.  A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification , 2015, IJCNLP.

[10]  Rico Sennrich,et al.  Neural Machine Translation of Rare Words with Subword Units , 2015, ACL.

[11]  Dilek Z. Hakkani-Tür,et al.  Enriching Word Embeddings Using Knowledge Graph for Semantic Tagging in Conversational Dialog Systems , 2015, AAAI Spring Symposia.

[12]  Robert Tibshirani,et al.  An Introduction to the Bootstrap , 1994 .

[13]  Ali Ghodsi,et al.  Improving the Accuracy of Pre-trained Word Embeddings for Sentiment Analysis , 2017, ArXiv.

[14]  Anthony N. Nguyen,et al.  Automatic ICD-10 classification of cancers from free-text death certificates , 2015, Int. J. Medical Informatics.

[15]  D. Lindberg,et al.  The Unified Medical Language System , 1993, Methods of Information in Medicine.

[16]  David Vandyke,et al.  Counter-fitting Word Vectors to Linguistic Constraints , 2016, NAACL.

[17]  James W. Cooper,et al.  Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model , 2009, J. Biomed. Informatics.

[18]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[19]  Xiang Zhang,et al.  Character-level Convolutional Networks for Text Classification , 2015, NIPS.

[20]  Bowen Zhou,et al.  Good, Better, Best: Choosing Word Embedding Context , 2015, ArXiv.

[21]  Hong-Jun Yoon,et al.  Deep Learning for Automated Extraction of Primary Sites From Cancer Pathology Reports , 2018, IEEE Journal of Biomedical and Health Informatics.

[22]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.