Mapping Unparalleled Clinical Professional and Consumer Languages with Embedding Alignment

Mapping and translating professional but arcane clinical jargon into consumer language is essential for improving patient-clinician communication. Researchers have used existing biomedical ontologies and consumer health vocabulary dictionaries to translate between the two languages. However, such approaches depend on expert effort to build the dictionaries manually, which makes them hard to generalize and scale. In this work, we used an embedding alignment method to map words between unparalleled clinical professional and consumer language embeddings. To map semantically similar words across the two embedding spaces, we first trained word embeddings independently on a corpus rich in clinical professional terms and on a corpus consisting mainly of healthcare consumer terms. We then aligned the embeddings with the Procrustes algorithm. We also investigated an approach based on adversarial training with refinement. We evaluated the quality of the alignment through similar-word retrieval, both by computing model precision and by qualitative human judgment. We show that the Procrustes algorithm can perform well for aligning professional and consumer language embeddings, whereas adversarial training with refinement may find some relations between the two languages.
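The Procrustes alignment step and the precision-based retrieval evaluation can be illustrated with a minimal sketch. It assumes NumPy, a set of seed word pairs whose vectors are stacked row by row, and synthetic vectors in place of trained embeddings; the function names `procrustes_align` and `precision_at_k` are hypothetical and are not taken from the paper.

```python
import numpy as np


def procrustes_align(src, tgt):
    """Return the orthogonal matrix W minimizing ||src @ W - tgt||_F.

    src, tgt: (n_pairs, dim) arrays; row i of each holds the vectors of an
    assumed seed pair (professional term, consumer term).
    """
    # Closed-form solution: SVD of the cross-covariance, singular values dropped.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def precision_at_k(mapped, tgt, gold, k=1):
    """Fraction of queries whose gold target index is among the k nearest
    target vectors by cosine similarity."""
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = mapped @ tgt.T                        # (n_queries, n_targets)
    topk = np.argsort(-sims, axis=1)[:, :k]      # indices of the k best targets
    hits = [g in row for g, row in zip(gold, topk)]
    return float(np.mean(hits))


# Synthetic stand-in data: the "consumer" space is a noisy rotation of the
# "professional" space (an assumption made only for this illustration).
rng = np.random.default_rng(0)
professional = rng.normal(size=(200, 16))
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
consumer = professional @ rotation + 0.01 * rng.normal(size=(200, 16))

# Fit on the first 100 pairs, evaluate retrieval on the held-out 100.
W = procrustes_align(professional[:100], consumer[:100])
print(precision_at_k(professional[100:] @ W, consumer, gold=range(100, 200), k=1))
```

Constraining the map to be orthogonal preserves distances and angles within the source space, so the neighborhood structure learned on the professional corpus is carried over unchanged into the consumer space.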
