Adapting Phrase-based Machine Translation to Normalise Medical Terms in Social Media Messages

Previous studies have shown that health reports in social media, such as DailyStrength and Twitter, have potential for monitoring health conditions (e.g. adverse drug reactions, infectious diseases) in particular communities. However, for a machine to understand and make inferences about these health conditions, it must be able to recognise when a layperson's term refers to a particular medical concept (i.e. text normalisation). To achieve this, we propose to adapt an existing phrase-based machine translation (MT) technique and a vector representation of words to map a social media phrase to a medical concept. We evaluate our proposed approach using a collection of phrases from tweets related to adverse drug reactions. Our experimental results show that combining a phrase-based MT technique with the similarity between word vector representations outperforms baselines that apply only one of the two by up to 55%.
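The similarity-based part of such a mapping can be illustrated with a minimal sketch. It is not the implementation evaluated here: the toy three-dimensional vectors, the averaging of word vectors into a phrase vector, and the names TOY_VECTORS, phrase_vector, cosine and rank_concepts are all assumptions made for illustration. The input phrase stands in for the output of an MT step, which is not shown.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# A real system would load vectors trained on a large corpus.
TOY_VECTORS = {
    "head": [0.9, 0.1, 0.0],
    "hurts": [0.1, 0.9, 0.0],
    "headache": [0.7, 0.7, 0.0],
    "nausea": [0.0, 0.1, 0.9],
    "sick": [0.0, 0.3, 0.8],
}


def phrase_vector(phrase):
    """Average the vectors of the known words; None if no word is known."""
    vecs = [TOY_VECTORS[w] for w in phrase.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_concepts(phrase, concepts):
    """Rank candidate concept names by similarity to the phrase, best first."""
    query = phrase_vector(phrase)
    scored = []
    for concept in concepts:
        cvec = phrase_vector(concept)
        score = cosine(query, cvec) if query and cvec else 0.0
        scored.append((concept, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # "head hurts" stands in for a (possibly MT-translated) social media phrase.
    for concept, score in rank_concepts("head hurts", ["nausea", "headache"]):
        print(f"{concept}\t{score:.3f}")
```

In this sketch an out-of-vocabulary word is simply skipped, and a phrase with no known words scores 0.0 against every concept; either choice could reasonably be made differently.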
