Evaluating Neural Text Simplification in the Medical Domain

Health literacy, i.e. the ability to read and understand medical text, is a relevant component of public health. Unfortunately, many medical texts are hard for the general population to grasp, as they are targeted at highly skilled professionals and use complex language and domain-specific terms. Automatic text simplification that makes such text commonly understandable would therefore be very beneficial. However, research and development in medical text simplification is hindered by the lack of openly available training and test corpora that contain complex medical sentences and their aligned simplified versions. In this paper, we introduce such a dataset to aid medical text simplification research. The dataset is created by filtering aligned health sentences from an existing aligned corpus using expert knowledge, and by applying a novel, simple, language-independent monolingual text alignment method. Furthermore, we use the dataset to train a state-of-the-art neural machine translation model and compare it to a model trained on a general simplification dataset, using both an automatic evaluation and an extensive human-expert evaluation.
