Are synthetic clinical notes useful for real natural language processing tasks: A case study on clinical entity recognition

OBJECTIVE: Developing clinical natural language processing systems often requires access to many clinical documents, which are not widely available to the public due to privacy and security concerns. To address this challenge, we propose to develop methods to generate synthetic clinical notes and evaluate their utility in real clinical natural language processing tasks. MATERIALS AND METHODS: We implemented 4 state-of-the-art text generation models, namely CharRNN, SeqGAN, GPT-2, and CTRL, to generate clinical text for the History and Present Illness section. We then manually annotated clinical entities in 500 randomly selected History and Present Illness notes generated by the best-performing algorithm. To compare the utility of natural and synthetic corpora, we trained named entity recognition (NER) models from all 3 corpora and evaluated their performance on 2 independent natural corpora. RESULTS: Our evaluation shows that GPT-2 achieved the best BLEU (bilingual evaluation understudy) score (a BLEU-2 of 0.92). NER models trained on the synthetic corpus generated by GPT-2 performed slightly better on the 2 independent corpora (strict F1 scores of 0.709 and 0.748, respectively) than NER models trained on the natural corpus (F1 scores of 0.706 and 0.737, respectively), indicating that synthetic corpora have good utility for clinical NER model development. In addition, we demonstrated that an augmented method combining the natural and synthetic corpora achieved better performance than one using the natural corpus only. CONCLUSIONS: Recent advances in text generation have made it possible to generate synthetic clinical notes that could be useful for training NER models for information extraction from natural clinical notes, thus lowering privacy concerns and increasing data availability. Further investigation is needed to apply this technology in practice.
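The strict F1 score reported above counts a predicted entity as correct only when both its boundaries and its type exactly match a gold annotation. The following is a minimal sketch of that metric, assuming BIO-tagged token sequences; the helper names `bio_to_spans` and `strict_f1` and the entity types `problem` and `test` are illustrative assumptions, not taken from the study.

```python
from typing import List, Set, Tuple

# (start_token, end_token_exclusive, entity_type); a hypothetical span format
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of typed entity spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-") or tag.startswith("I-"):
            # Assumption: a stray I- tag with no matching B- opens a new span.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def strict_f1(gold_tags: List[str], pred_tags: List[str]) -> float:
    """Exact-match F1: boundaries and type must both agree with gold."""
    gold = bio_to_spans(gold_tags)
    pred = bio_to_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the predicted "problem" span is one token too short, so only
# the "test" entity counts as a match under strict evaluation.
gold = ["B-problem", "I-problem", "O", "B-test"]
pred = ["B-problem", "O", "O", "B-test"]
score = strict_f1(gold, pred)
print(score)
```

In this toy case one of two predicted spans matches one of two gold spans, so precision, recall, and F1 are all 0.5. A relaxed variant would instead credit the partially overlapping `problem` span, which is why strict scores tend to be lower.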
