Active deep learning to detect demographic traits in free-form clinical notes

The free-form portions of clinical notes are a significant source of information for research, but before they can be used, they must be de-identified to protect patients' privacy. De-identification efforts have focused on known identifier types (names, ages, dates, addresses, IDs, etc.). However, a note can contain residual "Demographic Traits" (DTs) that are unique enough to re-identify the patient when combined with other such facts. Here we examine whether any residual risks remain after removing these identifiers. After manually annotating over 140,000 words of medical notes, we found no remaining directly identifying information and a low prevalence of demographic traits, such as marital status or housing type. We developed an annotation guide for the discovered DTs and used it to label MIMIC-III and i2b2-2006 clinical notes as test sets. We then designed a "bootstrapped" iterative active learning process for identifying DTs: we tentatively labeled as positive all sentences in the DT-rich note sections, used these to train a binary classifier, manually corrected acute errors, and retrained the classifier. This train-and-correct process may be iterated. Our active learning process significantly improved the classifier's accuracy. Moreover, our BERT-based model outperformed non-neural models when trained on both tentatively labeled data and manually relabeled examples. To facilitate future research and benchmarking, we also produced and made publicly available our human-annotated DT-tagged datasets. We conclude that directly identifying information is virtually non-existent in the multiple medical note types we investigated. Demographic traits are present in medical notes, but can be detected with high accuracy using a cost-effective human-in-the-loop active learning process, and redacted if desired.
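The bootstrapped train-and-correct loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a tiny naive Bayes bag-of-words classifier stands in for the BERT-based model, "acute errors" are assumed to be the sentences where the classifier disagrees most strongly with the current label, and the names `tentative_labels`, `bootstrap_round`, `oracle`, and `budget`, as well as the example sentences, are hypothetical.

```python
import math
from collections import Counter

def tokenize(sentence):
    return sentence.lower().split()

def tentative_labels(notes, rich_sections):
    """Label every sentence from a DT-rich section as positive (1), else 0."""
    return [1 if section in rich_sections else 0 for section, _ in notes]

def train(sentences, labels):
    """Fit a naive Bayes bag-of-words model (a stand-in for the BERT classifier)."""
    counts = {0: Counter(), 1: Counter()}
    for sentence, y in zip(sentences, labels):
        counts[y].update(tokenize(sentence))
    vocab = set(counts[0]) | set(counts[1])
    return counts, Counter(labels), vocab

def prob_positive(model, sentence):
    """Return the model's probability that a sentence contains a DT."""
    counts, n, vocab = model
    log_p = {}
    for y in (0, 1):
        denom = sum(counts[y].values()) + len(vocab)  # Laplace smoothing
        lp = math.log((n[y] + 1) / (n[0] + n[1] + 2))
        for tok in tokenize(sentence):
            if tok in vocab:
                lp += math.log((counts[y][tok] + 1) / denom)
        log_p[y] = lp
    top = max(log_p.values())
    e0, e1 = math.exp(log_p[0] - top), math.exp(log_p[1] - top)
    return e1 / (e0 + e1)

def bootstrap_round(sentences, labels, oracle, budget):
    """One train-and-correct iteration: train, relabel the worst disagreements, retrain."""
    model = train(sentences, labels)
    # Assumption: the largest gap between label and prediction marks an "acute error".
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -abs(labels[i] - prob_positive(model, sentences[i])))
    corrected = list(labels)
    for i in ranked[:budget]:
        corrected[i] = oracle(sentences[i])  # manual correction by a human annotator
    return train(sentences, corrected), corrected

# Hypothetical toy data: (section name, sentence) pairs.
notes = [
    ("social history", "lives alone in an apartment"),
    ("social history", "married with two children"),
    ("social history", "denies chest pain"),              # in a rich section, not a DT
    ("medications", "aspirin 81 mg daily"),
    ("hospital course", "patient is a retired teacher"),  # DT outside a rich section
    ("hospital course", "started on heparin drip"),
]
gold = dict(zip((s for _, s in notes), [1, 1, 0, 0, 1, 0]))  # simulated annotator

sentences = [s for _, s in notes]
labels = tentative_labels(notes, rich_sections={"social history"})
model, labels = bootstrap_round(sentences, labels, oracle=gold.__getitem__, budget=2)
```

Calling `bootstrap_round` repeatedly with the returned labels corresponds to iterating the train-and-correct process; `budget` caps how many sentences the annotator reviews per round, which is where the cost saving over full manual annotation would come from.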
