Deep learning models are not robust against noise in clinical text

Artificial Intelligence (AI) systems are attracting increasing interest in the medical domain due to their ability to learn complicated tasks that require human intelligence and expert knowledge. AI systems that utilize high-performance Natural Language Processing (NLP) models have achieved state-of-the-art results on a wide variety of clinical text processing benchmarks, and have even outperformed human accuracy on some tasks. However, performance evaluation of such AI systems has been limited to accuracy measures on curated, clean benchmark datasets that may not properly reflect how robustly these systems can operate in real-world situations. To address this challenge, we introduce and implement a wide variety of perturbation methods that simulate different types of noise and variability in clinical text data. While the noisy samples produced by these perturbation methods can usually still be understood by humans, they may cause AI systems to make erroneous decisions. Conducting extensive experiments on several clinical text processing tasks, we evaluated the robustness of high-performance NLP models against various types of character-level and word-level noise. The results revealed that the models' performance degrades when the input contains even small amounts of noise. This study is a significant step towards exposing vulnerabilities of AI models used in clinical text processing systems. The proposed perturbation methods can be used in performance evaluation tests to assess how robustly clinical NLP models can operate on noisy data in real-world settings.
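The abstract does not specify the exact perturbation procedures; as an illustration only, a minimal sketch of the kind of character-level and word-level noise described above might look like the following. The function names, noise rates, and the specific operations (adjacent-character swaps, random word deletion) are assumptions for illustration, not the authors' implementation.

```python
import random

def char_level_noise(text: str, rate: float = 0.05) -> str:
    """Swap adjacent alphabetic characters with probability `rate`
    to simulate typographical noise (rate is an assumed parameter)."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def word_level_noise(text: str, rate: float = 0.1) -> str:
    """Drop each word with probability `rate` to simulate
    word-level variability (rate is an assumed parameter)."""
    words = text.split()
    kept = [w for w in words if random.random() >= rate]
    return " ".join(kept) if kept else text

if __name__ == "__main__":
    note = "Patient denies chest pain but reports shortness of breath on exertion."
    print(char_level_noise(note))   # e.g. "Paitent denies chest pain ..."
    print(word_level_noise(note))   # e.g. "Patient denies pain but reports ..."
```

Perturbed notes like these remain readable to a clinician, which is the property the study relies on when attributing model errors to lack of robustness rather than to genuinely ambiguous input.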
