Substituting clinical features using synthetic medical phrases: Medical text data augmentation techniques

Biomedical natural language processing (NLP) plays an important role in extracting clinically relevant information from medical discharge notes. Detecting meaningful features in unstructured notes is a challenging part of medical document classification, because domain-specific phrases and multiple synonyms for the same concept make the documents hard to analyze. The task is harder still for short documents such as abstracts. These factors can lead to poor classification performance, especially when clinical data are scarce, as they often are in practice. Two new approaches are proposed for augmenting medical data to enrich the training set: an ontology-guided approach, and an approach that combines ontology-based and dictionary-based methods. Three different deep learning approaches are used to evaluate the classification performance of the proposed methods. The results show that the proposed methods improve accuracy in clinical note classification.
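As a rough illustration of the general idea of substituting clinical phrases with synonyms to create extra training documents, the following minimal sketch may help. It is not the method described here: the synonym table `SYNONYMS`, the function name `augment`, and its parameters are hypothetical, and a real system would presumably draw its synonyms from a medical ontology or dictionary rather than a hand-written table.

```python
import random

# Hypothetical toy synonym table (an assumption for illustration only).
# A real system would look these up in a medical ontology or dictionary.
SYNONYMS = {
    "heart attack": ["myocardial infarction", "cardiac infarction"],
    "high blood pressure": ["hypertension"],
    "shortness of breath": ["dyspnea"],
}


def augment(note, label, synonyms=SYNONYMS, n_variants=2, seed=0):
    """Return n_variants (text, label) pairs built by phrase substitution.

    Each known phrase found in the note is replaced by a randomly chosen
    synonym. The class label is copied unchanged, on the assumption that
    swapping a phrase for a synonym does not change the document's class.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    variants = []
    for _ in range(n_variants):
        text = note
        for phrase, alternatives in synonyms.items():
            if phrase in text:
                text = text.replace(phrase, rng.choice(alternatives))
        variants.append((text, label))
    return variants


if __name__ == "__main__":
    original = "patient reports heart attack and high blood pressure"
    for text, label in augment(original, label="CAD"):
        print(label, "|", text)
```

The synthetic notes would be added to the training set alongside the originals. Matching here is exact and case-sensitive, which is a simplification; clinical text would likely need normalization and concept mapping first.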
