Predicting Multiple ICD-10 Codes from Brazilian-Portuguese Clinical Notes

ICD coding from electronic clinical records is a manual, time-consuming and expensive process. Code assignment is nonetheless an important task for billing and for database organization. While many works have studied automated ICD coding from free text using machine learning techniques, most use records in English, especially from the public MIMIC-III dataset. This work presents results for a dataset of Brazilian Portuguese clinical notes. We develop and optimize a Logistic Regression model, a Convolutional Neural Network (CNN), a Gated Recurrent Unit Neural Network and a CNN with Attention (CNN-Att) for the prediction of diagnosis ICD codes. We also report our results on the MIMIC-III dataset, which outperform previous work among models of the same families, as well as the state of the art. Compared to MIMIC-III, the Brazilian Portuguese dataset contains far fewer words per document when only discharge summaries are used. We experiment with concatenating the additional documents available in this dataset, which substantially improves performance. The CNN-Att model achieves the best results on both datasets, with a micro-averaged F1 score of 0.537 on MIMIC-III and 0.485 on our dataset with additional documents.
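The micro-averaged F1 score reported above pools true positives, false positives and false negatives over all documents and codes before computing a single F1 value. The following is a minimal sketch of that metric for multi-label code assignment, not the evaluation code used in this work. The function name `micro_f1` and the example ICD-10 codes are illustrative assumptions.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-document sets of codes.

    Counts are summed across all documents and codes first,
    then a single F1 value is computed from the pooled counts.
    """
    tp = fp = fn = 0
    for gold_codes, pred_codes in zip(gold, pred):
        tp += len(gold_codes & pred_codes)   # codes predicted and correct
        fp += len(pred_codes - gold_codes)   # codes predicted but wrong
        fn += len(gold_codes - pred_codes)   # codes missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical example: two notes with illustrative ICD-10 codes.
gold = [{"I10", "E11"}, {"J18"}]
pred = [{"I10"}, {"J18", "I10"}]
score = micro_f1(gold, pred)  # pooled counts: tp=2, fp=1, fn=1
```

Because counts are pooled, frequent codes dominate the micro-averaged score, whereas a macro average would weight every code equally.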
