Predicting the encoding of secondary diagnoses. An experience based on decision trees

To measure medical activity, hospitals are required to manually encode the diagnoses associated with each inpatient episode using the International Classification of Diseases (ICD-10). This task is time-consuming and requires substantial staff training. In this paper, we propose an approach to speed up and facilitate this tedious manual coding of patient information, especially for secondary diagnoses that are not well described in medical resources such as discharge letters and medical records. Our approach leverages data mining techniques, specifically decision trees, to explore medical databases that encode such diagnostic knowledge. It uses the stored structured information (age, gender, diagnoses count, medical procedures, etc.) to build a decision tree that assigns the appropriate secondary diagnosis code to the corresponding inpatient episode. We evaluated our approach on the PMSI database using fine and coarse levels of diagnosis granularity. Three types of experiments were performed using different techniques to balance the datasets. The results show a significant variation in evaluation scores between the techniques for the same studied diagnoses. We highlight the efficiency of the random sampling techniques regardless of the type of diagnosis and the type of measure (F1-measure, recall and precision).

RÉSUMÉ. To measure medical activity, hospitals are required to manually code information about inpatient stays using the International Classification of Diseases (ICD-10; CIM-10 in French). This task is time-consuming and requires substantial staff training, particularly for coding associated (secondary) diagnoses. To assist hospital staff in this task, we propose an approach based on data mining techniques, more precisely decision trees, that predicts the coding of associated diagnoses. The decision trees are built from the structured data of the PMSI database (age, sex, number of diagnoses, medical procedures, etc.). These decision trees can easily be used by someone who is not a computer specialist, such as a physician. Two levels of diagnosis granularity were used, depending on whether the diagnosis is represented very precisely (fine level of granularity) or only more general information is kept (coarser level of granularity), corresponding to diagnosis categories. Three types of experiments were carried out using different dataset balancing techniques. The results indicate a significant variation in evaluation scores between the techniques for the same studied diagnoses. We highlight the efficiency of the "random sampling" techniques regardless of the type of diagnosis and the type of measure (F1-measure, recall and precision). Our results also show the efficiency of using the fine level of diagnosis granularity regardless of the diagnosis studied.

Ingénierie des systèmes d’information (ISI), Volume 22, n° 2/2017, 69-94
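
The abstract names three ingredients without giving their details: a coarse versus fine diagnosis granularity, dataset balancing by random sampling, and evaluation with precision, recall and F1-measure. The following Python sketch illustrates one plausible reading of each. It is not the authors' implementation: the function names (`to_category`, `random_undersample`, `precision_recall_f1`) are hypothetical, under-sampling is only one of the possible random sampling variants, and the coarse level is assumed here to be the three-character ICD-10 category.

```python
import random
from collections import defaultdict


def to_category(icd10_code):
    """Coarse granularity (assumption): keep the three-character ICD-10
    category, e.g. 'E11.9' or 'E119' -> 'E11'."""
    return icd10_code.replace(".", "")[:3].upper()


def random_undersample(records, label_of, seed=0):
    """Randomly drop records from the larger classes until every class
    is as small as the smallest one (one possible balancing variant)."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[label_of(rec)].append(rec)
    if not by_label:
        return []
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


def precision_recall_f1(y_true, y_pred, positive=True):
    """Precision, recall and F1-measure for one target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy episodes (invented data): 3 carry the studied secondary diagnosis.
episodes = [{"age": 40 + i, "has_dx": i < 3} for i in range(20)]
balanced = random_undersample(episodes, lambda e: e["has_dx"])
```

On the toy data, `balanced` holds three episodes with the studied diagnosis and three without, which is the kind of balanced set on which a decision tree could then be trained. Scores should still be computed on a held-out set that keeps the original class distribution.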
