ECSTRA-INSERM @ CLEF eHealth2016-task 2: ICD10 Code Extraction from Death Certificates

This paper describes the participation of the ECSTRA-INSERM team in CLEF eHealth 2016, task 2.C. The task involves extracting ICD10 codes from death certificates, which consist mainly of short plain-text descriptions. We cast the task as a machine learning problem: predicting ICD10 codes (a categorical variable) from raw text transformed into a bag-of-words matrix. We rely on probabilistic topic models, which we evaluate against classical classifiers such as SVM and Naive Bayes. We demonstrate the effectiveness of topic models for this task in terms of both prediction accuracy and interpretability of the results.
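As an illustration of the problem setup only, the following minimal sketch turns a few text lines into a bag-of-words matrix and fits a multinomial Naive Bayes classifier, one of the classical baselines named above. It is not the team's system and does not implement the topic models. The training pairs are made up for the example, and the whitespace tokenizer, the smoothing value `alpha`, and all function names are assumptions.

```python
import math
from collections import Counter

# Hypothetical toy pairs (certificate text line, ICD10 code).
# These are invented for illustration and are not taken from the task data.
TRAIN = [
    ("acute myocardial infarction", "I21.9"),
    ("myocardial infarction with cardiac arrest", "I21.9"),
    ("bacterial pneumonia", "J18.9"),
    ("severe pneumonia respiratory failure", "J18.9"),
    ("lung cancer", "C34.9"),
    ("metastatic lung cancer", "C34.9"),
]


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def bag_of_words(texts):
    """Return the sorted vocabulary and a documents x terms count matrix."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    index = {tok: j for j, tok in enumerate(vocab)}
    matrix = []
    for text in texts:
        row = [0] * len(vocab)
        for tok in tokenize(text):
            row[index[tok]] += 1
        matrix.append(row)
    return vocab, matrix


def train_naive_bayes(matrix, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with additive (Laplace) smoothing."""
    n_docs, n_terms = len(labels), len(matrix[0])
    class_counts = Counter(labels)
    term_counts = {code: [0] * n_terms for code in class_counts}
    for row, code in zip(matrix, labels):
        for j, count in enumerate(row):
            term_counts[code][j] += count
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    log_likelihood = {}
    for code, counts in term_counts.items():
        total = sum(counts) + alpha * n_terms
        log_likelihood[code] = [math.log((v + alpha) / total) for v in counts]
    return log_prior, log_likelihood


def predict(text, vocab, model):
    """Return the code with the highest posterior log-score for one text line."""
    log_prior, log_likelihood = model
    index = {tok: j for j, tok in enumerate(vocab)}
    scores = {}
    for code, prior in log_prior.items():
        score = prior
        for tok in tokenize(text):
            j = index.get(tok)
            if j is not None:  # tokens outside the vocabulary are ignored
                score += log_likelihood[code][j]
        scores[code] = score
    return max(scores, key=scores.get)


texts, codes = zip(*TRAIN)
vocab, X = bag_of_words(texts)
model = train_naive_bayes(X, list(codes))
print(predict("pneumonia", vocab, model))
```

On this toy data the classifier assigns the code whose training lines share the most words with the input. A real system would need a far larger vocabulary, proper text normalization, and a way to handle the many rare codes.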
