Instance-based Learning for ICD10 Categorization

In the framework of the CLEF 2018 eHealth campaign, we investigated an instance-based approach for extracting ICD10 codes from death certificates. The 360,000 annotated sentences contained in the training data were indexed with a standard search engine. Then, the k-Nearest Neighbors (k-NN) retrieved for an input sentence were used to infer candidate codes by majority voting. Compared to a standard dictionary-based approach, this simple and robust k-NN algorithm achieved remarkably good performance (F-measure 0.79, +13% compared to our dictionary-based approach, +70% compared to the official baseline). This purely statistical approach uses no linguistic knowledge and could, a priori, be applied to any language with similar performance levels. Combining the k-NN with a dictionary-based approach is also a simple way to improve the categorization effectiveness of the system. The reported results are consistent with the inter-rater agreement (79-80%) achieved by trained professional staff for diagnosis encoding. Any significant improvement beyond this level should therefore be questioned.
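The retrieve-then-vote idea described above can be illustrated with a minimal sketch. It is not the system's implementation: the search engine is replaced here by a simple Jaccard token overlap, and the names `knn_codes`, `TRAINING`, `k` and `min_votes`, as well as the toy sentences and their codes, are hypothetical choices made only for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization; no linguistic processing."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-set overlap, standing in for a search engine's ranking score."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_codes(query, training, k=3, min_votes=2):
    """Return the codes voted for by at least `min_votes` of the k nearest sentences."""
    q = tokenize(query)
    # Score every annotated sentence and drop those sharing no token with the query.
    scored = [(jaccard(q, tokenize(text)), codes) for text, codes in training]
    scored = [pair for pair in scored if pair[0] > 0]
    # Keep the k most similar sentences (the "nearest neighbours").
    neighbours = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Each neighbour casts one vote per distinct code it carries.
    votes = Counter(code for _, codes in neighbours for code in set(codes))
    return sorted(code for code, n in votes.items() if n >= min_votes)


# Hypothetical toy training set: (sentence, annotated codes).
TRAINING = [
    ("infarctus du myocarde", ["I219"]),
    ("infarctus aigu du myocarde", ["I219"]),
    ("arrêt cardiaque", ["I469"]),
    ("insuffisance cardiaque", ["I509"]),
    ("infarctus cérébral", ["I639"]),
]

if __name__ == "__main__":
    print(knn_codes("infarctus myocarde", TRAINING))
```

In this toy setting, two of the three nearest sentences carry the same code, so that code passes the vote threshold while the code of the third neighbour does not. A real setup would tune `k` and the vote threshold on held-out data.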
