A Dataset for ICD-10 Coding of Death Certificates: Creation and Usage

Very few datasets have been released for evaluating diagnosis coding with the International Classification of Diseases (ICD), and only one so far in a language other than English. This paper describes a large-scale dataset built from French death certificates and the problems that had to be solved to make it suitable for applying machine learning and natural language processing methods to ICD-10 coding. The dataset includes the free-text statements written by medical doctors, the associated metadata, the codes assigned to each statement by human coders, and the statement segments that supported the coder's decision for each code. It comprises 93,694 death certificates totalling 276,103 statements and 377,677 ICD-10 code assignments (3,457 unique codes). It was made available for an international shared task on automated coding, which attracted five participating teams. An extended version of the dataset will be used in a new edition of the shared task.
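
As a rough illustration of the structure described above, the following Python sketch models one certificate statement together with its coder-assigned codes and supporting segments, and derives average densities from the reported totals. The class names, field names, and the sample statement are assumptions made for illustration only; they are not the dataset's actual schema or content.

```python
from dataclasses import dataclass, field

@dataclass
class CodeAssignment:
    icd10_code: str        # ICD-10 code assigned by the human coder
    supporting_text: str   # statement segment that supported the decision

@dataclass
class Statement:
    certificate_id: str    # hypothetical identifier, not the real format
    line_number: int       # position of the statement on the certificate
    raw_text: str          # free text written by the medical doctor
    metadata: dict = field(default_factory=dict)
    codes: list = field(default_factory=list)

# Totals as reported in the abstract.
N_CERTIFICATES = 93_694
N_STATEMENTS = 276_103
N_CODE_ASSIGNMENTS = 377_677

def average_densities() -> dict:
    """Derive simple averages from the reported totals."""
    return {
        "statements_per_certificate": N_STATEMENTS / N_CERTIFICATES,
        "codes_per_statement": N_CODE_ASSIGNMENTS / N_STATEMENTS,
        "codes_per_certificate": N_CODE_ASSIGNMENTS / N_CERTIFICATES,
    }

# Invented example record (not taken from the dataset).
example = Statement(
    certificate_id="cert-0001",
    line_number=1,
    raw_text="accident vasculaire cérébral",
    metadata={"age": 82, "gender": "F"},
    codes=[CodeAssignment("I64", "accident vasculaire cérébral")],
)

if __name__ == "__main__":
    for name, value in average_densities().items():
        print(f"{name}: {value:.2f}")
```

Under these totals, a certificate carries about three statements on average, and a statement usually maps to one code and sometimes to more, which is why the task is naturally framed as multi-label classification at the statement level.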
