Automatic ICD-10 classification of cancers from free-text death certificates

OBJECTIVE Death certificates are an invaluable source of cancer mortality statistics; however, this value can only be realised if accurate, quantitative data can be extracted from them, an aim hampered by both the volume and the variable nature of certificates written in natural language. This paper proposes an automatic classification system for identifying cancer-related causes of death from death certificates.

METHODS Detailed features, including terms, n-grams and SNOMED CT concepts, were extracted from a collection of 447,336 death certificates. These features were used to train Support Vector Machine classifiers (one classifier for each cancer type). The classifiers were deployed in a cascaded architecture: the first level identified the presence of cancer (i.e., a binary cancer/no-cancer decision) and the second level identified the type of cancer (according to the ICD-10 classification system). A held-out test set was used to evaluate the effectiveness of the classifiers according to precision, recall and F-measure. In addition, detailed feature analysis was performed to reveal the characteristics of a successful cancer classification model.

RESULTS The system was highly effective at identifying cancer as the underlying cause of death (F-measure 0.94). The system was also effective at determining the type of cancer for common cancers (F-measure 0.7). Rare cancers, for which there was little training data, were difficult to classify accurately (F-measure 0.12). Factors influencing performance were the amount of training data and the presence of certain ambiguous cancers (e.g., those in the stomach region). The feature analysis revealed that a combination of features was important for cancer type classification, with SNOMED CT concept and oncology-specific morphology features proving the most valuable.

CONCLUSION The system proposed in this study provides automatic identification and characterisation of cancers from large collections of free-text death certificates. This allows organisations such as Cancer Registries to monitor and report on cancer mortality in a timely and accurate manner. In addition, the methods and findings are generally applicable beyond cancer classification and to other sources of medical text besides death certificates.
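
The cascaded architecture and the evaluation measures described in METHODS can be illustrated with a minimal Python sketch. It is not the study's implementation: the trained Support Vector Machines are replaced here by hypothetical keyword scorers, and the names `keyword_scorer`, `cascade_predict`, `precision_recall_f` and the `UNSPECIFIED` fallback label are assumptions made for illustration. Only the two-level control flow and the standard precision, recall and F-measure definitions are meant to carry over.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A scorer maps certificate text to a decision score; positive means "yes".
# ASSUMPTION: in the study these would be trained SVMs over terms, n-grams
# and SNOMED CT concepts. Keyword counting is only a stand-in.
Scorer = Callable[[str], float]


def keyword_scorer(keywords: List[str]) -> Scorer:
    """Toy scorer: keyword hits minus 0.5, so zero hits gives a negative score."""
    def score(text: str) -> float:
        lowered = text.lower()
        return sum(kw in lowered for kw in keywords) - 0.5
    return score


def cascade_predict(text: str, is_cancer: Scorer,
                    type_scorers: Dict[str, Scorer]) -> Optional[str]:
    """Level 1 decides cancer / no cancer; level 2 picks an ICD-10 code."""
    if is_cancer(text) <= 0:
        return None  # not a cancer death; level 2 is never consulted
    # One scorer per cancer type; keep the most confident one.
    best_code, best_score = max(
        ((code, scorer(text)) for code, scorer in type_scorers.items()),
        key=lambda pair: pair[1],
    )
    # ASSUMPTION: a hypothetical fallback label when no type scorer fires.
    return best_code if best_score > 0 else "UNSPECIFIED"


def precision_recall_f(gold: List[Optional[str]], pred: List[Optional[str]],
                       label: str) -> Tuple[float, float, float]:
    """Precision, recall and F-measure (harmonic mean) for one label."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f


# Illustrative level-1 and level-2 scorers (keyword lists are made up).
level1 = keyword_scorer(["carcinoma", "cancer", "malignant", "neoplasm"])
level2 = {
    "C34": keyword_scorer(["lung", "bronch"]),      # bronchus and lung
    "C50": keyword_scorer(["breast"]),              # breast
    "C16": keyword_scorer(["stomach", "gastric"]),  # stomach
}
```

Calling `cascade_predict("Metastatic carcinoma of the lung", level1, level2)` passes level 1 and is assigned `C34` at level 2, whereas a certificate with no cancer terms is rejected at level 1 and never reaches the type scorers. Per-type F-measures such as those reported in RESULTS would be obtained by calling `precision_recall_f` once per ICD-10 code on a held-out test set.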
