Automatic Extraction of ICD-O-3 Primary Sites from Cancer Pathology Reports

Although registry-specific requirements exist, cancer registries primarily identify reportable cases using combinations of ICD-O-3 topography and morphology codes assigned to cancer case abstracts, of which free-text pathology reports are a main component. These codes are generally extracted from pathology reports by trained human coders, sometimes with the help of software. Here we present results that improve on the state of the art in automatic extraction of 57 generic sites from pathology reports, using three representative machine learning algorithms for text classification. We use a dataset of 56,426 reports from 35 labs that report to the Kentucky Cancer Registry. With unigrams, bigrams, and named entities as features, our methods achieve a class-based micro F-score of 0.9 and a macro F-score of 0.72. To our knowledge, this is the best result reported for extracting ICD-O-3 codes from pathology reports over a large number of possible codes. Because our dataset is large compared with those of similar efforts and spans 35 different labs, we also expect our final models to generalize better when extracting primary sites from previously unseen reports.
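The gap between the micro and macro F-scores reported above is typical when a few sites dominate the data. The following minimal sketch illustrates the two averages under the usual definitions (micro pools the counts over all classes; macro averages the per-class F-scores with equal weight). The abstract does not spell out the exact formulas used, so these definitions are an assumption, and the function name `f_scores` and the toy labels are purely illustrative, not taken from the study.

```python
from collections import Counter


def f_scores(gold, predicted):
    """Return (micro_f, macro_f) for single-label multiclass predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    classes = set(gold) | set(predicted)
    # Micro: pool the counts over all classes, then compute one F-score.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute an F-score per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro


# Toy labels (illustrative only): one frequent site and two rare ones.
gold = ["C50"] * 8 + ["C34"] + ["C18"]
pred = ["C50"] * 8 + ["C50"] + ["C18"]   # the single C34 report is misclassified
micro, macro = f_scores(gold, pred)
print(f"micro F = {micro:.3f}, macro F = {macro:.3f}")
```

In this toy case a single error on a rare class barely moves the micro average but pulls the macro average down sharply, which is one plausible reading of why a macro F-score can sit well below the micro F-score when there are many infrequent sites.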
