Automatic lymphoma classification with sentence subgraph mining from pathology reports.

OBJECTIVE Pathology reports are rich in narrative statements that encode a complex web of relations among medical concepts. Doctors routinely use these relations to reason about diagnoses, but extracting them into prespecified forms for computational disease modeling often requires hand-crafted rules or supervised learning. We aim to capture relations from narrative text automatically and without supervision.

METHODS We design a novel framework that translates sentences into graph representations, automatically mines sentence subgraphs, reduces redundancy in the mined subgraphs, and automatically generates subgraph features for subsequent classification tasks. To ensure meaningful interpretations over the sentence graphs, we use the Unified Medical Language System Metathesaurus to map token subsequences to concepts and, in turn, to sentence graph nodes. We test our system on multiple lymphoma classification tasks that together mimic the differential diagnosis performed by a pathologist. To this end, we prevent our classifiers from looking at explicit mentions or synonyms of lymphomas in the text.

RESULTS AND CONCLUSIONS We compare our system with three baseline classifiers that use standard n-grams, full MetaMap concepts, and filtered MetaMap concepts, respectively. Our system achieves high F-measures on multiple binary classifications of lymphoma (Burkitt lymphoma, 0.8; diffuse large B-cell lymphoma, 0.909; follicular lymphoma, 0.84; Hodgkin lymphoma, 0.912). Significance tests show that our system outperforms all three baselines. Moreover, feature analysis identifies subgraph features that contribute to the improved performance; these features agree with current knowledge about lymphoma classification. We also highlight how these unsupervised relation features may provide meaningful insights into lymphoma classification.
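The subgraph-feature idea described in the methods can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify the mining algorithm, so the sketch assumes a deliberately simplified miner that only enumerates single edges and connected pairs of edges, and it matches subgraphs by node label rather than by graph isomorphism. The function names (`small_subgraphs`, `mine_frequent`, `featurize`), the edge triples, and the support threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def small_subgraphs(graph):
    """Return single edges and connected edge pairs of one sentence graph.

    A graph is a set of (head, relation, dependent) triples whose node
    labels stand in for mapped concepts (illustrative values only).
    """
    edges = sorted(graph)
    subs = {frozenset([e]) for e in edges}
    for a, b in combinations(edges, 2):
        # Two edges are connected when they share at least one node label.
        if {a[0], a[2]} & {b[0], b[2]}:
            subs.add(frozenset([a, b]))
    return subs


def mine_frequent(graphs, min_support):
    """Keep subgraphs that occur in at least min_support sentence graphs."""
    counts = Counter()
    for g in graphs:
        counts.update(small_subgraphs(g))  # each graph counts once per pattern
    frequent = [p for p, c in counts.items() if c >= min_support]
    # Deterministic order: smaller patterns first, then by edge content.
    return sorted(frequent, key=lambda p: (len(p), sorted(p)))


def featurize(graph, patterns):
    """Binary feature vector: 1 if the pattern is contained in the graph."""
    return [1 if p <= graph else 0 for p in patterns]


if __name__ == "__main__":
    # Made-up sentence graphs, not data from the paper.
    graphs = [
        {("cells", "amod", "large"), ("cells", "express", "CD20")},
        {("cells", "amod", "large"), ("cells", "express", "CD20"),
         ("cells", "express", "BCL6")},
        {("cells", "amod", "small"), ("cells", "express", "CD10")},
    ]
    patterns = mine_frequent(graphs, min_support=2)
    for g in graphs:
        print(featurize(g, patterns))
```

Each resulting vector could then be passed to any standard binary classifier. A real system would need a proper frequent subgraph miner and a redundancy-reduction step, both of which this sketch omits.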
