Data-driven Sublanguage Analysis for Cancer Genomics Knowledge Modeling: Applications in Mining Oncological Genetics Information from Patients' Genetic Reports.

Clinical genetic testing reports contain abundant information, yet that information is often poorly documented or underused in decision making. Unstructured information in genetic reports can contribute to long-term patient management and future translational research. We therefore proposed a knowledge model that manages unstructured information in medical genetic reports and facilitates knowledge extraction, curation, and updating. For this pilot study, we used a dataset of 1,565 cancer genetics reports from Mayo Clinic patients. We applied a previously developed, data-driven discovery pipeline that combines semantic annotation with co-occurrence association analysis to establish the knowledge model. We showed that, compared with genetic reports, around 56% of testing results are missing or incomplete in the clinical notes. We built a genetic report knowledge model and highlighted four key semantic groups, including "Genes and Gene Products" and "Treatments". Coverage of term annotation was 99.5%. The accuracies of term annotation and relationship extraction were 98.9% and 92.9%, respectively.
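
As a rough illustration of what co-occurrence association analysis over semantically annotated text can look like, the sketch below counts how often pairs of semantic groups appear in the same sentence. It is not the pipeline used in the study: the input data, the function name `cooccurrence_counts`, and the "Disorders" label are assumptions made for the example, and it presumes an upstream annotator has already mapped each sentence to a set of semantic groups.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy input: each sentence is already reduced to the set of
# semantic groups an annotator found in it (labels are illustrative only).
annotated_sentences = [
    {"Genes and Gene Products", "Treatments"},
    {"Genes and Gene Products", "Disorders"},
    {"Genes and Gene Products", "Treatments", "Disorders"},
]


def cooccurrence_counts(sentences):
    """Count how often each unordered pair of semantic groups shares a sentence."""
    counts = Counter()
    for groups in sentences:
        # Sorting gives every pair a single canonical key order.
        for pair in combinations(sorted(groups), 2):
            counts[pair] += 1
    return counts


pair_counts = cooccurrence_counts(annotated_sentences)
for pair, n in pair_counts.most_common():
    print(pair, n)
```

Frequently co-occurring pairs are candidates for relationships in a knowledge model; a real analysis would typically normalize these raw counts with an association measure before interpreting them.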
