Applying MetaMap to Medline for identifying novel associations in a large clinical dataset: a feasibility analysis

OBJECTIVE We describe experiments designed to determine whether known associations can be distinguished from novel ones in a clinical dataset consisting of International Classification of Diseases, Ninth Revision (ICD-9) codes from 1.6 million patients. The clinical associations were compared with associations of ICD-9 codes derived from 20.5 million Medline citations processed using MetaMap. Associations that appear in the clinical dataset but not in Medline citations are potentially novel.

METHODS Pairwise associations of ICD-9 codes were identified independently in the clinical and Medline datasets, and the two sets were then compared to quantify their degree of overlap. We also manually reviewed a subset of the associations to assess how well MetaMap identified the diagnoses mentioned in the Medline citations on which the Medline associations were based.

RESULTS The overlap of ICD-9 code associations between the clinical and Medline datasets was low: only 6.6% of the 3.1 million associations found in the clinical dataset were also present in the Medline dataset. Further, a manual review of a subset of the associations that appeared in both datasets revealed that co-occurring diagnoses in Medline citations do not always represent clinically meaningful associations.

DISCUSSION Identifying novel associations derived from large clinical datasets remains challenging. Medline as the sole source of existing knowledge may not be adequate for filtering out widely known associations.

CONCLUSIONS In this study, novel associations were not readily identified. Tools such as MetaMap need further improvements in accuracy and relevance to realize their expected utility.
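The pairwise comparison described under METHODS can be illustrated with a minimal sketch. The abstract does not state how an association was scored, so this example assumes plain co-occurrence of two codes within one record (a patient or a citation) and an optional minimum count. The function names, the `min_count` parameter, and the toy codes are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import combinations


def pairwise_associations(records, min_count=1):
    """Return unordered code pairs that co-occur in at least min_count records.

    Assumption: an "association" is simple co-occurrence; the study's actual
    association measure is not given in the abstract.
    """
    counts = Counter()
    for codes in records:
        # Sort the distinct codes so each unordered pair has one canonical form.
        for pair in combinations(sorted(set(codes)), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_count}


def overlap_fraction(clinical_pairs, medline_pairs):
    """Fraction of clinical pairs that also appear among the Medline pairs."""
    if not clinical_pairs:
        return 0.0
    return len(clinical_pairs & medline_pairs) / len(clinical_pairs)


# Toy, made-up inputs: one list of codes per patient or per citation.
clinical_records = [
    ["250.00", "401.9", "272.4"],
    ["250.00", "401.9"],
    ["493.90", "477.9"],
]
medline_records = [
    ["250.00", "401.9"],
    ["401.9", "585.9"],
]

clinical_pairs = pairwise_associations(clinical_records)
medline_pairs = pairwise_associations(medline_records)

# Pairs seen only in the clinical data are candidates for novelty, not findings.
candidate_novel = clinical_pairs - medline_pairs
shared_fraction = overlap_fraction(clinical_pairs, medline_pairs)
```

In this sketch `shared_fraction` plays the role of the 6.6% overlap figure reported under RESULTS, and `candidate_novel` holds the pairs that would still need manual review, since absence from Medline alone does not establish that an association is new.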
