Building the graph of medicine from millions of clinical narratives

Electronic health records (EHRs) are a rich and relatively untapped resource for characterizing the true nature of clinical practice and for quantifying the degree of inter-relatedness of medical entities such as drugs, diseases, procedures and devices. We provide a unique set of co-occurrence matrices that quantify the pairwise mentions of 3 million terms mapped onto 1 million clinical concepts, calculated from the raw text of 20 million clinical notes spanning 19 years of data. Co-frequencies were computed using a parallelized annotation, hashing, and counting pipeline applied to clinical notes from Stanford Hospitals and Clinics. The co-occurrence matrices quantify the relatedness among medical concepts and can serve as the basis for many statistical tests: they can be used to directly compute Bayesian conditional probabilities, association rules, and test statistics such as relative risks and odds ratios. This dataset can be leveraged to quantitatively assess comorbidity, drug-drug, and drug-disease patterns for a range of clinical, epidemiological, and financial applications.
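As an illustration of how a single co-occurrence entry could yield the statistics named above, the following is a minimal sketch, not the authors' pipeline. It assumes that for a concept pair (A, B) one has the co-occurrence count, the two marginal counts, and the total number of counting units (patients or notes); the names `PairCounts` and `association_stats` are hypothetical, and every cell of the 2 x 2 table is assumed to be nonzero.

```python
from dataclasses import dataclass


@dataclass
class PairCounts:
    """Counts for one concept pair (hypothetical field names)."""
    n_ab: int  # units mentioning both A and B (the co-occurrence entry)
    n_a: int   # units mentioning A
    n_b: int   # units mentioning B
    n: int     # total number of units


def association_stats(c: PairCounts) -> dict:
    """Derive a 2 x 2 table and basic association measures from counts."""
    both = c.n_ab
    a_only = c.n_a - c.n_ab                   # A without B
    b_only = c.n_b - c.n_ab                   # B without A
    neither = c.n - c.n_a - c.n_b + c.n_ab    # inclusion-exclusion

    # Assumption: no empty cells, so the ratios below are defined.
    if min(both, a_only, b_only, neither) <= 0:
        raise ValueError("sketch assumes all four cells are positive")

    p_b_given_a = both / c.n_a                # P(B | A)
    p_b_given_not_a = b_only / (c.n - c.n_a)  # P(B | not A)
    return {
        "p_b_given_a": p_b_given_a,
        "relative_risk": p_b_given_a / p_b_given_not_a,
        "odds_ratio": (both * neither) / (a_only * b_only),
    }


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    print(association_stats(PairCounts(n_ab=50, n_a=100, n_b=200, n=1000)))
```

The counts in the usage line are invented for demonstration and are not taken from the dataset. In practice, sparse pairs with zero cells would need a continuity correction or exclusion before computing ratios.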
