Building Causal Graphs from Medical Literature and Electronic Medical Records

Large repositories of medical data, such as Electronic Medical Record (EMR) data, are recognized as promising sources for knowledge discovery. Effective analysis of such repositories often necessitates a thorough understanding of the dependencies in the data. For example, if patient age is ignored, one might wrongly infer a causal relationship between cataract and hypertension, since both conditions become more common with age. Such confounding variables are often identified with causal graphs, in which variables are connected by causal relationships. Current approaches to building such graphs automatically rely on text analysis of the medical literature, yet the result is typically a large graph of low precision. Statistical methods exist for constructing causal graphs from observational data, but they are poorly suited to a large number of covariates, which is the case in EMR data. Consequently, confounding variables are often identified by medical domain experts through a manual, expensive, and time-consuming process. We present a novel approach for automatically constructing causal graphs between medical conditions. The first part is a novel graph-based method that better captures the causal relationships implied by the medical literature, especially in the presence of multiple causal factors. Even with these advanced text-analysis methods, the text-derived graph still contains many weak or uncertain causal connections. Therefore, we construct a second graph over the same terms from an EMR repository of over 1.5M patients. We then combine the two graphs, keeping only edges that are supported by both medical-text-based and observational evidence. We examine several strategies for carrying out our approach and compare the precision of the resulting graphs as judged by medical experts. Our results show that each of our methods significantly improves precision over the state of the art.
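
The combination step described above can be illustrated with a minimal sketch. It assumes each graph is given as a mapping from a directed (cause, effect) edge to an evidence score. The function name, the score thresholds, and the toy edges and scores are hypothetical and chosen only for illustration; they are not the paper's actual method, data, or results.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect); a hypothetical edge representation


def combine_graphs(
    text_edges: Dict[Edge, float],
    emr_edges: Dict[Edge, float],
    text_min: float = 0.0,
    emr_min: float = 0.0,
) -> Set[Edge]:
    """Keep only edges that have evidence in BOTH graphs.

    text_min and emr_min are assumed score thresholds; an edge must
    meet both to survive. With the defaults, this is a plain
    intersection of the two edge sets (for non-negative scores).
    """
    return {
        edge
        for edge, text_score in text_edges.items()
        if text_score >= text_min
        and edge in emr_edges
        and emr_edges[edge] >= emr_min
    }


# Toy input (made-up scores): the literature-derived graph contains a
# weak cataract -> hypertension edge that the EMR-derived graph lacks.
text_graph = {
    ("age", "cataract"): 0.9,
    ("age", "hypertension"): 0.8,
    ("cataract", "hypertension"): 0.2,
}
emr_graph = {
    ("age", "cataract"): 0.7,
    ("age", "hypertension"): 0.6,
}

combined = combine_graphs(text_graph, emr_graph)
print(sorted(combined))
```

In this toy case the cataract-to-hypertension edge is dropped because it lacks observational support, while both edges from age are kept. How the scores are computed and which thresholds are used would depend on the strategy chosen.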
