Evaluating statistical approaches to leverage large clinical datasets for uncovering therapeutic and adverse medication effects

Motivation: Phenome-wide association studies (PheWAS) have been used to discover many genotype-phenotype relationships and have the potential to identify therapeutic and adverse drug outcomes using longitudinal data within electronic health records (EHRs). However, statistical methods for PheWAS applied to longitudinal EHR medication data have not been established.

Results: In this study, we developed methods to address two challenges in reusing EHR data for this purpose: confounding by indication, and low exposure and event rates. We used Monte Carlo simulation to assess propensity score (PS) methods for addressing confounding by indication, focusing on two of the most commonly used: PS matching and PS adjustment. We also compared two logistic regression approaches, the default Wald test and Firth's penalized maximum likelihood (PML), for handling complete separation caused by sparse data with low exposure and event rates. PS adjustment yielded greater power than PS matching while controlling Type I error at 0.05. The PML method provided reasonable P-values, even in cases with complete separation, with well-controlled Type I error rates. Using PS adjustment and the PML method, we identified novel latent drug effects in pediatric patients exposed to two common antibiotic drugs, ampicillin and gentamicin.

Availability and implementation: R packages PheWAS and EHR are available at https://github.com/PheWAS/PheWAS and at CRAN (https://www.r-project.org/), respectively. The R script for data processing and the main analysis is available at https://github.com/choileena/EHR.

Supplementary information: Supplementary data are available at Bioinformatics online.
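To illustrate why a penalized likelihood helps with sparse data, the following is a minimal sketch of Firth-type penalized logistic regression applied to a 2 x 2 table in which no exposed patient has the event, so the ordinary maximum likelihood estimate of the exposure effect does not exist. It is written in Python with NumPy for illustration only and is not the authors' implementation, which is distributed as R packages. The function name `firth_logistic` and the counts (0 events among 10 exposed, 20 events among 90 unexposed) are hypothetical, and the sketch omits refinements such as step-halving and profile-likelihood P-values.

```python
import numpy as np

def firth_logistic(X, y, max_iter=200, tol=1e-10):
    """Firth-penalized logistic regression via modified-score iterations.

    Hypothetical helper for illustration; X must include an intercept column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # fitted probabilities
        w = p * (1.0 - p)                         # logistic variance weights
        info = X.T @ (X * w[:, None])             # Fisher information X'WX
        info_inv = np.linalg.inv(info)
        # Leverages: diagonal of W^(1/2) X (X'WX)^(-1) X' W^(1/2)
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Firth's modified score adds h * (0.5 - p) to the usual residual
        score = X.T @ (y - p + h * (0.5 - p))
        step = info_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical sparse table: 0/10 events among exposed, 20/90 among unexposed
exposure = np.concatenate([np.ones(10), np.zeros(90)])
event = np.concatenate([np.zeros(10), np.ones(20), np.zeros(70)])
X = np.column_stack([np.ones_like(exposure), exposure])

beta_hat = firth_logistic(X, event)
print("penalized log odds ratio for exposure:", beta_hat[1])
```

For this saturated two-group model, the penalized estimate coincides with the log odds ratio obtained after adding 0.5 to each cell of the table, which is finite even though one cell is zero.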
