ODAL: A one-shot distributed algorithm to perform logistic regressions on electronic health records data from multiple clinical sites

Electronic Health Records (EHRs) contain extensive information on health outcomes and risk factors, and have therefore been broadly used in healthcare research. Integrating EHR data from multiple clinical sites can accelerate knowledge discovery and risk prediction by providing a larger sample from a more general population, which can reduce bias and improve the accuracy of estimation and prediction. To overcome the barrier of sharing patient-level data, distributed algorithms have been developed that conduct statistical analyses across multiple sites by sharing only aggregated information. Current distributed algorithms often require iterative evaluation and transfer of information across sites, which can lead to high communication costs in practical settings. In this study, we propose a privacy-preserving and communication-efficient distributed algorithm for logistic regression that does not require iterative communication across sites. In our simulation study, the algorithm reached accuracy comparable to that of the oracle estimator, for which the data from all sites are pooled together. We applied the algorithm to EHR data from the University of Pennsylvania health system to evaluate the risk of fetal loss associated with various medication exposures.
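The abstract does not spell out how the one-shot estimator is constructed, so the following is only a minimal sketch of one way such a scheme could work, assuming a surrogate-likelihood construction of the kind used in communication-efficient distributed inference. The function names (`newton_fit`, `one_shot_estimate`, `simulate_sites`), the choice of the first site as the lead site, and the simulated data are all hypothetical and are not taken from the paper. In this sketch the lead site fits a local model, every site returns only the gradient of its log-likelihood at that initial estimate together with its sample size, and the lead site then refits using its own data plus the aggregated gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_gradient(X, y, beta):
    """Gradient of the average negative log-likelihood at beta."""
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

def newton_fit(X, y, shift=None, iters=50, tol=1e-10):
    """Minimize (average negative log-likelihood + shift @ beta) by Newton's method.

    With shift=None this is ordinary logistic regression.
    """
    n, p = X.shape
    shift = np.zeros(p) if shift is None else shift
    beta = np.zeros(p)
    for _ in range(iters):
        mu = sigmoid(X @ beta)
        grad = X.T @ (mu - y) / n + shift
        hess = (X * (mu * (1.0 - mu))[:, None]).T @ X / n
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def one_shot_estimate(sites):
    """Assumed surrogate-likelihood scheme; sites[0] plays the lead site.

    Only a length-p gradient and a sample size leave each site, and
    each site is contacted once.
    """
    X1, y1 = sites[0]
    beta_bar = newton_fit(X1, y1)  # initial estimate from lead-site data only
    n_total = sum(len(y) for _, y in sites)
    # Aggregated information: sample-size-weighted average of site gradients.
    global_grad = sum(len(y) * avg_gradient(X, y, beta_bar) for X, y in sites) / n_total
    # A linear correction makes the lead site's gradient match the global one at beta_bar.
    shift = global_grad - avg_gradient(X1, y1, beta_bar)
    return newton_fit(X1, y1, shift=shift)

def simulate_sites(n_sites=3, n=2000, beta=(-0.5, 1.0, -1.0), seed=0):
    """Hypothetical simulated data: an intercept plus standard normal covariates."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta)
    sites = []
    for _ in range(n_sites):
        X = np.column_stack([np.ones(n), rng.normal(size=(n, len(beta) - 1))])
        y = rng.binomial(1, sigmoid(X @ beta)).astype(float)
        sites.append((X, y))
    return sites

if __name__ == "__main__":
    sites = simulate_sites()
    X_all = np.vstack([X for X, _ in sites])
    y_all = np.concatenate([y for _, y in sites])
    print("one-shot:", one_shot_estimate(sites))
    print("pooled  :", newton_fit(X_all, y_all))
```

Running the script prints the one-shot estimate next to the pooled ("oracle") fit on the simulated data, which gives a rough sense of the comparison described in the abstract. The point of the design is that no patient-level rows are exchanged: each site contributes one vector whose length equals the number of coefficients.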
