Grid Binary LOgistic REgression (GLORE): building shared models without sharing data

Objective: The classification of complex or rare patterns in clinical and genomic data requires a large, labeled patient set. While methods that operate on large, centralized data sources have been used extensively, little attention has been paid to whether models such as binary logistic regression (LR) can be developed in a distributed manner, allowing researchers to share models without necessarily sharing patient data.

Material and methods: Instead of bringing data to a central repository for computation, we bring computation to the data. The Grid Binary LOgistic REgression (GLORE) model integrates decomposable partial elements or non-privacy-sensitive prediction values to obtain model coefficients, the variance-covariance matrix, the goodness-of-fit test statistic, and the area under the receiver operating characteristic (ROC) curve.

Results: We conducted experiments on both simulated and clinically relevant data, and compared the computational costs of GLORE with those of a traditional LR model estimated using the combined data. The GLORE results matched those of LR to a precision of 10^-15. In addition, GLORE is computationally efficient.

Limitation: In GLORE, the calculation of coefficient gradients must be synchronized across sites, which requires some effort to ensure the integrity of communication. Predictors must also have the same format and meaning across the data sets.

Conclusion: The results suggest that GLORE performs as well as LR and allows data to remain protected at their original sites.
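The idea of combining decomposable partial elements from each site can be illustrated with a minimal sketch. It assumes a Newton-Raphson update in which every site returns only its summed gradient and information matrix for the current coefficients; the abstract does not spell out the exact update rule, and the function names, the simulated data, and the three-way split are hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def local_summaries(X, y, beta):
    """Run at one site: returns aggregates only, never patient rows."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))      # predicted probabilities
    gradient = X.T @ (y - p)                 # local score vector
    w = p * (1.0 - p)                        # local IRLS weights
    information = X.T @ (X * w[:, None])     # local X' W X
    return gradient, information

def fit_distributed(sites, n_features, tol=1e-10, max_iter=50):
    """Coordinator: sums site-level summaries and takes Newton steps."""
    beta = np.zeros(n_features)
    for _ in range(max_iter):
        parts = [local_summaries(X, y, beta) for X, y in sites]
        gradient = sum(g for g, _ in parts)
        information = sum(h for _, h in parts)
        step = np.linalg.solve(information, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Variance-covariance matrix from the summed information at the final beta.
    parts = [local_summaries(X, y, beta) for X, y in sites]
    covariance = np.linalg.inv(sum(h for _, h in parts))
    return beta, covariance

# Simulated data (an assumption for illustration): intercept plus two predictors.
rng = np.random.default_rng(0)
n = 600
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-0.5, 1.0, -1.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

# Three hypothetical sites, each holding a disjoint block of rows.
sites = list(zip(np.array_split(X, 3), np.array_split(y, 3)))
beta_dist, cov_dist = fit_distributed(sites, n_features=3)

# Reference fit on the pooled data, reusing the same routine with one "site".
beta_pooled, cov_pooled = fit_distributed([(X, y)], n_features=3)
max_abs_diff = float(np.max(np.abs(beta_dist - beta_pooled)))
```

Because the gradient and information matrix are sums over patients, adding the per-site pieces gives the same quantities as the pooled data up to floating-point rounding, so `max_abs_diff` should be close to machine precision. The loop also makes the stated limitation visible: every site must evaluate its summaries at the same `beta` in each iteration, so the rounds have to be synchronized.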
