Distributed multivariable modeling for signature development under data protection constraints

Data protection constraints frequently require distributed analysis: individual-level data remain at many different sites, but the analysis nevertheless has to be performed jointly. Data exchange is often handled manually and requires explicit permission before each transfer, so both the number of data calls and the amount of data transferred should be limited. Consequently, typically only simple summary statistics are transferred and aggregated in a single call, which rules out more complex statistical techniques such as automatic variable selection for prognostic signature development. We propose a multivariable regression approach that builds a prognostic signature by automatic variable selection, based on aggregated data obtained from the different sites in iterative calls. To minimize the amount of transferred data and the number of calls, we also provide a heuristic variant of the approach. To further strengthen data protection, the approach can be combined with a trusted third party architecture. We evaluate the proposed method in a simulation study, comparing its results with those obtained from the pooled individual-level data. The proposed method detects covariates with a true effect to an extent comparable to a method based on individual-level data, although performance decreases moderately when the number of sites is large. In a typical scenario, the heuristic reduces the number of data calls from more than 10 to 3. To make the approach widely available, we provide an implementation on top of the DataSHIELD framework.
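The idea of selecting variables from aggregated data in iterative calls can be illustrated with a minimal sketch. The abstract does not specify the regression algorithm, so the sketch below assumes componentwise least-squares boosting without an intercept (covariates and response roughly centered). It is not the authors' implementation and does not use the DataSHIELD API. All function names (`site_first_call`, `site_column_call`, `distributed_boosting`) and the parameters `steps` and `nu` are hypothetical. Each site returns only sums of products, never individual rows, and a further call is made only when a covariate is selected for the first time.

```python
import numpy as np


def site_first_call(X, y):
    """Runs at a site: returns aggregates only, never rows of X or y."""
    return {"xty": X.T @ y, "diag": np.sum(X * X, axis=0)}


def site_column_call(X, j):
    """Runs at a site: column j of X'X, i.e. products of covariate j with all covariates."""
    return X.T @ X[:, j]


def distributed_boosting(sites, steps=30, nu=0.1):
    """Componentwise boosting on pooled aggregates (assumed algorithm, see lead-in).

    sites: list of (X, y) pairs, one per site; in a real deployment the
    coordinator would see only the return values of the two site functions.
    Returns the coefficient vector and the number of data calls made.
    """
    first = [site_first_call(X, y) for X, y in sites]  # data call 1
    xty = sum(s["xty"] for s in first)                 # pooled X'y
    diag = sum(s["diag"] for s in first)               # pooled diagonal of X'X
    beta = np.zeros(xty.shape[0])
    xtx_cols = {}                                      # pooled X'X columns fetched so far
    calls = 1
    for _ in range(steps):
        # Score vector X'(y - X beta); only columns of selected covariates are needed.
        score = xty.copy()
        for j, col in xtx_cols.items():
            score -= col * beta[j]
        # Pick the covariate with the largest reduction in residual sum of squares.
        j = int(np.argmax(score ** 2 / diag))
        if j not in xtx_cols:
            # A new covariate enters the model: one additional data call.
            xtx_cols[j] = sum(site_column_call(X, j) for X, _ in sites)
            calls += 1
        beta[j] += nu * score[j] / diag[j]             # shrunken univariate update
    return beta, calls


# Usage with simulated data split over three hypothetical sites.
rng = np.random.default_rng(1)
true_beta = np.zeros(10)
true_beta[:2] = [2.0, -1.5]
demo_sites = []
for n in (40, 60, 50):
    X = rng.standard_normal((n, 10))
    demo_sites.append((X, X @ true_beta + rng.standard_normal(n)))
beta_hat, n_calls = distributed_boosting(demo_sites)
```

Because sums of products over sites equal the products computed on the stacked data, this sketch reproduces the result of the same boosting procedure run on pooled individual-level data. The number of calls grows with the number of distinct selected covariates, which is the quantity a heuristic variant would aim to reduce.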
