Privacy-preserving construction of generalized linear mixed model for biomedical computation

Abstract

Motivation: The generalized linear mixed model (GLMM) is an extension of the generalized linear model (GLM) in which the linear predictor takes random effects into account. Because it can precisely model mixed effects from multiple sources of random variation, the method has been widely used in biomedical computation, for instance in genome-wide association studies (GWASs) that aim to detect genetic variants significantly associated with phenotypes such as human diseases. Collaborative GWAS on large cohorts of patients across multiple institutions is often impeded by privacy concerns about sharing personal genomic and other health data. To address such concerns, we present in this paper a privacy-preserving Expectation–Maximization (EM) algorithm to build GLMM collaboratively when input data are distributed across multiple participating parties and cannot be transferred to a central server. We assume that the data are horizontally partitioned among participating parties: i.e. each party holds a subset of records (including observational values of fixed effect variables and their corresponding outcome), and for all records, the outcome is regulated by the same set of known fixed effects and random effects.

Results: Our collaborative EM algorithm is mathematically equivalent to the original EM algorithm commonly used in GLMM construction. The algorithm also runs efficiently when tested on simulated and real human genomic data, and thus can be practically used for privacy-preserving GLMM construction. We implemented the algorithm for collaborative GLMM (cGLMM) construction in R. The data communication was implemented using the rsocket package.

Availability and implementation: The software is released as open source at https://github.com/huthvincent/cGLMM.

Supplementary information: Supplementary data are available at Bioinformatics online.
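The equivalence claim for horizontally partitioned data rests on a simple property: quantities such as the log-likelihood gradient are sums over records, so each party can compute its share locally and only the aggregates need to be combined. The Python sketch below illustrates this for a logistic model, conditional on a fixed draw of the random effects. It is not the authors' cGLMM protocol (which is implemented in R); the function `local_score`, the toy data, and the per-record random-effect values `u` are hypothetical and chosen only for illustration.

```python
import math

def local_score(X, y, u, beta):
    """Gradient of the logistic log-likelihood w.r.t. beta over one set of
    records, conditional on given random-effect values u (one per record)."""
    grad = [0.0] * len(beta)
    for xi, yi, ui in zip(X, y, u):
        eta = sum(b * x for b, x in zip(beta, xi)) + ui  # linear predictor
        mu = 1.0 / (1.0 + math.exp(-eta))                # inverse logit link
        for j, x in enumerate(xi):
            grad[j] += (yi - mu) * x
    return grad

# Hypothetical toy data: two parties holding disjoint records (rows)
# described by the same fixed-effect columns (intercept, one covariate).
party_a = {"X": [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], "y": [0, 1, 1], "u": [0.3, 0.3, -0.2]}
party_b = {"X": [[1.0, 2.0], [1.0, 0.0]], "y": [0, 1], "u": [-0.2, 0.1]}
beta = [0.1, -0.4]  # current fixed-effect estimate shared by all parties

# Each party computes its gradient locally; only these vectors are exchanged.
grad_a = local_score(party_a["X"], party_a["y"], party_a["u"], beta)
grad_b = local_score(party_b["X"], party_b["y"], party_b["u"], beta)
aggregated = [a + b for a, b in zip(grad_a, grad_b)]

# Reference: the gradient a central server would compute on the pooled records.
pooled = local_score(
    party_a["X"] + party_b["X"],
    party_a["y"] + party_b["y"],
    party_a["u"] + party_b["u"],
    beta,
)

match = all(math.isclose(a, p, rel_tol=1e-9, abs_tol=1e-12)
            for a, p in zip(aggregated, pooled))
print("aggregated == pooled:", match)
```

The sketch covers only the additive decomposition. How the aggregates are protected in transit, and how the random effects are handled in the E-step, are left to the paper and its software.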
