Private Genomes and Public SNPs: Homomorphic Encryption of Genotypes and Phenotypes for Shared Quantitative Genetics

Mott et al. show that association between a quantitative trait and genotype can be performed using data that has been transformed by first rotating it in a high-dimensional space. The resulting...

Sharing human genotype and phenotype data is essential to discover otherwise inaccessible genetic associations, but is a challenge because of privacy concerns. Here, we present a method of homomorphic encryption that obscures individuals’ genotypes and phenotypes, and is suited to quantitative genetic association analysis. Encrypted ciphertext and unencrypted plaintext are analytically interchangeable. The encryption uses a high-dimensional random linear orthogonal transformation key that leaves the likelihood of quantitative trait data unchanged under a linear model with normally distributed errors. It also preserves linkage disequilibrium between genetic variants and associations between variants and phenotypes. It scrambles relationships between individuals: encrypted genotype dosages closely resemble Gaussian deviates, and can be replaced by quantiles from a Gaussian with negligible effects on accuracy. Likelihood-based inferences are unaffected by orthogonal encryption. These include linear mixed models to control for unequal relatedness between individuals, heritability estimation, and including covariates when testing association. Orthogonal transformations can be applied in a modular fashion for multiparty federated mega-analyses where the parties first agree to share a common set of genotype sites and covariates prior to encryption. Each then privately encrypts and shares their own ciphertext, and analyses all parties’ ciphertexts. In the absence of private variants, or knowledge of the key, we show that it is infeasible to decrypt ciphertext using existing brute-force or noise-reduction attacks. We present the method as a challenge to the community to determine its security.
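The invariance claimed in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the helper names (`random_orthogonal`, `ols`, `toy_cohort`), the cohort sizes, the allele frequency, and the use of QR decomposition to draw the key are all assumptions made for illustration. The sketch multiplies the design matrix (intercept plus genotype dosages) and the phenotype by the same random orthogonal matrix, then checks that ordinary least squares gives the same effect estimates and residual sum of squares, and that the variant cross-product matrix (a stand-in for linkage disequilibrium) is unchanged. It also checks the federated case, where two parties encrypt with independent keys and the stacked ciphertexts are analysed together.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix from the QR factor of a Gaussian matrix."""
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    # Flip column signs so the diagonal of R is positive (uniform over the orthogonal group).
    return q * np.sign(np.diag(r))

def ols(X, y):
    """Least-squares coefficients and residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)

def toy_cohort(n, p, effects, rng):
    """Hypothetical cohort: intercept column plus p genotype dosages in {0, 1, 2}."""
    G = rng.binomial(2, 0.3, size=(n, p)).astype(float)
    X = np.column_stack([np.ones(n), G])  # the intercept is treated as a covariate
    y = X @ effects + rng.standard_normal(n)
    return X, y

rng = np.random.default_rng(0)
p = 5
effects = rng.standard_normal(p + 1)

# Single party: encrypt covariates, genotypes, and phenotype with the same key P.
X, y = toy_cohort(200, p, effects, rng)
P = random_orthogonal(200, rng)
X_enc, y_enc = P @ X, P @ y
beta_plain, rss_plain = ols(X, y)
beta_enc, rss_enc = ols(X_enc, y_enc)

# Two parties: each uses its own private key; the ciphertexts are stacked row-wise.
X2, y2 = toy_cohort(150, p, effects, rng)
P2 = random_orthogonal(150, rng)
X_pool, y_pool = np.vstack([X, X2]), np.concatenate([y, y2])
X_pool_enc = np.vstack([X_enc, P2 @ X2])
y_pool_enc = np.concatenate([y_enc, P2 @ y2])
beta_pool, rss_pool = ols(X_pool, y_pool)
beta_pool_enc, rss_pool_enc = ols(X_pool_enc, y_pool_enc)

print(np.allclose(beta_plain, beta_enc), np.allclose(beta_pool, beta_pool_enc))
```

The pooled result holds because stacking two independently encrypted cohorts is equivalent to applying one block-diagonal orthogonal matrix to the stacked plaintext. Note that the intercept column must be transformed along with the genotypes; fitting a fresh column of ones to the ciphertext would not reproduce the plaintext fit.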
