Ultra-Fast Homomorphic Encryption Models Enable Secure Outsourcing of Genotype Imputation

Genotype imputation is a fundamental step in genomic data analyses such as genome-wide association studies (GWAS), in which missing variant genotypes are predicted from the known genotypes of nearby ‘tag’ variants. Imputation greatly decreases genotyping cost and provides high-quality estimates of common variant genotypes. As reference population panels grow, e.g., through the TOPMed Project, genotype imputation is becoming more accurate, but it also requires more computational power. Although researchers can outsource genotype imputation, privacy concerns may prohibit sharing genetic data with an untrusted imputation service. To address this problem, we developed the first fully secure genotype imputation methods, using ultra-fast homomorphic encryption (HE) techniques that can evaluate millions of imputation models in seconds. In HE-based methods, the genotype data are end-to-end encrypted, i.e., encrypted in transit, at rest, and, most importantly, during analysis, and can be decrypted only by the data owner. We compared secure imputation with three state-of-the-art non-secure methods under different settings. We found that HE-based methods provide full genetic data security with comparable or slightly lower accuracy. In addition, the time and memory requirements of HE-based methods are comparable to, and in some cases lower than, those of the non-secure methods. We provide five implementations and workflows, developed by the top contestants of the iDASH19 Genome Privacy Challenge, that use three cutting-edge HE schemes (BFV, CKKS, TFHE). Our results provide strong evidence that HE-based methods can practically perform resource-intensive computations for high-throughput genetic data analysis. In addition, the publicly available codebases provide a reference for the development of secure genomic data analysis methods.
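To make the idea of an "imputation model" concrete, the sketch below predicts one missing genotype as a weighted sum of nearby tag genotypes. It is illustrative only: the abstract does not specify the model form, so the linear model, the function names (`impute_dosage`, `call_genotype`), and the example weights are all assumptions. The sketch runs on plaintext values and contains no encryption. It is included because a model built only from additions and multiplications is the kind of computation HE schemes can, in principle, evaluate on ciphertexts.

```python
from typing import Sequence


def impute_dosage(tag_genotypes: Sequence[int],
                  weights: Sequence[float],
                  intercept: float) -> float:
    """Hypothetical linear imputation model (an assumption, not the paper's code).

    tag_genotypes: alternate-allele counts (0, 1, or 2) at nearby tag variants.
    weights, intercept: model parameters, assumed to be trained elsewhere.
    Uses only additions and multiplications.
    """
    if len(tag_genotypes) != len(weights):
        raise ValueError("need exactly one weight per tag genotype")
    dosage = intercept
    for genotype, weight in zip(tag_genotypes, weights):
        dosage += weight * genotype
    return dosage


def call_genotype(dosage: float) -> int:
    """Round a predicted dosage to a genotype and clamp it to the range 0..2.

    In an HE workflow this step is assumed to run on the data owner's side,
    after decryption.
    """
    return min(2, max(0, round(dosage)))


# Toy example with made-up weights for three tag variants.
example_dosage = impute_dosage([0, 1, 2], [0.5, 0.25, 0.1], intercept=0.05)
```

In an outsourced setting, under these assumptions, the service would evaluate `impute_dosage` on encrypted tag genotypes and return an encrypted dosage, and only the data owner would decrypt it and apply `call_genotype`.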
