Fast and Scalable Private Genotype Imputation Using Machine Learning and Partially Homomorphic Encryption

Recent advances in genome sequencing technologies provide unprecedented opportunities to understand the relationship between human genetic variation and disease. However, genotyping whole genomes for a large cohort of individuals is still cost-prohibitive. Imputation methods that predict the genotypes of missing genetic variants are therefore widely used, especially in genome-wide association studies. Accurate genotype imputation requires complex statistical methods. Because the problem is data- and compute-intensive, imputation is increasingly outsourced, raising serious privacy concerns. In this work, we investigate solutions for fast, scalable, and accurate privacy-preserving genotype imputation using machine learning (ML) and a standardized homomorphic encryption scheme, the Paillier cryptosystem. ML-based privacy-preserving inference has largely been optimized for computation-heavy non-linear functions in a single-output multi-class classification setting. Genotype imputation, however, produces a large number of multi-class outputs per genome per individual, which calls for further optimizations and/or approximations specific to this application. Here we explore the effectiveness of linear models for genotype imputation and convert them to privacy-preserving equivalents using standardized homomorphic encryption schemes. Our results show that the performance of our privacy-preserving genotype imputation method is equivalent to that of state-of-the-art plaintext solutions, achieving up to 99% micro-averaged area under the curve (AUC) score, even on real-world large-scale datasets with up to 80,000 targets.
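To illustrate why linear models pair naturally with an additively homomorphic scheme such as Paillier, the following minimal sketch evaluates a linear score w·x + b on encrypted inputs. It is not the authors' implementation: the tiny primes, the integer-quantized weights, the 0/1/2 genotype encoding, and all function names are assumptions made purely for illustration, and the key size is far too small to be secure.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier key generation with generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (lam, mu)

def encrypt(n, m):
    """Enc(m) = (1 + n)^m * r^n mod n^2, with (1 + n)^m = 1 + m*n mod n^2."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + (m % n) * n) % n2 * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    """Recover m and map it back to a signed integer in (-n/2, n/2]."""
    lam, mu = priv
    n2 = n * n
    m = (pow(c, lam, n2) - 1) // n * mu % n
    return m - n if m > n // 2 else m

def encrypted_linear_score(n, enc_x, weights, bias):
    """Server side: compute Enc(sum_i w_i * x_i + b) without decrypting.

    Multiplying ciphertexts adds plaintexts; raising a ciphertext to a
    plaintext power multiplies its plaintext by that scalar.
    """
    n2 = n * n
    acc = (1 + (bias % n) * n) % n2  # encoding of the bias with r = 1
    for c, w in zip(enc_x, weights):
        acc = acc * pow(c, w % n, n2) % n2  # negative weights wrap mod n
    return acc

# Client: encrypt hypothetical genotypes at four tag variants (0, 1, or 2).
n, priv = keygen(1009, 1013)  # toy primes, insecure by design
genotypes = [0, 1, 2, 1]
enc_x = [encrypt(n, g) for g in genotypes]

# Server: hypothetical integer-quantized model for one target variant.
weights, bias = [3, -2, 5, 1], -4
enc_score = encrypted_linear_score(n, enc_x, weights, bias)

# Client: decrypt the score; 3*0 - 2*1 + 5*2 + 1*1 - 4 = 5.
score = decrypt(n, priv, enc_score)
```

Because only ciphertext multiplication and exponentiation by plaintext scalars are needed, the server never sees the genotypes or the score. A multi-class, multi-target setting would repeat this evaluation once per class and per target variant, with the client taking the arg-max after decryption.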
