Privacy-Preserving Genotype Imputation in a Trusted Execution Environment

Genotype imputation is an essential tool in genetics research, whereby missing genotypes are inferred based on a panel of reference genomes to enhance the power of downstream analyses. Recently, public imputation servers have been developed to allow researchers to leverage increasingly large-scale and diverse genetic data repositories for imputation. However, privacy concerns associated with uploading one’s genetic data to a third-party server greatly limit the utility of these services. In this paper, we introduce a practical, secure hardware-based solution for a privacy-preserving imputation service, which keeps the input genomes private from the service provider by processing the data only within a Trusted Execution Environment (TEE) offered by the Intel SGX technology. Our solution features SMac, an efficient, side-channel-resilient imputation algorithm designed for Intel SGX, which employs the hidden Markov model (HMM)-based imputation strategy also utilized by a state-of-the-art imputation software Minimac. SMac achieves imputation accuracies virtually identical to those of Minimac and provides protection against known attacks on SGX while maintaining scalability to large datasets. We additionally show the necessity of our strategies for mitigating side-channel risks by identifying vulnerabilities in existing imputation software and controlling their information exposure. Overall, our work provides a guideline for practical and secure implementation of genetic analysis tools in SGX, representing a step toward privacy-preserving analysis services that can facilitate data sharing and accelerate genetics research.† Availability Our software is available at https://github.com/ndokmai/sgx-genotype-imputation.

[1]  Srdjan Capkun,et al.  Software Grand Exposure: SGX Cache Attacks Are Practical , 2017, WOOT.

[2]  J. Marchini,et al.  Genotype Imputation with Thousands of Genomes , 2011, G3: Genes | Genomes | Genetics.

[3]  Alan M. Kwong,et al.  Next-generation genotype imputation service and methods , 2016, Nature Genetics.

[4]  L. Baum,et al.  An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process , 1972 .

[5]  G. Abecasis,et al.  Merlin—rapid analysis of dense genetic maps using sparse gene flow trees , 2002, Nature Genetics.

[6]  Toshihiro Tanaka The International HapMap Project , 2003, Nature.

[7]  Thomas F. Wenisch,et al.  Foreshadow-NG: Breaking the virtual memory abstraction with transient out-of-order execution , 2018 .

[8]  Zhaohui S. Qin,et al.  A second generation human haplotype map of over 3.1 million SNPs , 2007, Nature.

[9]  Eduardo Chielle,et al.  Privacy-preserving genotype imputation with fully homomorphic encryption , 2020, bioRxiv.

[10]  L Kruglyak,et al.  Efficient multipoint linkage analysis through reduction of inheritance space. , 2001, American journal of human genetics.

[11]  Gernot Heiser,et al.  Last-Level Cache Side-Channel Attacks are Practical , 2015, 2015 IEEE Symposium on Security and Privacy.

[12]  Brian L Browning,et al.  Genotype Imputation with Millions of Reference Samples. , 2016, American journal of human genetics.

[13]  G. Abecasis,et al.  MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes , 2010, Genetic epidemiology.

[14]  Daniel Gruss,et al.  PLATYPUS: Software-based Power Side-Channel Attacks on x86 , 2021, 2021 IEEE Symposium on Security and Privacy (SP).

[15]  P. Tam The International HapMap Consortium. The International HapMap Project (Co-PI of Hong Kong Centre which responsible for 2.5% of genome) , 2003 .

[16]  P. Donnelly,et al.  The UK Biobank resource with deep phenotyping and genomic data , 2018, Nature.

[17]  Yun S. Song,et al.  Blockwise HMM computation for large-scale population genomic inference , 2012, Bioinform..

[18]  Can Alkan,et al.  On genomic repeats and reproducibility , 2016, Bioinform..

[19]  Josep Torrellas,et al.  MicroScope: Enabling Microarchitectural Replay Attacks , 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).

[20]  Berk Sunar,et al.  LVI: Hijacking Transient Execution through Microarchitectural Load Value Injection , 2020, 2020 IEEE Symposium on Security and Privacy (SP).

[21]  O. Delaneau,et al.  Supplementary Information for ‘ Improved whole chromosome phasing for disease and population genetic studies ’ , 2012 .

[22]  Berk Sunar,et al.  Tate Pairing with Strong Fault Resiliency , 2007 .

[23]  M. Stephens,et al.  Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. , 2003, Genetics.

[24]  Frank Piessens,et al.  Fallout: Leaking Data on Meltdown-resistant CPUs , 2019, CCS.

[25]  Sayantani Das Next Generation of Genotype Imputation Methods , 2017 .

[26]  P. Donnelly,et al.  A new multipoint method for genome-wide association studies by imputation of genotypes , 2007, Nature Genetics.

[27]  Carl A. Gunter,et al.  Leaky Cauldron on the Dark Land: Understanding Memory Side-Channel Hazards in SGX , 2017, CCS.

[28]  Craig Gentry,et al.  Fully homomorphic encryption using ideal lattices , 2009, STOC '09.

[29]  Gonçalo R. Abecasis,et al.  Minimac2: Faster Genotype Imputation , 2015, Bioinform..

[30]  B. Browning,et al.  Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. , 2007, American journal of human genetics.

[31]  Herbert Bos,et al.  RIDL: Rogue In-Flight Data Load , 2019, 2019 IEEE Symposium on Security and Privacy (SP).

[32]  M. Olivier A haplotype map of the human genome , 2003, Nature.

[33]  Jihoon Kim,et al.  PRINCESS: Privacy‐protecting Rare disease International Network Collaboration via Encryption through Software guard extensionS , 2017, Bioinform..

[34]  Brian E. Cade,et al.  Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program , 2019, Nature.

[35]  Jung Hee Cheon,et al.  Ultra-Fast Homomorphic Encryption Models enable Secure Outsourcing of Genotype Imputation , 2020, bioRxiv.

[36]  Sorin Lerner,et al.  On Subnormal Floating Point and Abnormal Timing , 2015, 2015 IEEE Symposium on Security and Privacy.

[37]  M. Olivier A haplotype map of the human genome. , 2003, Nature.

[38]  Xiaoqian Jiang,et al.  PRESAGE: PRivacy-preserving gEnetic testing via SoftwAre Guard Extension , 2017, BMC Medical Genomics.

[39]  Alan M. Kwong,et al.  A reference panel of 64,976 haplotypes for genotype imputation , 2015, Nature Genetics.

[40]  David P. Woodruff,et al.  Sketching algorithms for genomic data analysis and querying in a secure enclave , 2020, Nature Methods.

[41]  Daniel Gruss,et al.  ZombieLoad: Cross-Privilege-Boundary Data Sampling , 2019, CCS.

[42]  Cesar Pereida García,et al.  Port Contention for Fun and Profit , 2019, 2019 IEEE Symposium on Security and Privacy (SP).

[43]  Paul Scheet,et al.  A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. , 2006, American journal of human genetics.

[44]  Thomas F. Wenisch,et al.  Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution , 2018, USENIX Security Symposium.

[45]  O. Delaneau,et al.  A linear complexity phasing method for thousands of genomes , 2011, Nature Methods.

[46]  Gabor T. Marth,et al.  A global reference for human genetic variation , 2015, Nature.

[47]  Jean-Pierre Seifert,et al.  Cheap Hardware Parallelism Implies Cheap Security , 2007 .