Secure genome-wide association analysis using multiparty computation

Most sequenced genomes are currently stored in strictly access-controlled repositories. Free access to these data could improve the power of genome-wide association studies (GWAS) to identify disease-causing genetic variants and aid the discovery of new drug targets. However, concerns over genetic data privacy may deter individuals from contributing their genomes to scientific studies and could prevent researchers from sharing data with the scientific community. Although cryptographic techniques for secure data analysis exist, none scales to computationally intensive analyses such as GWAS. Here we describe a protocol for large-scale genome-wide analysis that facilitates quality control and population stratification correction in datasets of 9,000, 13,000, and 23,000 individuals while maintaining the confidentiality of the underlying genotypes and phenotypes. We show that the protocol could feasibly scale to a million individuals. This approach may help to make currently restricted data available to the scientific community and could potentially enable secure genome crowdsourcing, allowing individuals to contribute their genomes to a study without compromising their privacy.
