A Secure Alignment Algorithm for Mapping Short Reads to Human Genome

Elastic and inexpensive computing resources such as clouds have been recognized as a useful solution for analyzing massive human genomic data (e.g., data acquired by next-generation sequencers) in biomedical research. However, outsourcing human genome computation to public or commercial clouds has been hindered by privacy concerns: even a small number of human genome sequences contain sufficient information to identify the donor of the genomic data. This issue cannot be directly addressed by existing security and cryptographic techniques (such as homomorphic encryption), because they are too heavyweight to carry out practical genome computation tasks on massive data. In this article, we present a secure algorithm, based on a hybrid cloud computing model, for read mapping, one of the most basic tasks in human genomic data analysis. Compared with existing approaches, our algorithm delegates most of the computation to the public cloud and performs only encryption and decryption on the private cloud, thus making maximum use of the computing resources of the public cloud. Furthermore, our algorithm reports results similar to those of nonsecure read mapping algorithms, including the alignment between the reads and the reference genome, which can be used directly in downstream analyses such as the inference of genomic variations. We implemented the algorithm in C++ and Python on a hybrid cloud system in which the public cloud runs Apache Spark.
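The division of labor described above can be illustrated with a toy sketch. It is not the algorithm of this article, whose details the abstract does not give; it uses keyed-hash seed matching, a simpler technique known from earlier hybrid-cloud read mappers, purely to show how the private side can handle only keyed transformations while the public side does the bulk matching on opaque values. All names (`keyed_hash`, `encrypt_reference`, `encrypt_read`, `vote_positions`, `best_start`) and the seed length `K = 4` are assumptions made for illustration.

```python
import hashlib
import hmac
from collections import defaultdict

K = 4  # hypothetical seed length, chosen small for the toy example


def kmers(seq, k=K):
    """Return (offset, k-mer) pairs for every window of length k in seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def keyed_hash(key, kmer):
    """HMAC-SHA256 of a k-mer under a secret key held by the private cloud."""
    return hmac.new(key, kmer.encode("ascii"), hashlib.sha256).hexdigest()


# Private cloud: only keyed transformations happen here.
def encrypt_reference(key, reference):
    """Map each keyed k-mer hash to the reference positions where it occurs."""
    index = defaultdict(list)
    for pos, kmer in kmers(reference):
        index[keyed_hash(key, kmer)].append(pos)
    return dict(index)


def encrypt_read(key, read):
    """Replace every k-mer of a read with its keyed hash, keeping its offset."""
    return [(off, keyed_hash(key, kmer)) for off, kmer in kmers(read)]


# Public cloud: sees only hashes, never nucleotides, and does the matching.
def vote_positions(hashed_index, hashed_read):
    """Count, per candidate read start, how many seeds support it."""
    votes = defaultdict(int)
    for off, h in hashed_read:
        for pos in hashed_index.get(h, []):
            votes[pos - off] += 1
    return dict(votes)


# Private cloud again: interpret the votes returned by the public cloud.
def best_start(votes):
    """Pick the candidate start with the most supporting seeds, if any."""
    return max(votes, key=votes.get) if votes else None
```

For example, with `key = b"secret"`, hashing the reference `"ACGTTGCAAGGCTTAC"` and the read `"GCAAGGCT"` and passing both to `vote_positions` places the read at reference position 5, supported by all five of its seeds. Note that deterministic keyed hashing still reveals which seeds are equal to one another, so a sketch like this should not be read as a complete privacy guarantee.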
