Privacy-Preserving Compressed Reference-Oriented Alignment Map Using Decentralized Storage

In bioinformatics, researchers have sought to resolve two issues: 1) how to increase storage efficiency through compression and 2) how to keep genome sequence data confidential. To address these issues, the sequence alignment map (SAM), the binary alignment map (BAM), the compressed reference-oriented alignment map (CRAM), and selective retrieval on encrypted CRAM-based formats were proposed. However, because these formats are kept in centralized storage managed by genome testing organizations, the privacy of sensitive genome sequence data is not guaranteed. In this paper, we propose a new compressed reference-oriented alignment map, called the decentralized storage and compressed reference-oriented alignment map (D-RAM), which preserves the privacy of genome sequence data using a decentralized storage architecture. The proposed D-RAM format combines reference-based compression with bzip2 compression to use storage space efficiently. In addition, to preserve the privacy of genome sequence data, the proposed decentralized storage architecture stores the private genome sequence data and the public genome sequence data separately. Experimental results on simulated and real genome sequence data show that the D-RAM format stores genome sequence data in less space than the other formats. We also analyze the computational complexity an attacker faces when recovering the genome sequence data, and use this theoretical analysis to explain why the D-RAM format is safer than the other formats.
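The combination of reference-based compression and bzip2 mentioned above can be illustrated with a minimal sketch. It is not the D-RAM format itself: the record layout (`pos`, `len`, `diff`), the JSON serialization, and the function names are assumptions made for illustration only. Each aligned read is stored as its position on the reference plus the bases that differ from the reference, and the resulting records are then compressed with Python's standard `bz2` module.

```python
import bz2
import json


def encode_read(reference: str, position: int, read: str) -> dict:
    """Describe a read by its alignment position, length, and mismatches.

    Hypothetical record layout, chosen for illustration only.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the offsets where the read differs from the reference.
    diff = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return {"pos": position, "len": len(read), "diff": diff}


def decode_read(reference: str, record: dict) -> str:
    """Rebuild the read from the reference and the stored mismatches."""
    start = record["pos"]
    bases = list(reference[start:start + record["len"]])
    for offset, base in record["diff"]:
        bases[offset] = base
    return "".join(bases)


def pack(records: list) -> bytes:
    """Serialize the records as JSON and compress them with bzip2."""
    return bz2.compress(json.dumps(records).encode("utf-8"))


def unpack(blob: bytes) -> list:
    """Invert pack(): decompress with bzip2 and parse the JSON records."""
    return json.loads(bz2.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    reference = "ACGTACGTACGTACGT"
    reads = [(0, "ACGAAC"), (4, "ACGTACGT")]
    blob = pack([encode_read(reference, pos, read) for pos, read in reads])
    restored = [decode_read(reference, rec) for rec in unpack(blob)]
    print(restored == [read for _, read in reads])
```

In this sketch a read cannot be reconstructed from its record without the reference, since the record holds only positions and differences. How D-RAM actually divides data between private and public storage is defined in the paper and is not modeled here.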
