A Fourier-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets

DNA sequencing plays an important role in bioinformatics research and matters for all organisms, especially humans, from multiple perspectives. These include understanding how specific mutations increase or decrease the risk of developing a disease or condition, and finding the connections between genotype and phenotype. Advances in high-throughput sequencing techniques, tools, and equipment have sharply reduced the cost of DNA sequencing and, as a result, have generated big genomic datasets. However, these advances pose great challenges for genomic data storage, analysis, and transfer. Accessing, manipulating, and sharing big genomic datasets is challenging in terms of time, size, and privacy. Because data size plays an important role in addressing these challenges, data minimization techniques have recently attracted much interest in the bioinformatics research community, and it is critical to develop new ways to minimize data size. This paper presents a new real-time data minimization mechanism for big genomic datasets that shortens the transfer time and makes the transfer more secure, even if a data breach occurs. Our method applies the random sampling of Fourier transform theory to big genomic datasets generated in real time, in both FASTA and FASTQ formats, and assigns the lowest possible codeword to the most frequent characters of the datasets. Our results indicate that the proposed data minimization algorithm reduces the size of FASTA datasets by up to 79% while being 98-fold faster and more secure than the standard data-encoding method. For FASTQ datasets, the results show a size reduction of up to 45% while being 57-fold faster than the standard data-encoding approach. Based on our results, we conclude that the proposed data minimization algorithm provides the best performance among current data-encoding approaches for big genomic datasets generated in real time.
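
To make the idea of giving the most frequent characters the cheapest codewords concrete, the following is a minimal sketch using a classic Huffman code. It is not the Fourier-based algorithm proposed in this paper, whose details are not given in the abstract. It assumes that "lowest possible codeword" can be read as "shortest codeword" and that the baseline is 8 bits per character. The function names and the toy sequence are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a prefix-free code: frequent symbols get shorter codewords."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, unique_id, {symbol: partial codeword}).
    # The unique id breaks ties so the dicts are never compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encoded_bits(text, code):
    """Total number of bits needed to encode text with the given code."""
    return sum(len(code[ch]) for ch in text)


sample = "ACGTAAAACCGTAAAN"  # hypothetical toy sequence, not real data
code = huffman_code(sample)
bits = encoded_bits(sample, code)
reduction = 1 - bits / (8 * len(sample))  # relative to 8 bits per character
print(code, bits, f"{reduction:.0%}")
```

In this toy input the most frequent base, `A`, receives a one-bit codeword, so the encoded size falls well below the fixed 8-bit baseline. The reduction achieved on real FASTA or FASTQ files depends on their actual character distribution.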
