A Deep Learning-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets

In the age of big genomic data, institutions such as the National Human Genome Research Institute (NHGRI) are challenged in their efforts to share large volumes of data between researchers, a process that has been plagued by unreliable transfers and slow speeds. These problems stem from the throughput bottlenecks of traditional transfer technologies. Two factors that affect the efficiency of data transmission are the channel bandwidth and the amount of data. Increasing the bandwidth is one way to transmit data efficiently, but it is not always possible because of resource limitations. Another way to maximize channel utilization is to decrease the number of bits needed to transmit a dataset. Traditionally, big genomic data is transmitted between two geographical locations using general-purpose protocols, such as hypertext transfer protocol (HTTP) and file transfer protocol (FTP) secure. In this paper, we present a novel deep learning-based data minimization algorithm that 1) minimizes the datasets during transfer over the carrier channels; and 2) protects the data from man-in-the-middle (MITM) and other attacks by changing the binary representation (content-encoding) several times for the same dataset: we assign different codewords to the same character in different parts of the dataset. Our data minimization strategy exploits the limited alphabet of DNA sequences and modifies the binary representation (codeword) of dataset characters using a deep learning-based convolutional neural network (CNN), so that minimum-length codewords are assigned to the high-frequency characters at different time slots during the transfer. This algorithm ensures transmission of big genomic DNA datasets with minimal bits and latency, yielding an efficient and expedient process.
Our heuristic model, simulation, and real implementation results indicate that the proposed data minimization algorithm is up to 99 times faster and more secure than the content-encoding scheme currently used in HTTP, and 96 times faster than FTP on the tested datasets. The protocol, developed in C#, will be made available to the wider genomics community and domain scientists.
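
The idea of assigning different codewords to the same character in different parts of the dataset can be illustrated with a minimal Python sketch. It is not the authors' C# protocol: plain per-block frequency counting stands in for the CNN, and the fixed prefix-free codeword set, the block size, and the function names (`block_codebook`, `encode`, `decode`) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical prefix-free codeword set, shortest first (an assumption).
CODEWORDS = ["0", "10", "110", "111"]
ALPHABET = "ACGT"


def block_codebook(block):
    """Give the shortest codeword to the most frequent nucleotide in this block.

    Simple counting is used here in place of the CNN described in the text.
    """
    counts = Counter(block)
    # Descending frequency; ties are broken alphabetically for determinism.
    ranked = sorted(ALPHABET, key=lambda ch: (-counts[ch], ch))
    return dict(zip(ranked, CODEWORDS))


def encode(sequence, block_size):
    """Encode each block with its own codebook; return (codebook, bits) pairs."""
    encoded = []
    for start in range(0, len(sequence), block_size):
        block = sequence[start:start + block_size]
        book = block_codebook(block)
        encoded.append((book, "".join(book[ch] for ch in block)))
    return encoded


def decode(encoded):
    """Invert each block's codebook and read the bits one prefix at a time."""
    chars = []
    for book, bits in encoded:
        inverse = {code: ch for ch, code in book.items()}
        current = ""
        for bit in bits:
            current += bit
            if current in inverse:
                chars.append(inverse[current])
                current = ""
    return "".join(chars)


# Two blocks with different dominant nucleotides.
sequence = "AAAAAAGC" + "TTTTTTCA"
encoded = encode(sequence, block_size=8)
total_bits = sum(len(bits) for _, bits in encoded)

print(encoded[0][0]["A"], encoded[1][0]["A"])  # codeword for 'A' in each block
print(total_bits, "bits vs", 8 * len(sequence), "bits as 8-bit characters")
assert decode(encoded) == sequence
```

In this toy input, `A` dominates the first block and `T` the second, so `A` is sent as a one-bit codeword in the first block and as a two-bit codeword in the second. The sketch ignores the cost of communicating each block's codebook to the receiver and makes no claim about the security properties or speedups reported above.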
