Design and Implementation of Network Transfer Protocol for Big Genomic Data

Genomic data is growing exponentially because next-generation sequencing (NGS) technologies can produce massive amounts of data in a short time. This big genomic data must be exchanged between different locations efficiently and reliably. Current network transfer protocols rely on the Transmission Control Protocol (TCP) or the User Datagram Protocol (UDP) and ignore the size and type of the data. Universal application-layer protocols such as HTTP are designed for a wide variety of data types and are not particularly efficient for genomic data. We therefore present a new data-aware transfer protocol for genomic data, called Genomic Text Transfer Protocol (GTTP), that increases network throughput and reduces latency. In this paper, we design and implement GTTP for big genomic DNA datasets on top of the Hypertext Transfer Protocol (HTTP). We modify the content-encoding of HTTP so that big genomic DNA datasets can be transferred over machine-to-machine (M2M) and client(s)-server topologies. Our results show that this modification reduces the transmitted data by 75% relative to the original data, while the client can still regenerate the data for bioinformatics analysis. Consequently, data transfer using GTTP is about 8 times faster than with regular HTTP.
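
The abstract does not spell out the encoding itself. As a rough illustration of how a 75% reduction can arise, the sketch below assumes the data is restricted to the four bases A, C, G and T, and packs each 8-bit character into 2 bits. The names `pack` and `unpack` are hypothetical and are not taken from the paper. A real content-encoding would also have to handle other symbols (such as N) and convey the sequence length to the client.

```python
# Illustrative sketch only: an ASSUMED 2-bit packing of DNA bases,
# not the encoding defined by GTTP.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string over {A, C, G, T} into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a final partial chunk so decoding is position-stable.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the original sequence; `length` trims the padding bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


if __name__ == "__main__":
    seq = "ACGTACGTAC"
    packed = pack(seq)
    assert unpack(packed, len(seq)) == seq
    print(len(seq), "characters ->", len(packed), "bytes")
```

Under this assumption, a sequence whose length is a multiple of four shrinks to exactly one quarter of its original size, which matches the 75% figure reported above, and the client recovers the text exactly.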
