Large transfers for data analytics on shared wide-area networks

Transferring data across wide-area networks (WANs) is one part of large-scale data analytics. Often, the data must be gathered (e.g., from remote sites), processed, possibly transferred again (e.g., for further processing), and then possibly disseminated. If the data-transfer stages are bottlenecks, the throughput of the overall analytics pipeline suffers. Although a variety of tools and protocols have been developed for large data transfers on WANs, most of the related work has been in the context of dedicated or non-shared networks. In practice, however, most networks are likely to be shared. We evaluate large data transfers on shared networks with large round-trip times (RTTs), as are found on many WANs. Using a variety of synthetic background network traffic (e.g., uniform, TCP, UDP, square waveform, bursty), we compare the performance of well-known protocols (e.g., GridFTP, UDT). On our emulated WAN, both GridFTP and UDT perform well in all-TCP situations, but UDT performs better when UDP-based background traffic is prominent.
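
To make the synthetic background-traffic shapes named above more concrete, here is a minimal sketch of how uniform, square-waveform, and bursty loads could be described as rate schedules. It is an illustration only: the abstract does not give the traffic generators or their parameters, so every function name, rate, period, and burst length below is a hypothetical placeholder.

```python
# Illustrative rate schedules for synthetic background traffic.
# All names and numbers are assumptions for this sketch, not values from the paper.
# Time is kept in integer milliseconds to avoid floating-point drift.

def uniform_rate(t_ms, rate_mbps=500.0):
    """Constant offered load, independent of time."""
    return rate_mbps

def square_wave_rate(t_ms, peak_mbps=500.0, period_ms=10_000, duty=0.5):
    """Alternate between peak load and silence; 'duty' is the on-fraction."""
    on_ms = int(period_ms * duty)
    return peak_mbps if (t_ms % period_ms) < on_ms else 0.0

def bursty_rate(t_ms, peak_mbps=500.0, burst_ms=1_000, gap_ms=4_000):
    """Short bursts at peak load separated by longer idle gaps."""
    cycle_ms = burst_ms + gap_ms
    return peak_mbps if (t_ms % cycle_ms) < burst_ms else 0.0

def mean_rate(rate_fn, duration_ms, step_ms=100):
    """Average offered load (Mbps) sampled every step_ms over duration_ms."""
    samples = [rate_fn(t) for t in range(0, duration_ms, step_ms)]
    return sum(samples) / len(samples)

if __name__ == "__main__":
    for name, fn in [("uniform", uniform_rate),
                     ("square", square_wave_rate),
                     ("bursty", bursty_rate)]:
        print(name, mean_rate(fn, 20_000))
```

Under these assumed parameters the three patterns share the same peak rate but differ in average load and in how abruptly the available bandwidth changes, which is the kind of variation a foreground transfer protocol has to react to on a shared link.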
