How GridFTP Pipelining, Parallelism and Concurrency Work: A Guide for Optimizing Large Dataset Transfers

Optimizing the transfer of large files over high-bandwidth networks is challenging because it depends on many parameters (e.g., network speed, round-trip time, and current traffic). The task becomes more complex when the dataset is composed of many small files: performance then depends not only on the characteristics of the transfer protocol and the network, but also on the number and size distribution of the files in the dataset. GridFTP is an advanced transfer tool that provides functions to overcome the bottlenecks of large dataset transfers. Three of its most important parameters are pipelining, parallelism, and concurrency. In this study, we examine the effects of these three parameters, provide models for optimizing them, define guidelines, and give an algorithm for their practical use in transferring large datasets made up of files of varying sizes.
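To illustrate why many small files are harder to transfer efficiently than a few large ones, the sketch below computes the bandwidth-delay product (BDP) of a link and compares it with a mean file size. The BDP formula is standard. The function `suggest_pipelining` and its rule of thumb (queue enough files to cover one BDP) are assumptions made for illustration only and are not the models or the algorithm of this study.

```python
import math


def bandwidth_delay_product(bandwidth_bps: float, rtt_ms: float) -> float:
    """Return the bandwidth-delay product in bytes.

    This is the amount of data that must be in flight to keep the link full.
    """
    return bandwidth_bps * rtt_ms / 8000  # bits -> bytes, ms -> s


def suggest_pipelining(mean_file_size_bytes: float, bdp_bytes: float) -> int:
    """Hypothetical heuristic (an assumption, not taken from the study):
    queue enough files back to back to cover one BDP, and at least one."""
    if mean_file_size_bytes <= 0:
        raise ValueError("mean file size must be positive")
    return max(1, math.ceil(bdp_bytes / mean_file_size_bytes))


# Example link: 10 Gbps with a 50 ms round-trip time.
bdp = bandwidth_delay_product(10e9, 50)

# Files of 1 MB are far smaller than the BDP; files of 1 GB are far larger.
small_file_depth = suggest_pipelining(1_000_000, bdp)
large_file_depth = suggest_pipelining(1_000_000_000, bdp)

print(f"BDP: {bdp / 1e6:.1f} MB")
print(f"pipelining depth for 1 MB files: {small_file_depth}")
print(f"pipelining depth for 1 GB files: {large_file_depth}")
```

Under this rule of thumb, files much smaller than the BDP call for a deep command queue, while files larger than the BDP gain nothing from one. Parallelism (multiple streams per file) and concurrency (multiple files in flight at once) are not covered by this sketch.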
