Hysteresis-based optimization of data transfer throughput

The achievable throughput of a data transfer depends on a variety of factors, including network bandwidth, round-trip time, background traffic, dataset size, and end-system configuration. Three application-layer transfer parameters (pipelining, parallelism, and concurrency) have been widely used in the literature for best-effort optimization of transfer throughput. However, identifying the best combination of these parameter settings for a specific data transfer request is highly challenging. In this paper, we analyze historical data consisting of 70 million file transfers, apply data mining techniques to extract the hidden relations between the parameters and the optimal throughput, and propose a novel approach based on hysteresis to predict the optimal parameter settings.
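
As a rough illustration of the general idea of choosing parameters from past transfers, the sketch below picks the (pipelining, parallelism, concurrency) setting that achieved the highest mean throughput among the most similar historical records. It is not the method proposed in the paper: the record layout, the toy `HISTORY` values, the similarity measure, and the function name `suggest_parameters` are all assumptions made for the example.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical historical log (illustrative values, not measured data):
# (avg_file_size_mb, rtt_ms, pipelining, parallelism, concurrency, throughput_mbps)
HISTORY = [
    (1, 40, 8, 1, 4, 900.0),
    (1, 40, 1, 1, 1, 150.0),
    (2, 45, 8, 1, 4, 950.0),
    (1024, 40, 1, 4, 2, 4200.0),
    (1024, 42, 1, 1, 1, 1800.0),
    (2048, 38, 1, 4, 2, 4400.0),
]


def suggest_parameters(file_size_mb, rtt_ms, history=HISTORY, k=3):
    """Return the (pipelining, parallelism, concurrency) tuple with the
    highest mean throughput among the k most similar past transfers."""

    def distance(record):
        # Assumed similarity measure: file sizes are compared on a log
        # scale, round-trip times on a linear scale (per 100 ms).
        size_gap = abs(math.log10(record[0]) - math.log10(file_size_mb))
        rtt_gap = abs(record[1] - rtt_ms) / 100.0
        return size_gap + rtt_gap

    nearest = sorted(history, key=distance)[:k]

    # Group observed throughputs by the parameter setting that produced them.
    by_setting = defaultdict(list)
    for _size, _rtt, pp, p, cc, throughput in nearest:
        by_setting[(pp, p, cc)].append(throughput)

    return max(by_setting, key=lambda setting: mean(by_setting[setting]))
```

With the toy log above, a request for small files is matched to the setting with deeper pipelining and higher concurrency, while a request for large files is matched to the setting with more parallel streams. A real historical log would need far more records and a more careful similarity measure.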
