StorkCloud: data transfer scheduling and optimization as a service

Wide-area transfer of large data sets remains a major challenge despite the deployment of high-bandwidth networks with speeds reaching 100 Gbps. Most users fail to obtain even a fraction of the theoretical speeds promised by these networks, so effective use of the available network capacity has become increasingly important for wide-area data movement. We have developed StorkCloud, a data transfer scheduling and optimization system offered as a Cloud-hosted service, designed to mitigate the large-scale end-to-end data movement bottleneck by efficiently utilizing the underlying networks and by effectively scheduling and optimizing data transfers. In this paper, we present the initial design and prototype implementation of StorkCloud, and show its effectiveness in transferring large data sets across geographically distant storage sites, data centers, and collaborating institutions.