Modeling throughput sampling size for a cloud-hosted data scheduling and optimization service

As big-data processing and analysis come to dominate the use of cloud systems, the need for cloud-hosted data scheduling and optimization services increases. A key component of such a service is the ability to estimate available bandwidth and achievable throughput, since all scheduling and optimization decisions build on these estimates. The biggest challenge in providing this capability is deciding dynamically what proportion of the actual dataset must be transferred to yield an accurate estimate of the bandwidth and throughput that transferring the whole dataset would achieve. That proportion of the data is called the sampling size (or probe size). Although small, fixed sampling sizes worked well for the high-latency, low-bandwidth networks of the past, high-bandwidth networks require much larger and more dynamic sampling sizes, since an accurate estimate now also depends on how quickly the transfer protocol can saturate the high-capacity link. In this study, we present a model that decides the optimal sampling size based on the data size and the estimated capacity of the network. Our results show that, in the majority of cases, the predicted sampling size is close to the targeted best sampling size for a given file transfer.
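
The dependence on how quickly the protocol saturates the link can be illustrated with a rough calculation. The sketch below is not the model presented in the study; it is a back-of-the-envelope illustration that assumes idealized, loss-free TCP slow start (the congestion window doubles every round trip), a 1460-byte segment size, and an initial window of 10 segments. The function name `slow_start_bytes` and all parameter values are hypothetical and chosen only for illustration.

```python
def slow_start_bytes(capacity_bps, rtt_s, mss=1460, init_cwnd_segments=10):
    """Estimate bytes sent before an idealized TCP flow fills the link.

    Assumptions (illustrative only): no packet loss, the congestion window
    doubles once per round trip, and the link is "saturated" once the window
    reaches the bandwidth-delay product (BDP).
    Returns (bytes_sent_before_saturation, round_trips_needed).
    """
    bdp_bytes = capacity_bps / 8 * rtt_s   # bytes in flight needed to fill the link
    cwnd = mss * init_cwnd_segments        # initial congestion window in bytes
    sent = 0
    rounds = 0
    while cwnd < bdp_bytes:
        sent += cwnd                       # one window is sent per round trip
        cwnd *= 2                          # slow start: window doubles each RTT
        rounds += 1
    return sent, rounds


if __name__ == "__main__":
    # Compare a low-bandwidth and a high-bandwidth path with the same 100 ms RTT.
    for label, capacity in [("10 Mbps", 10e6), ("10 Gbps", 10e9)]:
        sent, rounds = slow_start_bytes(capacity, rtt_s=0.1)
        print(f"{label}: {sent / 1e6:.1f} MB sent over {rounds} round trips")
```

Under these assumptions, the 10 Mbps path reaches its bandwidth-delay product after about 0.2 MB, while the 10 Gbps path needs roughly 240 MB, so a probe sized for the slower path would end long before the faster link is saturated.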
