Big data transfer optimization based on offline knowledge discovery and adaptive sampling

The amount of data moved over dedicated and non-dedicated network links is growing much faster than network capacity, yet current solutions fail to guarantee even the promised achievable transfer throughput. In this paper, we propose a novel dynamic throughput optimization model based on mathematical modeling with offline knowledge discovery/analysis and adaptive online decision making. In the offline analysis, we mine historical transfer logs to discover knowledge about transfer characteristics. The online phase combines this discovered knowledge with real-time investigation of the network condition to optimize the protocol parameters. Because real-time investigation is expensive and provides only partial knowledge of the current network status, our model uses historical knowledge about the network and data to reduce the real-time investigation overhead while ensuring near-optimal throughput for each transfer. We tested our approach over different networks with different datasets; it outperformed its closest competitor by 1.7x and the default case by 5x. It also achieved up to 93% accuracy compared with the optimal achievable throughput on those networks.
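The two-phase idea described above can be illustrated with a minimal sketch. It is not the paper's model: the log format, the choice of concurrency as the only tuned parameter, and the names `build_offline_knowledge`, `choose_concurrency`, and `probe` are all assumptions made for illustration. The offline step ranks parameter values by their mean historical throughput, and the online step spends only a small, bounded number of sample transfers on the top-ranked candidates.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical log record: (network_key, concurrency, observed_throughput_mbps)
LogEntry = Tuple[str, int, float]


def build_offline_knowledge(logs: List[LogEntry]) -> Dict[str, List[int]]:
    """Offline phase (illustrative): for each network, rank concurrency
    levels by their mean throughput in the historical logs, best first."""
    samples: Dict[str, Dict[int, List[float]]] = defaultdict(lambda: defaultdict(list))
    for network, concurrency, throughput in logs:
        samples[network][concurrency].append(throughput)

    ranked: Dict[str, List[int]] = {}
    for network, by_cc in samples.items():
        means = {cc: sum(vals) / len(vals) for cc, vals in by_cc.items()}
        ranked[network] = sorted(means, key=lambda cc: means[cc], reverse=True)
    return ranked


def choose_concurrency(
    network: str,
    knowledge: Dict[str, List[int]],
    probe: Callable[[int], float],
    max_probes: int = 2,
) -> int:
    """Online phase (illustrative): run at most `max_probes` sample transfers,
    one per top historical candidate, and keep the best-performing one.
    `probe` is an assumed callback that returns measured throughput."""
    candidates = knowledge.get(network, [1])[:max_probes]
    return max(candidates, key=probe)


if __name__ == "__main__":
    history = [
        ("wan", 1, 100.0),
        ("wan", 4, 400.0),
        ("wan", 4, 420.0),
        ("wan", 8, 350.0),
        ("wan", 16, 200.0),
    ]
    knowledge = build_offline_knowledge(history)
    # Simulated current network state: concurrency 8 now beats 4.
    current = {4: 300.0, 8: 380.0}
    print(choose_concurrency("wan", knowledge, lambda cc: current.get(cc, 0.0)))
```

Limiting the online step to the top historical candidates is what keeps the sampling overhead low in this sketch; a real system would tune more parameters and adapt the number of probes to how well the history matches current conditions.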
