Moving Big Data to The Cloud: An Online Cost-Minimizing Approach

Cloud computing, rapidly emerging as a new computation paradigm, provides agile and scalable resource access in a utility-like fashion, especially for the processing of big data. An important open issue is how to efficiently move data, generated at different geographical locations over time, into a cloud for effective processing. The de facto approach of hard drive shipping is neither flexible nor secure. This work studies timely, cost-minimizing upload of massive, dynamically generated, geo-dispersed data into the cloud for processing with a MapReduce-like framework. Targeting a cloud encompassing disparate data centers, we model a cost-minimizing data migration problem and propose two online algorithms, an online lazy migration (OLM) algorithm and a randomized fixed horizon control (RFHC) algorithm, which optimize at any given time the choice of the data center for data aggregation and processing, as well as the routes for transmitting data there. Extensive experiments in realistic settings compare these online algorithms with offline algorithms and demonstrate close-to-offline-optimum performance of the online algorithms.
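To make the idea of "lazy" migration concrete, here is a minimal sketch of a generic lazy-switching rule in the ski-rental style. It is not the paper's OLM algorithm, whose exact rule the abstract does not state. The function name `lazy_migration`, the per-slot cost dictionaries, the single `migration_cost`, and the `threshold` parameter are all assumptions made for illustration. The rule stays at the current data center until the extra cost accumulated by staying exceeds a multiple of the migration cost, then moves to the data center that is cheapest in the current slot.

```python
from typing import Dict, List


def lazy_migration(
    slot_costs: List[Dict[str, float]],
    migration_cost: float,
    threshold: float = 1.0,
) -> List[str]:
    """Pick an aggregation data center for each time slot (illustrative only).

    slot_costs[t][dc] is a hypothetical cost of aggregating and processing
    at data center `dc` during slot t; it is revealed only when slot t starts.
    """
    if not slot_costs:
        return []

    # Start at the data center that is cheapest in the first slot.
    current = min(slot_costs[0], key=slot_costs[0].get)
    regret = 0.0  # extra cost paid by staying put since the last migration
    schedule: List[str] = []

    for costs in slot_costs:
        best = min(costs, key=costs.get)
        regret += costs[current] - costs[best]
        # Migrate lazily: only once staying has cost more than the
        # (assumed) threshold multiple of the one-off migration cost.
        if regret > threshold * migration_cost:
            current = best
            regret = 0.0
        schedule.append(current)

    return schedule


if __name__ == "__main__":
    demo = [
        {"A": 1.0, "B": 5.0},
        {"A": 4.0, "B": 1.0},
        {"A": 4.0, "B": 1.0},
        {"A": 4.0, "B": 1.0},
    ]
    print(lazy_migration(demo, migration_cost=5.0))
```

In the demo, the rule stays at data center A for one slot after B becomes cheaper and migrates only when the accumulated extra cost passes the migration cost. This delay is what keeps such a rule from oscillating when costs fluctuate briefly.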
