Key based Reducer Placement for Data Analytics across Data Centers Considering Bi-level Resource Provision in Cloud Computing

Because data sources such as astronomy and sales are inherently distributed, or because legal restrictions apply, it is not always practical to store worldwide data in a single data center (DC). Hadoop is a commonly accepted framework for big data analytics, but it can only process data within one DC. The distribution of data therefore necessitates the study of Hadoop across DCs. In this setting, mappers can be placed in the local DCs, but where to place reducers is a major challenge, since each reducer needs to process almost all map output across all involved DCs. To reduce cost, a key-based scheme is proposed that respects the locality principle of traditional Hadoop as much as possible while deploying reducers at lower cost. Considering resource provision at both the data center level and the server level, the problem is formalized as a bi-level program and solved by a tailored two-level group genetic algorithm (TLGGA). Extensive simulations demonstrate the effectiveness of TLGGA: it outperforms the baseline and the state-of-the-art mechanisms by 49% and 40%, respectively.
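To make the idea of key-based placement concrete, the following is a minimal sketch and not the TLGGA described above. It assumes a hypothetical cost model in which moving one byte of map output between two DCs has a fixed price, and it greedily places the reducer for each key in the DC that minimizes the cost of pulling that key's map output from the other DCs. The names `place_reducers_by_key`, `map_output`, and `transfer_cost` are illustrative assumptions. Server-level provisioning, capacity limits, and the bi-level formulation are all left out.

```python
def place_reducers_by_key(map_output, transfer_cost):
    """Greedy key-based reducer placement (illustrative only).

    map_output:    {dc: {key: bytes of map output for that key in that DC}}
    transfer_cost: {(src_dc, dst_dc): assumed cost per byte moved}
    Returns (placement, total_cost), where placement maps key -> chosen DC.
    """
    dcs = list(map_output)
    keys = sorted({k for per_dc in map_output.values() for k in per_dc})
    placement = {}
    total_cost = 0.0
    for key in keys:
        best_dc, best_cost = None, float("inf")
        for dst in dcs:
            # Cost of shipping this key's output from every other DC to dst;
            # output already in dst is local and assumed free.
            cost = sum(
                map_output[src].get(key, 0) * transfer_cost[(src, dst)]
                for src in dcs
                if src != dst
            )
            if cost < best_cost:
                best_dc, best_cost = dst, cost
        placement[key] = best_dc
        total_cost += best_cost
    return placement, total_cost


# Hypothetical two-DC example: each key is dominated by one DC.
example_output = {
    "dc_a": {"k1": 100, "k2": 10},
    "dc_b": {"k1": 20, "k2": 80},
}
example_cost = {("dc_a", "dc_b"): 1.0, ("dc_b", "dc_a"): 1.0}
example_placement, example_total = place_reducers_by_key(example_output, example_cost)
```

In this example each reducer ends up in the DC that already holds most of its key's map output, which is the locality intuition behind a key-based scheme. A real solution would also have to account for resource prices and capacity at both the DC and server levels.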
