Data Locality-Aware Query Evaluation for Big Data Analytics in Distributed Clouds

As more enterprises and organizations outsource their IT services to distributed clouds to save costs, the historical and operational data generated by these services grows exponentially and is usually stored in data centers at different geographic locations in the distributed cloud. Such data, referred to as big data, has become an invaluable asset to many businesses and organizations, as it can help them identify business advantages and make strategic decisions. Big data analytics has thus emerged as a main research topic in distributed cloud computing. Query evaluation for big data analytics poses two challenges: (i) the cloud resource demands of a query typically exceed what any single data center can supply and therefore span multiple data centers, and (ii) the source data of the query is located at different data centers. Together these create heavy data traffic among the data centers in the distributed cloud, resulting in high communication costs. A fundamental question for query evaluation in big data analytics is thus how to admit as many such queries as possible while keeping the cumulative communication cost minimized. In this paper, we investigate this question by formulating an online query evaluation problem for big data analytics in distributed clouds, with the objective of maximizing the query acceptance ratio while minimizing the cumulative communication cost of query evaluation. We first propose a novel metric that models the resource utilization of data centers by incorporating their resource workloads and the resource demands of each query. We then devise an efficient online algorithm. Finally, we evaluate the performance of the proposed algorithm through extensive simulation experiments. The results demonstrate that the proposed algorithm is promising and outperforms other heuristics.
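To make the problem statement concrete, the following Python sketch illustrates online query admission under strong simplifying assumptions. It is not the algorithm proposed in the paper, whose details the abstract does not give. The names `DataCenter`, `Query`, `comm_cost`, `admit`, and `run` are hypothetical, each query is assumed to be evaluated at a single data center (the paper considers demands that span several), and the scoring rule, communication cost scaled by post-admission utilization, is only an illustrative stand-in for the paper's metric.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    capacity: float      # total computing resource (abstract units)
    used: float = 0.0    # resource currently occupied by admitted queries

    def residual(self) -> float:
        return self.capacity - self.used


@dataclass
class Query:
    name: str
    demand: float        # resource needed to evaluate the query
    sources: dict        # data center name -> volume of source data stored there


def comm_cost(query, target, unit_cost):
    """Cost of shipping all non-local source data to the target data center."""
    return sum(volume * unit_cost[(src, target)]
               for src, volume in query.sources.items()
               if src != target)


def admit(query, centers, unit_cost):
    """Greedy online step: pick the feasible data center with the lowest score.

    Assumed score: communication cost scaled by (1 + utilization after
    admission), with utilization as tie-breaker. Returns (name, cost) if the
    query is admitted, or None if no data center can host it.
    """
    best = None
    for dc in centers.values():
        if dc.residual() < query.demand:
            continue  # not enough residual capacity at this data center
        cost = comm_cost(query, dc.name, unit_cost)
        utilization = (dc.used + query.demand) / dc.capacity
        key = (cost * (1.0 + utilization), utilization)
        if best is None or key < best[0]:
            best = (key, dc.name, cost)
    if best is None:
        return None
    _, name, cost = best
    centers[name].used += query.demand
    return name, cost


def run(queries, centers, unit_cost):
    """Process queries in arrival order; return (acceptance ratio, total cost)."""
    admitted, total_cost = 0, 0.0
    for query in queries:
        result = admit(query, centers, unit_cost)
        if result is not None:
            admitted += 1
            total_cost += result[1]
    return admitted / len(queries), total_cost
```

Under these assumptions, `run` reports the two quantities the problem trades off: the fraction of queries admitted and the cumulative communication cost they incur. A query is rejected only when no data center has enough residual capacity for it.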
