QoS-aware data replications and placements for query evaluation of big data analytics

Enterprise users at different geographic locations generate large volumes of data and store them at geographically distributed datacenters. These users may also issue ad hoc big data analytic queries on the stored data to identify valuable information that helps them make strategic decisions. However, querying such large volumes of data is usually time-consuming and costly. Sometimes users are interested in timely approximate query results rather than exact ones. In that case, applications must trade timeliness against accuracy: they either tolerate the latency of delivering more accurate results, or accept the error of results computed from samples of the data rather than from the entire dataset. In this paper, we study QoS-aware data replications and placements for approximate query evaluation of big data analytics in a distributed cloud, where the original (source) data of a query is spread across geo-distributed datacenters. We focus on placing samples of the source data, with given error bounds, at strategic datacenters to meet users' stringent query response time requirements. We propose an efficient algorithm for evaluating a set of big data analytic queries that aims to minimize the evaluation cost of the queries while meeting their response time requirements. We evaluate the proposed algorithm through experimental simulations, and the results show that it is promising.
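To make the problem setting concrete, the following is a minimal sketch of the feasibility and cost trade-off described above. It is not the algorithm proposed in the paper, whose details the abstract does not give. It assumes that sample replicas have already been placed, and it simply picks, for one query, the cheapest datacenter that holds a sample of the query's dataset and can answer within the deadline. All names (`Query`, `cheapest_feasible_site`, the datacenter labels, and the latency and cost figures) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Query:
    """A hypothetical analytic query with a response time requirement."""
    name: str
    home: str           # datacenter where the query is issued
    dataset: str        # source dataset whose sample the query reads
    deadline_ms: float  # response time requirement


def cheapest_feasible_site(
    query: Query,
    replicas: Dict[str, List[str]],
    latency_ms: Dict[Tuple[str, str], float],
    cost: Dict[str, float],
) -> Optional[str]:
    """Return the cheapest datacenter that holds a sample replica of the
    query's dataset and meets its deadline, or None if no site qualifies."""
    feasible = [
        dc
        for dc in replicas.get(query.dataset, [])
        if latency_ms[(query.home, dc)] <= query.deadline_ms
    ]
    if not feasible:
        return None
    # Ties on cost are broken by datacenter name to keep the result deterministic.
    return min(feasible, key=lambda dc: (cost[dc], dc))


# Illustrative (made-up) inputs: samples of "sales" are replicated at two sites.
replicas = {"sales": ["syd", "fra"]}
latency_ms = {("syd", "syd"): 5.0, ("syd", "fra"): 280.0}
cost = {"syd": 3.0, "fra": 1.0}

tight = Query("q1", home="syd", dataset="sales", deadline_ms=100.0)
loose = Query("q2", home="syd", dataset="sales", deadline_ms=300.0)

tight_site = cheapest_feasible_site(tight, replicas, latency_ms, cost)
loose_site = cheapest_feasible_site(loose, replicas, latency_ms, cost)
```

With these inputs, the tight deadline forces the query onto the more expensive local replica, while the looser deadline lets it use the cheaper remote one. The placement problem studied in the paper additionally has to decide where the sample replicas go in the first place.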
