Unifying Data and Replica Placement for Data-intensive Services in Geographically Distributed Clouds

As data management applications rely increasingly on cloud computing, solving the data placement problem has become critically important. The classical data placement problem asks how to partition a set of data items across distributed data centers, optionally replicating them, so as to minimize the overall network communication cost. Despite significant advances in data placement research, replica placement has seldom been studied together with data placement. Most existing solutions take a two-phase approach: 1) data placement, followed by 2) replication. Replication should instead be treated as an integral part of data placement, and the two should be solved as a joint optimization problem. In this paper, we propose CPR, a unified data placement paradigm that formulates the placement and replication of data for data-intensive services in geographically distributed clouds as a joint optimization problem. At the core of CPR lies an overlapping correlation clustering algorithm that can assign a data item to multiple data centers, which lets us solve data placement and replication jointly. Experiments on a real-world, trace-based online social network dataset show that CPR is effective and scalable. Empirically, it is 35% better in efficacy on the evaluated metrics and up to 8 times faster in execution time than state-of-the-art techniques.
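
To make the idea of an overlapping assignment concrete, the following minimal sketch compares a single-copy placement with one in which a data item belongs to two data centers. It is not the CPR algorithm. The cost model (one unit per remote read), the names `communication_cost` and `storage_used`, and the toy requests are all assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical types for this illustration (not taken from the paper).
Placement = Dict[str, Set[str]]   # data item -> data centers holding a copy
Request = Tuple[str, List[str]]   # (origin data center, items read together)


def communication_cost(placement: Placement, requests: List[Request]) -> int:
    """Assumed cost model: a read costs 1 when the origin holds no copy of the item."""
    cost = 0
    for origin, items in requests:
        for item in items:
            if origin not in placement[item]:
                cost += 1
    return cost


def storage_used(placement: Placement) -> int:
    """Total number of stored copies across all data centers."""
    return sum(len(dcs) for dcs in placement.values())


# Toy workload: "bob" is read from both the "eu" and the "us" data center.
requests: List[Request] = [
    ("eu", ["alice", "bob"]),
    ("us", ["bob", "carol"]),
    ("eu", ["alice", "bob"]),
]

# Each item lives in exactly one data center (a plain partition).
single_copy: Placement = {"alice": {"eu"}, "bob": {"eu"}, "carol": {"us"}}

# "bob" is assigned to two data centers (an overlapping assignment).
overlapping: Placement = {"alice": {"eu"}, "bob": {"eu", "us"}, "carol": {"us"}}

for name, placement in [("single_copy", single_copy), ("overlapping", overlapping)]:
    print(name, communication_cost(placement, requests), storage_used(placement))
```

Under these assumptions, the overlapping assignment removes the remote read of `bob` at the price of one extra stored copy. A joint formulation weighs this trade-off while choosing the placement, rather than fixing a partition first and adding replicas afterwards.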
