SpeCH: A scalable framework for data placement of data-intensive services in geo-distributed clouds

Abstract: The advent of big data analytics and cloud computing has prompted widespread research on the data placement problem. Because data-intensive services access multiple datasets within each transaction, traditional schemes that partition data uniformly across distributed nodes, as employed by many popular data stores such as HDFS or Cassandra, may cause network congestion and thereby reduce system throughput. In this article, we propose a scalable and unified framework for placing the data of data-intensive services into geographically distributed clouds. The framework introduces a new paradigm for partitioning a set of data items across geo-distributed clouds using Spectral Clustering on Hypergraphs, and is therefore called SpeCH. Scaling spectral methods to large workloads is challenging, since computing the spectrum of the hypergraph Laplacian is computationally intensive. SpeCH provides two solutions to this problem: (1) an algorithm, called SpectralApprox, that leverages randomized techniques to obtain low-rank approximations of the hypergraph matrix with bounded guarantees, thereby significantly improving the efficiency of spectral clustering while also providing high-quality solutions in practice; and (2) an algorithm, called SpectralDist, that exploits the highly parallel nature of spectral clustering and uses Apache Spark to speed up the process while retaining the same quality guarantees as the exact algorithm. Additionally, being distributed in nature, SpectralDist enables SpeCH to perform data placement on workloads that require resources beyond the capacity of a single machine. Experiments on a real-world, trace-based online social network dataset show that SpeCH is effective, efficient, and scalable. Empirically, SpectralApprox is comparable to state-of-the-art techniques in efficacy on the evaluated metrics while being up to 10 times faster in execution time. On the other hand, although SpectralApprox is 7–8 times faster than SpectralDist, the latter is up to 50% better in efficacy on the evaluated metrics.
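
To make the idea of spectral clustering on hypergraphs with a randomized low-rank approximation concrete, the following is a minimal sketch and not the authors' SpectralApprox implementation. It assumes the normalized hypergraph Laplacian of Zhou et al., a randomized range finder in the style of Halko et al., and a toy k-means with farthest-point initialisation. The names `hypergraph_embedding`, `kmeans`, and `place_items`, as well as the parameters `oversample` and `n_power_iter`, are hypothetical, and every data item is assumed to belong to at least one hyperedge.

```python
import numpy as np


def hypergraph_embedding(H, w, k, oversample=5, n_power_iter=2, seed=0):
    """Approximate top-k spectral embedding of a hypergraph (illustrative).

    H: (n_items, n_edges) 0/1 incidence matrix, every row and column non-empty.
    w: positive hyperedge weights, e.g. how often items are accessed together.
    """
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    dv = H @ w          # weighted degree of each data item
    de = H.sum(axis=0)  # number of items in each hyperedge
    # A @ A.T equals Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}, so the leading left
    # singular vectors of A are the smallest eigenvectors of I - A @ A.T.
    A = (H / np.sqrt(dv)[:, None]) * np.sqrt(w / de)[None, :]

    # Randomized range finder: sample the column space of A, then refine it.
    rng = np.random.default_rng(seed)
    ell = min(k + oversample, min(A.shape))
    Y = A @ rng.standard_normal((A.shape[1], ell))
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(Y)
        Y = A @ (A.T @ Q)
    Q, _ = np.linalg.qr(Y)

    # Exact SVD of the small projected matrix, lifted back to item space.
    Ub, _, _ = np.linalg.svd(Q.T @ A, full_matrices=False)
    U = (Q @ Ub)[:, :k]
    norms = np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return U / norms    # row-normalized embedding, one row per data item


def kmeans(X, k, n_iter=50):
    """Toy k-means with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def place_items(H, w, k):
    """Assign each data item to one of k partitions (e.g. datacenters)."""
    return kmeans(hypergraph_embedding(H, w, k), k)
```

Here each hyperedge stands for one transaction and contains the data items that transaction touches, so items that are frequently accessed together tend to land in the same partition. The sketch ignores concerns a real placement framework would have to handle, such as datacenter capacities, load balance, and inter-datacenter latencies.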
