Distributed Data Placement via Graph Partitioning

With the widespread use of shared-nothing clusters of servers, there has been a proliferation of distributed object stores that offer high availability, reliability and enhanced performance for MapReduce-style workloads. However, relational workloads cannot always be evaluated efficiently using MapReduce without extensive data migrations, which cause network congestion and reduced query throughput. We study the problem of computing data placement strategies that minimize the data communication costs incurred by typical relational query workloads in a distributed setting. Our main contribution is a reduction of the data placement problem to the well-studied problem of {\sc Graph Partitioning}, which is NP-Hard but for which efficient approximation algorithms exist. The novelty and significance of this result lie in representing the communication cost exactly and using standard graphs instead of hypergraphs, which were used in prior work on data placement that optimized for different objectives (not communication cost). We study several practical extensions of the problem: with load balancing, with replication, with materialized views, and with complex query plans consisting of sequences of intermediate operations that may be computed on different servers. We provide integer linear programs (IPs) that may be used with any IP solver to find an optimal data placement. For the no-replication case, we use publicly available graph partitioning libraries (e.g., METIS) to efficiently compute nearly-optimal solutions. For the versions with replication, we introduce two heuristics that utilize the {\sc Graph Partitioning} solution of the no-replication case. Using the TPC-DS workload, it may take an IP solver weeks to compute an optimal data placement, whereas our reduction produces nearly-optimal solutions in seconds.

[1]  Shamkant B. Navathe,et al.  Vertical partitioning for database design: a graphical algorithm , 1989, SIGMOD '89.

[2]  Carlo Curino,et al.  Schism , 2010, Proc. VLDB Endow..

[3]  Theodore Johnson,et al.  Stream warehousing with DataDepot , 2009, SIGMOD Conference.

[4]  Tamara G. Kolda,et al.  Graph partitioning models for parallel computing , 2000, Parallel Comput..

[5]  Nicolas Bruno,et al.  Automated partitioning design in parallel database systems , 2011, SIGMOD '11.

[6]  Werner Vogels,et al.  Dynamo: amazon's highly available key-value store , 2007, SOSP.

[7]  Vivek R. Narasayya,et al.  Integrating vertical and horizontal partitioning into automated physical database design , 2004, SIGMOD '04.

[8]  Surajit Chaudhuri,et al.  An overview of query optimization in relational systems , 1998, PODS.

[9]  Samir Khuller,et al.  Data Placement and Replica Selection for Improving Co-location in Distributed Environments , 2013, ArXiv.

[10]  David J. DeWitt,et al.  Hybrid-Range Partitioning Strategy: A New Declustering Strategy for Multiprocessor Database Machines , 1990, VLDB.

[11]  Shashi Shekhar,et al.  Partitioning Similarity Graphs: A Framework for Declustering Problems , 1996, Inf. Syst..

[12]  Chun Zhang,et al.  Automating physical database design in a parallel database , 2002, SIGMOD '02.

[13]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[14]  Yuanyuan Tian,et al.  CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop , 2011, Proc. VLDB Endow..

[15]  Daniel C. Zilio,et al.  Physical database design decision algorithms and concurrent reorganization for parallel database systems , 1998 .