Automated partitioning design in parallel database systems

In recent years, Massively Parallel Processors (MPPs) have gained ground enabling vast amounts of data processing. In such environments, data is partitioned across multiple compute nodes, which results in dramatic performance improvements during parallel query execution. To evaluate certain relational operators in a query correctly, data sometimes needs to be re-partitioned (i.e., moved) across compute nodes. Since data movement operations are much more expensive than relational operations, it is crucial to design a suitable data partitioning strategy that minimizes the cost of such expensive data transfers. A good partitioning strategy strongly depends on how the parallel system would be used. In this paper we present a partitioning advisor that recommends the best partitioning design for an expected workload. Our tool recommends which tables should be replicated (i.e., copied into every compute node) and which ones should be distributed according to specific column(s) so that the cost of evaluating similar workloads is minimized. In contrast to previous work, our techniques are deeply integrated with the underlying parallel query optimizer, which results in more accurate recommendations in a shorter amount of time. Our experimental evaluation using a real MPP system, Microsoft SQL Server 2008 Parallel Data Warehouse, with both real and synthetic workloads shows the effectiveness of the proposed techniques and the importance of deep integration of the partitioning advisor with the underlying query optimizer.

[1]  Nicolas Bruno,et al.  Configuration-parametric query optimization for physical design tuning , 2008, SIGMOD Conference.

[2]  Chun Zhang,et al.  Automating physical database design in a parallel database , 2002, SIGMOD '02.

[3]  Surajit Chaudhuri,et al.  Automated Selection of Materialized Views and Indexes in SQL Databases , 2000, VLDB.

[4]  Connolly,et al.  Database Systems , 2004 .

[5]  David J. DeWitt,et al.  Parallel database systems: the future of high performance database systems , 1992, CACM.

[6]  Anastasia Ailamaki,et al.  Efficient Use of the Query Optimizer for Automated Database Design , 2007, VLDB.

[7]  Abraham Silberschatz,et al.  Efficient and Acurate Cost Models for Parallel Query Optimization. , 1996, ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems.

[8]  Goetz Graefe The Cascades Framework for Query Optimization , 1995, IEEE Data Eng. Bull..

[9]  Daniel C. Zilio,et al.  Physical database design decision algorithms and concurrent reorganization for parallel database systems , 1998 .

[10]  David J. DeWitt,et al.  Hybrid-Range Partitioning Strategy: A New Declustering Strategy for Multiprocessor Database Machines , 1990, VLDB.

[11]  Surajit Chaudhuri,et al.  An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server , 1997, VLDB.

[12]  The Vertica ® Analytic Database Technical Overview White Paper a Dbms Architecture Optimized for Next-generation Data Warehousing , .

[13]  Goetz Graefe,et al.  The Volcano optimizer generator: extensibility and efficient search , 1993, Proceedings of IEEE 9th International Conference on Data Engineering.

[14]  Douglas Stott Parker,et al.  Map-reduce-merge: simplified relational data processing on large clusters , 2007, SIGMOD '07.

[15]  Carlo Curino,et al.  Schism , 2010, Proc. VLDB Endow..

[16]  Jingren Zhou,et al.  SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[17]  Patricia G. Selinger,et al.  Access path selection in a relational database management system , 1979, SIGMOD '79.

[18]  Wenfei Fan,et al.  Distributed query evaluation with performance guarantees , 2007, SIGMOD '07.

[19]  Michael Stonebraker,et al.  MapReduce and parallel DBMSs: friends or foes? , 2010, CACM.

[20]  Kenneth Steiglitz,et al.  Characterization and Theoretical Comparison of Branch-and-Bound Algorithms for Permutation Problems , 1974, JACM.

[21]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[22]  Vivek R. Narasayya,et al.  Integrating vertical and horizontal partitioning into automated physical database design , 2004, SIGMOD '04.

[23]  Shamkant B. Navathe,et al.  Two techniques for on-line index modification in shared nothing parallel databases , 1996, SIGMOD '96.

[24]  Erhard Rahm,et al.  Multi-Dimensional Database Allocation for Parallel Data Warehouses , 2000, VLDB.

[25]  Abraham Silberschatz,et al.  Efficient and accurate cost models for parallel query optimization (extended abstract) , 1996, PODS.

[26]  Abraham Silberschatz,et al.  HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads , 2009, Proc. VLDB Endow..

[27]  Sam Lightstone,et al.  DB2 Design Advisor: Integrated Automatic Physical Database Design , 2004, VLDB.

[28]  Surajit Chaudhuri,et al.  AutoAdmin “what-if” index analysis utility , 1998, SIGMOD '98.

[29]  Anastasia Ailamaki,et al.  Efficient use of the query optimizer for automated physical design , 2007, VLDB 2007.

[30]  Alan P. Sprague,et al.  Performance of parallel branch-and-bound algorithms , 1985, IEEE Transactions on Computers.