Paper information - Automated partitioning design in parallel database systems

Automated partitioning design in parallel database systems

In recent years, Massively Parallel Processors (MPPs) have gained ground enabling vast amounts of data processing. In such environments, data is partitioned across multiple compute nodes, which results in dramatic performance improvements during parallel query execution. To evaluate certain relational operators in a query correctly, data sometimes needs to be re-partitioned (i.e., moved) across compute nodes. Since data movement operations are much more expensive than relational operations, it is crucial to design a suitable data partitioning strategy that minimizes the cost of such expensive data transfers. A good partitioning strategy strongly depends on how the parallel system would be used. In this paper we present a partitioning advisor that recommends the best partitioning design for an expected workload. Our tool recommends which tables should be replicated (i.e., copied into every compute node) and which ones should be distributed according to specific column(s) so that the cost of evaluating similar workloads is minimized. In contrast to previous work, our techniques are deeply integrated with the underlying parallel query optimizer, which results in more accurate recommendations in a shorter amount of time. Our experimental evaluation using a real MPP system, Microsoft SQL Server 2008 Parallel Data Warehouse, with both real and synthetic workloads shows the effectiveness of the proposed techniques and the importance of deep integration of the partitioning advisor with the underlying query optimizer.
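The replicate-versus-distribute choice described in the abstract can be illustrated with a small sketch. It is not the advisor from the paper: the names `TableDesign` and `join_needs_data_movement` are hypothetical, and the check assumes inner equi-joins on a single column, with all distributed tables hashed by the same function over the same set of compute nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableDesign:
    """Hypothetical description of how one table is laid out across nodes."""
    name: str
    # Column the table is hash-distributed on; None means the table is
    # replicated (a full copy on every compute node).
    distribution_column: Optional[str] = None

    @property
    def replicated(self) -> bool:
        return self.distribution_column is None


def join_needs_data_movement(left: TableDesign, left_col: str,
                             right: TableDesign, right_col: str) -> bool:
    """Return True if the inner equi-join left.left_col = right.right_col
    cannot be evaluated locally on each node under this simplified model."""
    # A replicated table is fully available on every node, so each node can
    # join its local rows of the other table without moving anything.
    if left.replicated or right.replicated:
        return False
    # Two distributed tables are co-located for this join only if each one
    # is distributed on its own join column.
    co_located = (left.distribution_column == left_col
                  and right.distribution_column == right_col)
    return not co_located


# Illustrative design: two tables distributed on customer_id, one replicated.
orders = TableDesign("orders", "customer_id")
customers = TableDesign("customers", "customer_id")
nation = TableDesign("nation")  # replicated

print(join_needs_data_movement(orders, "customer_id", customers, "customer_id"))
print(join_needs_data_movement(orders, "order_id", customers, "customer_id"))
print(join_needs_data_movement(orders, "nation_id", nation, "nation_id"))
```

Under these assumptions, only the second join requires re-partitioning, because `orders` is not distributed on `order_id`. A real advisor would weigh such movement costs across the whole workload using the query optimizer's cost model rather than a yes/no check.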

Nicolas Bruno | Rimma V. Nehme

[1] Nicolas Bruno, et al. Configuration-parametric query optimization for physical design tuning, 2008, SIGMOD Conference.

[2] Chun Zhang, et al. Automating physical database design in a parallel database, 2002, SIGMOD '02.

[3] Surajit Chaudhuri, et al. Automated Selection of Materialized Views and Indexes in SQL Databases, 2000, VLDB.

[4] Connolly, et al. Database Systems, 2004.

[5] David J. DeWitt, et al. Parallel database systems: the future of high performance database systems, 1992, CACM.

[6] Anastasia Ailamaki, et al. Efficient Use of the Query Optimizer for Automated Database Design, 2007, VLDB.

[7] Abraham Silberschatz, et al. Efficient and Accurate Cost Models for Parallel Query Optimization, 1996, ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems.

[8] Goetz Graefe. The Cascades Framework for Query Optimization, 1995, IEEE Data Eng. Bull.

[9] Daniel C. Zilio, et al. Physical database design decision algorithms and concurrent reorganization for parallel database systems, 1998.

[10] David J. DeWitt, et al. Hybrid-Range Partitioning Strategy: A New Declustering Strategy for Multiprocessor Database Machines, 1990, VLDB.

[11] Surajit Chaudhuri, et al. An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server, 1997, VLDB.