Revisiting Pipelined Parallelism in Multi-Join Query Processing

Multi-join queries are at the core of any integration service that combines data from multiple distributed data sources. Because of the large number of data sources and the potentially high volumes of data, the evaluation of multi-join queries faces growing scalability concerns. State-of-the-art approaches to parallel multi-join query processing commonly assume that applying maximal pipelined parallelism leads to superior performance. In this paper, we show that this assumption does not hold in general. We investigate how best to combine pipelined parallelism with alternative forms of parallelism to achieve an effective overall processing strategy, and we propose a segmented bushy processing strategy. Experimental studies are conducted on an actual software system running on a cluster of high-performance PCs. The results confirm that the proposed solution improves total processing time by about 50% compared with existing state-of-the-art solutions.
