Parallel strategy for multiple scan operations with data replication

To support large-scale analytics for Web applications, the backend distributed data management system must provide access to massive data sets, which makes the scan operation a critical step. To improve scan performance, modern data management systems usually rely on simple partitioned parallelism. Under partitioned parallelism, a table consists of several partitions, and each scan operation can access multiple partitions separately. This is a simple and effective solution for a single scan operation. In this paper, we consider managing multiple scan operations together, where the situation is no longer straightforward. To address the problem, we propose a parallel strategy that schedules batched scan operations together, going beyond simple partitioned parallelism. First, we utilize replicas to increase parallelism and propose an effective load balancing strategy over replica nodes based on linear programming. Second, we propose an effective chunk-based scheduling algorithm for multi-threaded parallelism on each node, which ensures that all threads receive even workloads under a qualified cost model. Finally, we integrate our parallel scan strategy into an open-source distributed data management system. Experimental evaluation shows that our parallel scan strategy significantly improves the performance of scan operations.
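
As a rough illustration of the chunk-based idea mentioned above, the following sketch splits partitions into fixed-size chunks and assigns them to threads so that per-thread load stays close to even. It is not the algorithm from the paper: the greedy longest-chunk-first heuristic, the row-count cost model, and the names `split_into_chunks`, `assign_chunks`, and `chunk_rows` are all assumptions made for this example.

```python
import heapq


def split_into_chunks(partition_rows, chunk_rows):
    """Split each partition (given by its row count) into chunks.

    Returns a list of (partition_id, start_row, size) tuples, where every
    chunk holds at most chunk_rows rows. Row count stands in for scan cost
    (an assumption; a real cost model would be richer).
    """
    chunks = []
    for pid, rows in enumerate(partition_rows):
        start = 0
        while start < rows:
            size = min(chunk_rows, rows - start)
            chunks.append((pid, start, size))
            start += size
    return chunks


def assign_chunks(chunks, num_threads):
    """Greedy assignment: give the largest remaining chunk to the
    currently least-loaded thread (hypothetical stand-in heuristic)."""
    heap = [(0, t) for t in range(num_threads)]  # (current load, thread id)
    heapq.heapify(heap)
    plan = [[] for _ in range(num_threads)]
    for chunk in sorted(chunks, key=lambda c: c[2], reverse=True):
        load, t = heapq.heappop(heap)
        plan[t].append(chunk)
        heapq.heappush(heap, (load + chunk[2], t))
    return plan


# Three partitions of uneven size, scanned by four threads.
partition_rows = [1000, 250, 4000]
plan = assign_chunks(split_into_chunks(partition_rows, chunk_rows=500), 4)
loads = [sum(size for _, _, size in thread_chunks) for thread_chunks in plan]
```

With whole partitions as the unit of work, one thread would have to scan the 4000-row partition alone; with chunks, the gap between the busiest and idlest thread in this sketch is bounded by the chunk size.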
