Improving Execution Efficiency of Just-in-time Compilation based Query Processing on GPUs

In recent years, we have witnessed significant efforts to improve the performance of Online Analytical Processing (OLAP) on graphics processing units (GPUs). Most existing studies have focused on improving memory efficiency, since memory stalls can play an essential role in query processing performance on GPUs. Motivated by the recent rise of just-in-time (JIT) compilation in query processing, we investigate whether and how query processing performance on GPUs can be improved further. Specifically, we study the execution of state-of-the-art JIT compile-based query processing systems. We find that, thanks to advanced techniques such as database compression and JIT compilation, memory stalls are no longer the most significant bottleneck. Instead, current JIT compile-based query processing suffers from severe under-utilization of GPU hardware, caused by divergent execution and by degraded parallelism arising from resource contention. To address these issues, we propose a JIT compile-based query engine named Pyper that improves GPU utilization during query execution. Pyper introduces two new operators for query plan transformation, Shuffle and Segment, which can be plugged into a physical query plan to reduce divergent execution and to resolve resource contention, respectively. To determine the insertion points for these two operators, we present an analytical model that guides the insertion of Shuffle and Segment operators into a query plan in a cost-based manner. Our experiments show that 1) the analytical modeling of divergent execution and resource contention helps to improve the accuracy of the cost model, and 2) Pyper significantly outperforms other GPU query engines on TPC-H and SSB queries.

PVLDB Reference Format: Johns Paul, Bingsheng He, Shengliang Lu, and Chiew Tong Lau. Improving Execution Efficiency of Just-in-time Compilation based Query Processing on GPUs. PVLDB, 14(2): 202-214, 2021. doi:10.14778/3425879.3425890

PVLDB Artifact Availability: The source code, data, and/or other artifacts have been made available at https://github.com/Xtra-Computing/Pyper.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 2 ISSN 2150-8097.
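To make the notion of divergent execution concrete, the following is a minimal CPU-side sketch in Python, not Pyper's implementation. It assumes a simplified model in which a warp of 32 lanes must execute a post-filter pipeline stage whenever at least one of its lanes holds a tuple that passed the filter, and it compares that with a hypothetical compaction step that packs the surviving tuples densely before the stage runs. All function names are illustrative, and the model ignores the cost of the compaction itself, which a cost-based insertion decision would have to weigh.

```python
import math
import random

WARP_SIZE = 32  # lanes per warp on current NVIDIA GPUs


def warps_without_shuffle(mask):
    """Warps that must run the post-filter stage when tuples stay in place.

    A warp executes the stage if any of its lanes holds a surviving tuple;
    the remaining lanes of that warp sit idle (divergent execution).
    """
    return sum(
        1
        for start in range(0, len(mask), WARP_SIZE)
        if any(mask[start:start + WARP_SIZE])
    )


def warps_with_shuffle(mask):
    """Warps needed if surviving tuples are first packed densely.

    This is an idealized compaction (assumption): survivors fill warps
    back to back, so only the last warp can be partially occupied.
    """
    return math.ceil(sum(mask) / WARP_SIZE)


def lane_utilization(mask, warps):
    """Fraction of lanes doing useful work across the executing warps."""
    if warps == 0:
        return 1.0
    return sum(mask) / (warps * WARP_SIZE)


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical filter with roughly 10% selectivity over 32K tuples.
    mask = [random.random() < 0.1 for _ in range(32 * 1024)]
    before = warps_without_shuffle(mask)
    after = warps_with_shuffle(mask)
    print("warps executing, in place :", before)
    print("warps executing, compacted:", after)
    print("lane utilization, in place :", lane_utilization(mask, before))
    print("lane utilization, compacted:", lane_utilization(mask, after))
```

Under this model, a low-selectivity filter leaves most lanes idle in nearly every warp, while compaction concentrates the surviving tuples into far fewer, fully occupied warps. Whether the saving outweighs the cost of moving the tuples is the kind of trade-off a cost model would need to decide.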
