MG-Join: A Scalable Join for Massively Parallel Multi-GPU Architectures

The recent scale-up of GPU hardware, through the integration of multiple GPUs into a single machine and the introduction of higher-bandwidth interconnects such as NVLink 2.0, has opened new opportunities for relational query processing on multiple GPUs. However, due to the unique characteristics of GPUs and their interconnects, existing hash join implementations spend up to 66% of their execution time moving data between GPUs and achieve less than 50% utilization of the newer high-bandwidth interconnects. This leads to extremely poor scalability of hash join performance on multiple GPUs, which can even be slower than on a single GPU. In this paper, we propose MG-Join, a scalable partitioned hash join implementation for multiple GPUs in a single machine. To improve bandwidth utilization, we develop a novel multi-hop routing scheme for cross-GPU communication that adaptively chooses an efficient route for each data flow to minimize congestion. Our experiments on a DGX-1 machine show that MG-Join significantly reduces the communication overhead and achieves up to 97% utilization of the bisection bandwidth of the interconnects, resulting in significantly better scalability. Overall, MG-Join outperforms state-of-the-art hash join implementations by up to 2.5x. MG-Join further improves the overall performance of TPC-H queries by up to 4.5x over the multi-GPU version of OmniSci, an open-source commercial GPU database.
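
The abstract does not spell out the routing algorithm, so the following is only an illustrative sketch of the general idea of congestion-aware multi-hop routing, not MG-Join's actual method. It assumes a greedy heuristic: each flow is placed, largest first, on the route (direct or through intermediate GPUs) whose most heavily loaded link would finish soonest. The function names `candidate_routes` and `route_flows`, the `max_hops` parameter, and the toy bandwidth figures are all hypothetical.

```python
from itertools import permutations


def candidate_routes(src, dst, links, max_hops=2):
    """Enumerate simple paths from src to dst that use at most max_hops links."""
    nodes = {n for link in links for n in link}
    others = sorted(nodes - {src, dst})
    routes = []
    for k in range(max_hops):  # k = number of intermediate GPUs
        for mids in permutations(others, k):
            path = (src, *mids, dst)
            hops = list(zip(path, path[1:]))
            if all(h in links for h in hops):
                routes.append(hops)
    return routes


def route_flows(flows, links, max_hops=2):
    """Greedy heuristic (an assumption, not the paper's algorithm).

    flows: {(src, dst): bytes to send}
    links: {(a, b): bandwidth of the directed link a -> b}
    Returns the chosen route per flow and the bytes assigned to each link.
    """
    load = {link: 0.0 for link in links}
    chosen = {}
    # Place the largest flows first so they get the least congested routes.
    for (src, dst), size in sorted(flows.items(), key=lambda kv: -kv[1]):
        routes = candidate_routes(src, dst, links, max_hops)
        if not routes:
            raise ValueError(f"no route from {src} to {dst}")

        def bottleneck_time(hops):
            # Time at which the slowest link on this route would drain.
            return max((load[h] + size) / links[h] for h in hops)

        best = min(routes, key=bottleneck_time)
        for h in best:
            load[h] += size
        chosen[(src, dst)] = best
    return chosen, load


# Toy topology: GPU 0 and GPU 1 share a slow link; both have fast links to GPU 2.
links = {(0, 1): 1.0, (1, 0): 1.0,
         (0, 2): 4.0, (2, 0): 4.0,
         (1, 2): 4.0, (2, 1): 4.0}
chosen, load = route_flows({(0, 1): 8.0}, links)
print(chosen[(0, 1)])
```

In this toy topology the flow from GPU 0 to GPU 1 is relayed through GPU 2, because the two fast hops finish sooner than the single slow direct link. A real system would also have to account for relay buffering and for flows that change over time, which this sketch ignores.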
