Distributed Join Algorithms on Multi-GPU Clusters with GPUDirect RDMA

In data management systems, query processing on GPUs and on distributed clusters has proven to be an effective way to achieve high efficiency. However, the PCIe data transfer overhead between CPUs and GPUs, and the communication cost between nodes in distributed systems, are often the bottleneck limiting system performance. Recently, GPUDirect RDMA has been developed and has received considerable attention. It combines the features of the RDMA and GPUDirect technologies, opening new opportunities for optimizing query processing. In this paper, we revisit the join, one of the most important operators in query processing, with GPUDirect RDMA. Specifically, we explore the performance of the hash join and the sort-merge join with GPUDirect RDMA. We present a new design that uses GPUDirect RDMA to improve data communication in distributed join algorithms on multi-GPU clusters. We propose a series of techniques, including multi-layer data partitioning and adaptive data communication path selection across the available transmission channels. Experiments show that the proposed distributed join algorithms using GPUDirect RDMA achieve up to a 1.83x speedup over state-of-the-art distributed join algorithms. To the best of our knowledge, this is the first work on distributed GPU join algorithms. We believe the insights and implications of this study will shed light on future research using GPUDirect RDMA.
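The multi-layer data partitioning idea mentioned above can be illustrated with a simplified two-level radix-style partitioning sketch: tuples are first assigned to a destination node, then to a GPU within that node, so that each transfer targets a specific GPU buffer. This is a hypothetical illustration of the general technique, not the authors' implementation; the function name and hashing scheme are assumptions for exposition.

```python
# Illustrative two-level (multi-layer) partitioning for a distributed
# GPU hash join: level 1 routes a tuple key to a node, level 2 routes
# it to a GPU within that node. Not the paper's actual code.

def two_level_partition(keys, num_nodes, gpus_per_node):
    """Return partitions[node][gpu] -> list of keys routed there."""
    partitions = [[[] for _ in range(gpus_per_node)]
                  for _ in range(num_nodes)]
    for k in keys:
        node = hash(k) % num_nodes                     # level 1: inter-node
        gpu = (hash(k) // num_nodes) % gpus_per_node   # level 2: intra-node
        partitions[node][gpu].append(k)
    return partitions

if __name__ == "__main__":
    parts = two_level_partition(range(1000), num_nodes=4, gpus_per_node=2)
    # Every key is routed to exactly one (node, gpu) bucket.
    print(sum(len(p) for node in parts for p in node))
```

In a real multi-GPU cluster, each `(node, gpu)` bucket would be staged in a GPU-resident buffer and shipped directly over GPUDirect RDMA, avoiding the host-memory bounce that a single-level, node-only partitioning would require.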
