Efficient Data-Parallel Primitives on Heterogeneous Systems

Data-parallel primitives, such as gather, scatter, scan, and split, are widely used in data-intensive applications. However, optimizing them on a system of heterogeneous processors is challenging. In this paper, we study and compare the existing implementations and optimization strategies for a set of data-parallel primitives on three processors: the GPU, the CPU, and the Xeon Phi co-processor. Our goal is to identify the key performance factors in implementing data-parallel primitives on different architectures and to develop general strategies for implementing these primitives efficiently across platforms. We introduce a portable and efficient sequential memory access pattern, which eliminates the cost of adjusting the memory access pattern for individual devices. With proper tuning, our optimized primitive implementations achieve performance comparable to the native versions. Moreover, our profiling results show that the CPU and the Phi co-processor share most optimization strategies whereas the GPU differs from both significantly, owing to hardware differences among these devices such as vectorization efficiency, data and TLB caching, and data prefetching. We summarize these factors and deliver common primitive optimization strategies for heterogeneous systems.
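To make the four primitives named above concrete, the following is a minimal sequential sketch of their semantics (Python, for illustration only; the paper's actual implementations are parallel and tuned per device, and the function names here are our own, not the paper's). It also shows the dependency the abstract alludes to: split can be built on top of scan.

```python
def gather(values, indices):
    """out[i] = values[indices[i]]: reads are indexed (random), writes sequential."""
    return [values[j] for j in indices]

def scatter(values, indices, size):
    """out[indices[i]] = values[i]: writes are indexed (random), reads sequential."""
    out = [0] * size
    for v, j in zip(values, indices):
        out[j] = v
    return out

def exclusive_scan(values):
    """out[i] = sum of values[0..i-1] (exclusive prefix sum)."""
    out, acc = [], 0
    for v in values:
        out.append(acc)
        acc += v
    return out

def split(values, flags):
    """Stable two-way partition: 0-flagged elements precede 1-flagged ones.
    Output positions are computed from prefix sums of the flags, so the
    primitive reduces to scans plus a scatter."""
    zeros = exclusive_scan([1 - f for f in flags])          # rank among 0-flagged
    ones = exclusive_scan(flags)                            # rank among 1-flagged
    n_zeros = zeros[-1] + (1 - flags[-1]) if flags else 0   # total 0-flagged count
    out = [None] * len(values)
    for i, (v, f) in enumerate(zip(values, flags)):
        out[ones[i] + n_zeros if f else zeros[i]] = v
    return out
```

For example, `split([5, 1, 4, 2], flags=[1, 0, 1, 0])` yields `[1, 2, 5, 4]`: the 0-flagged elements come first, each partition in stable order. In parallel implementations, the loop bodies above become per-thread work, and the memory access pattern of the indexed reads and writes is precisely what the paper's optimizations target.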
