An Investigation of Atomic Synchronization for Sort-Based Group-By Aggregation on GPUs

Using heterogeneous processing devices, such as GPUs, to accelerate relational database operations is a well-known strategy. In this context, the group-by operation is highly interesting for two reasons. First, it incurs large processing costs. Second, its results (i.e., aggregates) are usually small, which reduces data movement costs, whose compensation is a major challenge for heterogeneous computing. Group-by computation on GPUs generally relies on either sorting or hashing. Today, empirical results suggest that hash-based approaches are superior. Conceptually, however, hashing induces an unpredictable memory access pattern that conflicts with the architecture of GPUs. This motivates studying why current sort-based approaches are generally inferior. Our results indicate that current sorting solutions cannot exploit the full parallel power of modern GPUs. Experimentally, we show that the issue arises from the need to synchronize, via atomics, parallel threads that access the shared memory location containing the aggregates. Our quantification of the optimal performance motivates us to investigate how to minimize the overhead of atomics. This results in different variants using atomics, the best of which almost entirely eliminate the atomics overhead. The results of a large-scale evaluation reveal that our approach achieves a 3x speed-up over existing sort-based approaches and up to a 2x speed-up over hash-based approaches.
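
To make the role of atomics concrete, the following is a minimal sketch of one generic way to reduce atomic traffic when the input is already sorted by group key. It is not one of the variants evaluated in this work. It assumes dense group ids in the range 0..num_groups-1 and a SUM aggregate, and it uses CPU threads with std::atomic as a stand-in for GPU threads. The function name sorted_group_sum and the atomic_ops counter are hypothetical and introduced only for illustration. Each worker folds a run of equal keys into a local partial sum and issues one atomic add per run, rather than one per row.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: per-group SUM over keys that are already sorted.
// Assumption: keys are dense group ids in [0, num_groups).
// If atomic_ops is non-null, it receives the number of atomic adds issued.
std::vector<std::int64_t> sorted_group_sum(const std::vector<std::uint32_t>& keys,
                                           const std::vector<std::int64_t>& vals,
                                           std::size_t num_groups,
                                           unsigned num_workers,
                                           std::size_t* atomic_ops = nullptr) {
    assert(keys.size() == vals.size());
    num_workers = std::max(1u, num_workers);

    // Shared aggregates, one slot per group, updated only through atomics.
    std::vector<std::atomic<std::int64_t>> agg(num_groups);
    for (auto& a : agg) a.store(0, std::memory_order_relaxed);
    std::atomic<std::size_t> ops{0};

    const std::size_t n = keys.size();
    const std::size_t chunk = (n + num_workers - 1) / num_workers;

    // Each worker scans one contiguous chunk of the sorted input.
    auto work = [&](std::size_t begin, std::size_t end) {
        std::size_t local_ops = 0;
        std::size_t i = begin;
        while (i < end) {
            const std::uint32_t k = keys[i];
            std::int64_t partial = 0;
            // Fold the whole run of equal keys locally, without synchronization.
            while (i < end && keys[i] == k) partial += vals[i++];
            // One atomic add per run; a run split across chunks costs one per chunk.
            agg[k].fetch_add(partial, std::memory_order_relaxed);
            ++local_ops;
        }
        ops.fetch_add(local_ops, std::memory_order_relaxed);
    };

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < num_workers; ++w) {
        const std::size_t b = std::min(n, static_cast<std::size_t>(w) * chunk);
        const std::size_t e = std::min(n, b + chunk);
        pool.emplace_back(work, b, e);
    }
    for (auto& t : pool) t.join();

    if (atomic_ops) *atomic_ops = ops.load(std::memory_order_relaxed);

    std::vector<std::int64_t> out(num_groups);
    for (std::size_t g = 0; g < num_groups; ++g)
        out[g] = agg[g].load(std::memory_order_relaxed);
    return out;
}
```

In this sketch the number of atomic operations is bounded by the number of key runs per chunk rather than by the number of rows, so large groups contribute very few synchronized updates. A naive version that issues one atomic add per row would perform as many atomics as there are input rows.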
