Revisiting hash join on graphics processors: a decade later

Over the last decade, significant research effort has gone into improving the performance of the hash join operation on GPUs. Over the same period, the GPU architecture has changed significantly. In this paper, we therefore first revisit the major GPU hash join implementations of the last decade and detail how they take advantage of different GPU architecture features. We then perform a comprehensive performance evaluation of these implementations on different generations of GPUs released over the last decade, which sheds light on the impact of different architecture features and identifies the factors guiding the choice of these features. Next, we study how data characteristics such as skew and match rate affect the performance of GPU hash join implementations and propose techniques to improve the performance of existing implementations under such conditions. Finally, we perform an in-depth comparison of the performance and cost-efficiency of GPU hash join implementations against a state-of-the-art CPU implementation.

[25]  Bingsheng He,et al.  Revisiting Hash Join on Graphics Processors: A Decade Later , 2019, 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW).
