Stadium Hashing: Scalable and Flexible Hashing on GPUs

Hashing is a fundamental operation that gives a program fast access to large amounts of data. Despite the emergence of GPUs as many-threaded general-purpose processors, high-performance parallel hashing solutions for GPUs have yet to receive adequate attention. Existing hashing solutions for GPUs impose restrictions that limit their applicability (e.g., an inability to execute insertion and retrieval operations concurrently, and limits on the size of key-value pairs). In addition, their performance does not scale to large hash tables that must be kept out-of-core in host memory. In this paper we present Stadium Hashing (Stash), which scales to large hash tables and is practical because it does not impose the aforementioned restrictions. To support large out-of-core hash tables, Stash uses a compact data structure named the ticket-board, which is separate from the hash table buckets and is held in GPU global memory. The ticket-board resolves a significant portion of insertion and lookup operations locally; by reducing accesses to host memory, it accelerates these operations. Separating the ticket-board from the buckets also enables arbitrarily large keys and values. Unlike existing methods, Stash naturally supports concurrent insertions and retrievals because it uses double hashing as its collision resolution strategy. Furthermore, we propose Stash with collaborative lanes (clStash), which improves the GPU's SIMD resource utilization for batched insertions during hash table creation. For concurrent insertion and retrieval streams, Stadium Hashing can be up to 2 times faster than GPU Cuckoo hashing for in-core tables and up to 3 times faster for out-of-core tables.
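The interplay between a compact index and double hashing can be illustrated with a small single-threaded sketch. It is not the Stash implementation: the abstract does not specify the ticket-board layout, so the per-bucket occupancy flag, the 3-bit fingerprint (`INFO_BITS`), the class name `TicketBoardTable`, and the `bucket_reads` counter are all assumptions made for illustration. The sketch also assumes distinct keys, no deletions, and a prime capacity, and it models none of the GPU concurrency.

```python
class TicketBoardTable:
    """Toy model: a small index kept apart from the (large) bucket array."""

    INFO_BITS = 3  # hypothetical fingerprint width, not taken from the paper

    def __init__(self, capacity):
        # Assumption: capacity is prime, so every stride visits all buckets.
        self.capacity = capacity
        self.occupied = [False] * capacity  # stands in for the compact index
        self.info = [0] * capacity          # small fingerprint per bucket
        self.buckets = [None] * capacity    # stands in for the slow, large table
        self.bucket_reads = 0               # counts simulated slow-memory reads

    def _h1(self, key):
        return hash(key) % self.capacity

    def _h2(self, key):
        # Stride in [1, capacity - 1]; never zero, so probing always advances.
        return 1 + hash((key, 1)) % (self.capacity - 1)

    def _fingerprint(self, key):
        return hash((key, 2)) % (1 << self.INFO_BITS)

    def _probe(self, key):
        # Double hashing: slot_i = (h1 + i * h2) mod capacity.
        start, step = self._h1(key), self._h2(key)
        for i in range(self.capacity):
            yield (start + i * step) % self.capacity

    def insert(self, key, value):
        # Assumes the key is not already present.
        for slot in self._probe(key):
            if not self.occupied[slot]:
                self.occupied[slot] = True
                self.info[slot] = self._fingerprint(key)
                self.buckets[slot] = (key, value)
                return slot
        raise RuntimeError("table is full")

    def lookup(self, key):
        fingerprint = self._fingerprint(key)
        for slot in self._probe(key):
            if not self.occupied[slot]:
                return None   # empty slot ends the probe sequence: key absent
            if self.info[slot] != fingerprint:
                continue      # rejected by the index alone, no bucket read
            self.bucket_reads += 1
            stored_key, value = self.buckets[slot]
            if stored_key == key:
                return value
        return None


table = TicketBoardTable(101)
for k in range(60):
    table.insert(k, k * k)
print(table.lookup(7), table.lookup(1000), table.bucket_reads)
```

In this model, a probe touches the bucket array only when the slot is occupied and its fingerprint matches, which mirrors the abstract's claim that many operations can be resolved without reaching the out-of-core table. A real GPU version would need atomic updates on the index and a memory layout suited to coalesced access, neither of which is shown here.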
