RegTT: Accelerating Tree Traversals on GPUs by Exploiting Regularities

Tree traversals are widely used irregular applications. When a single tree is traversed by multiple queries, each subject to truncation, efficient parallelization on GPUs is hindered by branch divergence, load imbalance, and memory-access irregularity, because the nodes visited and the order in which they are visited differ greatly from query to query. We exploit several regularities that truncation induces in tree traversals so that as many threads in the same warp as possible visit the same node simultaneously, improving both GPU resource utilization and memory coalescing. We introduce a new parallelization approach, RegTT, which orchestrates the execution of a tree traversal algorithm on GPUs in three steps: it starts with BFT (Breadth-First Traversal), then reorders the queries being processed based on their truncation histories, and finally switches to DFT (Depth-First Traversal). RegTT is general (it does not rely on domain-specific knowledge) and automatic (it is applied as a source-code transformation). On a set of five representative benchmarks, RegTT outperforms the state of the art by 1.66x on average.
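The three-step schedule described above (BFT, query reordering by truncation history, then DFT) can be illustrated with a small sequential sketch. This is not the RegTT implementation, which targets GPUs and is generated by a source-code transformation. It is a CPU-only Python illustration under stated assumptions: a complete binary tree over a 1-D interval, range queries given as hypothetical `(center, radius)` pairs, a truncation test that fires when the query range misses a node's interval, and a BFT depth `levels` no greater than the tree depth. All names (`Node`, `build`, `truncated`, `bft_phase`, `run`) are invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Query = Tuple[float, float]  # hypothetical (center, radius) range query


@dataclass
class Node:
    lo: float
    hi: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(lo: float, hi: float, depth: int) -> Node:
    """Build a complete binary tree that halves [lo, hi) at each level."""
    node = Node(lo, hi)
    if depth > 0:
        mid = (lo + hi) / 2
        node.left = build(lo, mid, depth - 1)
        node.right = build(mid, hi, depth - 1)
    return node


def truncated(node: Node, q: Query) -> bool:
    """Assumed truncation rule: stop where the query range misses the node."""
    c, r = q
    return c + r < node.lo or c - r >= node.hi


def dft(node: Optional[Node], q: Query, out: List[Tuple[float, float]]) -> None:
    """Depth-first traversal with truncation; records the leaves reached."""
    if node is None or truncated(node, q):
        return
    if node.left is None:  # leaf
        out.append((node.lo, node.hi))
        return
    dft(node.left, q, out)
    dft(node.right, q, out)


def bft_phase(root: Node, q: Query, levels: int):
    """Traverse the top `levels` levels breadth-first (levels <= tree depth).

    Returns the truncation history (one bit per level-order node slot) and
    the surviving frontier nodes from which DFT continues.
    """
    frontier: List[Optional[Node]] = [root]
    history: List[int] = []
    for _ in range(levels):
        nxt: List[Optional[Node]] = []
        for node in frontier:
            if node is None or truncated(node, q):
                history.append(1)
                nxt += [None, None]  # keep slots aligned across queries
            else:
                history.append(0)
                nxt += [node.left, node.right]
        frontier = nxt
    return tuple(history), [n for n in frontier if n is not None]


def run(root: Node, queries: List[Query], levels: int):
    """BFT, then reorder queries by truncation history, then DFT."""
    staged = [(bft_phase(root, q, levels), q) for q in queries]
    # Queries with identical histories become neighbours; on a GPU they
    # would tend to fall into the same warp and visit the same nodes.
    staged.sort(key=lambda s: s[0][0])
    results = {}
    for (_history, frontier), q in staged:
        out: List[Tuple[float, float]] = []
        for node in frontier:
            dft(node, q, out)
        results[q] = out
    return [q for _, q in staged], results


root = build(0.0, 16.0, 4)
queries = [(1.0, 0.5), (13.0, 0.5), (2.0, 0.5), (14.0, 0.5)]
order, results = run(root, queries, levels=2)
```

In this toy input, the two queries near the left end of the interval share one truncation history and the two near the right end share another, so the reordering places each pair side by side before the depth-first phase. The sketch only shows the grouping effect; it says nothing about the performance behaviour reported for RegTT.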
