An Analytical Study of Recursive Tree Traversal Patterns on Multi- and Many-Core Platforms

Recursive tree traversals appear in many application domains, such as data mining, graphics, machine learning, and scientific simulations. In the past few years there has been growing interest in deploying applications based on graph data structures on many-core devices. A few recent efforts have focused on optimizing the execution of multiple serial tree traversals on GPUs, and have reported performance trends that vary across algorithms. In this work, we aim to understand how to select the implementation and platform best suited to a given tree traversal algorithm and dataset. To this end, we perform a systematic study of recursive tree traversals on CPU, GPU, and the Intel Xeon Phi processor. We first identify four tree traversal patterns: three of them perform multiple serial traversals concurrently, and the fourth performs a single parallel level-order traversal. For each pattern, we consider different code variants, including existing and new optimization methods, and we characterize their control-flow and memory access patterns. We implement these code variants and evaluate them on CPU, GPU, and Intel Xeon Phi. Our analysis shows that no single code variant and platform achieves the best performance across all tree traversal patterns, and it provides guidelines for selecting the implementation best suited to a given tree traversal pattern and input dataset.
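To make the two ends of this spectrum concrete, the CUDA sketch below illustrates (a) the style in which each GPU thread runs its own serial traversal of a shared tree, and (b) a single traversal parallelized level by level through a shared frontier. This is a minimal illustration written for this summary, not code from the paper: the Node layout, kernel names, and the per-node work are illustrative assumptions.

// Minimal sketch (not the paper's code) of two traversal patterns, assuming
// the tree is stored as an array of nodes with explicit child indices.
#include <cuda_runtime.h>

struct Node {
    float3 point;      // payload stored at this node
    int left, right;   // child indices, -1 if absent
};

// Pattern A: one independent serial traversal per thread.
// Each thread descends the shared tree for its own query point, as in a
// binary search tree keyed on the x coordinate, tracking the closest payload seen.
__global__ void perThreadTraversal(const Node* tree, const float3* queries,
                                   int numQueries, float* bestDist) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= numQueries) return;
    float3 p = queries[q];
    float best = 1e30f;
    int node = 0;                              // root stored at index 0
    while (node != -1) {
        float3 c = tree[node].point;
        float dx = p.x - c.x, dy = p.y - c.y, dz = p.z - c.z;
        best = fminf(best, dx * dx + dy * dy + dz * dz);
        node = (p.x < c.x) ? tree[node].left : tree[node].right;
    }
    bestDist[q] = best;
}

// Pattern B: a single traversal parallelized level by level.
// All threads cooperate on one frontier of node indices and append the
// children they discover to the next frontier; the host relaunches this
// kernel until the frontier is empty.
__global__ void levelOrderStep(const Node* tree, const int* frontier,
                               int frontierSize, int* nextFrontier,
                               int* nextFrontierSize, float* nodeResult) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontierSize) return;
    int node = frontier[i];
    nodeResult[node] = tree[node].point.x;     // placeholder per-node work
    int l = tree[node].left, r = tree[node].right;
    if (l != -1) nextFrontier[atomicAdd(nextFrontierSize, 1)] = l;
    if (r != -1) nextFrontier[atomicAdd(nextFrontierSize, 1)] = r;
}

In the per-thread pattern, neighboring threads may diverge when their traversals take different paths through the tree, while the level-order pattern avoids that divergence at the cost of frontier-management overhead; this trade-off is one reason the best-performing variant differs across algorithms and datasets.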
