Memory-Aware Thread Aggregation Framework for Dynamic Parallelism in GPUs

Problems based on matrices and vectors exhibit regular data-level parallelism and achieve maximum throughput when executed on GPUs. Problems built on hierarchical data structures such as adaptive grids, sparse graphs, and trees, as well as applications using recursion, may not exhibit such regular parallelism. In algorithms based on these data structures, parallelism is discovered dynamically as computations are nested: depending on the data being processed, each task may need to perform a variable amount of computation, and the amount of parallelism available at any point of execution cannot be determined a priori, as it is data dependent and evolves with the computation itself. This non-determinism manifests as control-flow divergence, diverse memory access patterns, and data-dependent parallelism, which do not map naturally onto GPU architectures and result in much lower utilization of GPU resources. This paper proposes a three-level thread aggregation framework to improve resource utilization and occupancy: threads are aggregated into larger grids and executed at the appropriate level. Programs from the Rodinia benchmark suite and the NVIDIA SDK are used to evaluate the framework. Occupancy improves by 2X to 15X, with performance improvements of up to 50% in some cases.
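To illustrate the kind of workload the abstract describes, the sketch below (not the paper's framework) shows data-dependent nested parallelism expressed with CUDA dynamic parallelism over a sparse graph in CSR form. The names `rowPtr`, `colIdx`, and the kernel names are illustrative assumptions, not from the paper. Each parent thread discovers its amount of work (a vertex's degree) only at run time and launches a child grid sized to it; launching many such small child grids is precisely the overhead and occupancy problem that thread or launch aggregation schemes aim to address.

```cuda
// Illustrative sketch of data-dependent nested parallelism (CSR graph).
// Assumed names: rowPtr/colIdx (CSR arrays), childKernel/parentKernel.

__global__ void childKernel(const int *neighbors, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        // ... process neighbor i of the parent vertex ...
    }
}

__global__ void parentKernel(const int *rowPtr, const int *colIdx,
                             int nVertices) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= nVertices) return;

    int begin  = rowPtr[v];
    int end    = rowPtr[v + 1];
    int degree = end - begin;   // nested work is data dependent

    if (degree > 0) {
        // One small child grid per vertex: functionally correct, but the
        // many tiny launches underutilize the GPU -- the motivation for
        // aggregating such launches into larger grids.
        childKernel<<<(degree + 31) / 32, 32>>>(colIdx + begin, degree);
    }
}
```

Aggregation approaches instead collect the per-vertex work items (e.g. into a queue) and launch one larger kernel over all of them, trading per-launch overhead for higher occupancy.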
