Characterization and analysis of dynamic parallelism in unstructured GPU applications

GPUs have been proven very effective for structured applications. However, emerging data intensive applications are increasingly unstructured - irregular in their memory and control flow behavior over massive data sets. While the irregularity in these applications can result in poor workload balance among fine-grained threads or coarse-grained blocks, one can still observe dynamically formed pockets of structured data parallelism that can locally effectively exploit the GPU compute and memory bandwidth. In this study, we seek to characterize such dynamically formed parallelism and and evaluate implementations designed to exploit them using CUDA Dynamic Parallelism (CDP) - an execution model where parallel workload are launched dynamically from within kernels when pockets of structured parallelism are detected. We characterize and evaluate such implementations by analyzing the impact on control and memory behavior measurements on commodity hardware. In particular, the study targets a comprehensive understanding of the overhead of current CDP support in GPUs in terms of kernel launch, memory footprint and algorithm overhead. Experiments show that while the CDP implementation can generate potentially 1.13x-2.73x speedup over non-CDP implementations, the non-trivial overhead causes the overall performance an average of 1.21x slowdown.

[1]  Kevin Skadron,et al.  A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads , 2010, IEEE International Symposium on Workload Characterization (IISWC'10).

[2]  David A. Bader,et al.  Graph Partitioning and Graph Clustering, 10th DIMACS Implementation Challenge Workshop, Georgia Institute of Technology, Atlanta, GA, USA, February 13-14, 2012. Proceedings , 2013, Graph Partitioning and Graph Clustering.

[3]  Yong Tang,et al.  Gregex: GPU Based High Speed Regular Expression Matching Engine , 2011, 2011 Fifth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing.

[4]  Norman P. Jouppi,et al.  Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[5]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[6]  Kevin Skadron,et al.  Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.

[7]  Yi Yang,et al.  CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications , 2015, Journal of Computer Science and Technology.

[8]  Collin McCurdy,et al.  The Scalable Heterogeneous Computing (SHOC) benchmark suite , 2010, GPGPU-3.

[9]  Youcef Saad,et al.  A Basic Tool Kit for Sparse Matrix Computations , 1990 .

[10]  Andrew S. Grimshaw,et al.  Scalable GPU graph traversal , 2012, PPoPP '12.

[11]  John McHugh,et al.  Testing Intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory , 2000, TSEC.

[12]  Keshav Pingali,et al.  A quantitative study of irregular programs on GPUs , 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).

[13]  Keshav Pingali,et al.  An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm , 2011 .

[14]  Yuni Xia,et al.  GPU accelerated item-based collaborative filtering for big-data applications , 2013, 2013 IEEE International Conference on Big Data.

[15]  Thomas Sangild Sørensen,et al.  Real-time deformation of detailed geometry based on mappings to a less detailed physical simulation on the GPU , 2005, EGVE'05.

[16]  Youcef Saad,et al.  A Basic Tool Kit for Sparse Matrix Computations , 1990 .

[17]  Tor M. Aamodt,et al.  Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[18]  Joshua A. Anderson,et al.  General purpose molecular dynamics simulations fully implemented on graphics processing units , 2008, J. Comput. Phys..

[19]  Sudhakar Yalamanchili,et al.  A characterization and analysis of PTX kernels , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[20]  A L Kuhl,et al.  Thermodynamic States in Explosion Fields , 2009 .

[21]  David K. McAllister,et al.  OptiX: a general purpose ray tracing engine , 2010, ACM Trans. Graph..

[22]  Bruce K. Grace Black-Scholes option pricing via genetic algorithms , 2000 .

[23]  Fei Wang,et al.  Graph-Based Substructure Pattern Mining Using CUDA Dynamic Parallelism , 2013, IDEAL.

[24]  Sudhakar Yalamanchili,et al.  Relational algorithms for multi-bulk-synchronous processors , 2013, PPoPP '13.

[25]  Michela Taufer,et al.  Performance impact of dynamic parallelism on different clustering algorithms , 2013, Defense, Security, and Sensing.

[26]  Kevin Skadron,et al.  Pannotia: Understanding irregular GPGPU graph applications , 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).