Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?

With the growth in transistor counts in modern hardware, heterogeneous systems are becoming commonplace. Core counts are increasing such that GPU and CPU designs are reaching deep into the tens of cores. For performance reasons, different cores in a heterogeneous platform follow different design choices. Based on throughput computing goals, GPU cores tend to support wide vectors and substantial register files. Current designs optimize CPU cores for latency, dedicating logic to caches and out-of-order dependence control. Heterogeneous parallel primitives (HPP) addresses two major shortcomings in current GPGPU programming models: it supports full composability by defining abstractions and increases flexibility in execution by introducing braided parallelism. Heterogeneous parallel primitives is an object-oriented, C++11-based programming model that addresses these shortcomings on both CPUs and massively multithreaded GPUs: it supports full composability by defining abstractions using distributed arrays and barrier objects, and it increases flexibility in execution by introducing braided parallelism. This paper implemented a feature-complete version of HPP, including all syntactic constructs, that runs on top of a task-parallel runtime executing on the CPU. They continue to develop and improve the model, including reducing overhead due to channel management, and plan to make a public version available sometime in the future.

[1]  Pat Hanrahan,et al.  GRAMPS: A programming model for graphics pipelines , 2009, ACM Trans. Graph..

[2]  Jeff A. Stuart,et al.  A study of Persistent Threads style GPU programming for GPGPU workloads , 2012, 2012 Innovative Parallel Computing (InPar).

[3]  Katherine A. Yelick,et al.  Titanium: A High-performance Java Dialect , 1998, Concurr. Pract. Exp..

[4]  Wu-chun Feng,et al.  Inter-block GPU communication via fast barrier synchronization , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[5]  Andrew S. Grimshaw,et al.  Revisiting sorting for GPGPU stream architectures , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[6]  A. S. Grimshaw,et al.  Braid: integrating task and data parallelism , 1995, Proceedings Frontiers '95. The Fifth Symposium on the Frontiers of Massively Parallel Computation.

[7]  Timo Aila,et al.  Understanding the efficiency of ray traversal on GPUs , 2009, High Performance Graphics.

[8]  M. Pharr,et al.  ispc: A SPMD compiler for high-performance CPU programming , 2012, 2012 Innovative Parallel Computing (InPar).

[9]  Vivek Sarkar,et al.  X10: an object-oriented approach to non-uniform cluster computing , 2005, OOPSLA '05.

[10]  Katherine Yelick,et al.  Titanium: a high-performance Java dialect , 1998 .

[11]  Seth Copen Goldstein,et al.  Active messages: a mechanism for integrating communication and computation , 1998, ISCA '98.