Application Use Cases

Abstract

Platform atomics provide memory consistency and atomicity across a heterogeneous architecture. They allow the latency compute units (CPU cores) and the throughput compute units (graphics processing unit, or GPU, cores) to access the same memory locations simultaneously. This chapter describes three case studies of algorithm patterns that benefit from platform atomics. The first is a task queue system in which the CPU produces tasks that the GPU processes; it avoids kernel relaunches and ensures load balancing. The second implements breadth-first search as an example of an application that dynamically identifies the most appropriate cores (CPU or GPU) to execute a task, so program execution can occasionally switch between the CPU and the GPU, depending on workload characteristics. The third is a data layout conversion routine that uses the CPU and the GPU simultaneously to shift the elements of an array in place. It represents a pattern of close collaboration between the CPU and the GPU to process a pool of fine-grained tasks.
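The first pattern, a producer that publishes tasks while consumers claim them through shared atomic counters, can be illustrated with a minimal host-only sketch. This is not the chapter's implementation: ordinary C++ threads stand in for GPU work-groups, `std::atomic` stands in for platform atomics, and every name (`TaskQueue`, `produce`, `consume`, `run_demo`) is hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical shared queue. In a real heterogeneous system this would live in
// memory visible to both CPU and GPU; here it is plain host memory.
struct TaskQueue {
    std::vector<int> tasks;                 // task payloads, written by the producer
    std::atomic<std::size_t> produced{0};   // number of tasks published so far
    std::atomic<std::size_t> claimed{0};    // index of the next unclaimed task
    std::atomic<bool> done{false};          // set once the producer has finished
    explicit TaskQueue(std::size_t capacity) : tasks(capacity) {}
};

// Producer (the "CPU" role): write a task, then publish it with a release store
// so that a consumer that observes the new count also observes the payload.
void produce(TaskQueue& q, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        q.tasks[i] = static_cast<int>(i) + 1;
        q.produced.store(i + 1, std::memory_order_release);
    }
    q.done.store(true, std::memory_order_release);
}

// Consumer (the "GPU" role): claim the next published task with a
// compare-and-swap, so each task is processed exactly once. The worker keeps
// running until the queue is drained, so no relaunch is needed per batch.
long consume(TaskQueue& q) {
    long sum = 0;
    for (;;) {
        std::size_t idx = q.claimed.load(std::memory_order_relaxed);
        std::size_t avail = q.produced.load(std::memory_order_acquire);
        if (idx < avail) {
            if (q.claimed.compare_exchange_weak(idx, idx + 1,
                                                std::memory_order_relaxed)) {
                sum += q.tasks[idx];        // "process" the claimed task
            }
        } else if (q.done.load(std::memory_order_acquire) &&
                   q.claimed.load(std::memory_order_relaxed) >=
                       q.produced.load(std::memory_order_acquire)) {
            return sum;                     // producer finished and queue drained
        } else {
            std::this_thread::yield();      // nothing available yet
        }
    }
}

// Start the workers first, then produce, so production and consumption overlap.
long run_demo(std::size_t n_tasks, unsigned n_workers) {
    TaskQueue q(n_tasks);
    std::vector<long> partial(n_workers, 0);
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < n_workers; ++w) {
        workers.emplace_back([&q, &partial, w] { partial[w] = consume(q); });
    }
    produce(q, n_tasks);
    for (auto& t : workers) t.join();
    long total = 0;
    for (long p : partial) total += p;
    return total;
}
```

Because workers claim tasks one at a time as they become free, faster workers naturally take more tasks, which is the load-balancing property the abstract refers to. For example, `run_demo(1000, 4)` sums the payloads 1 through 1000 across four workers and returns 500500.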
