Abstract: Mapping Streaming Applications onto GPU Systems

We describe an efficient and scalable code generation framework that automatically maps general-purpose streaming applications onto GPU systems. This architecture-driven framework accounts for the idiosyncrasies of the GPU pipeline and its unique memory hierarchy, and it is implemented as a back-end to the StreamIt programming language compiler. Several key features of the framework help maximize performance and scalability. First, the generated code increases the effectiveness of the on-chip memory hierarchy by employing a heterogeneous mix of compute and memory access threads. This scheme goes against the conventional wisdom of GPU programming, which is to use a large number of homogeneous threads. Second, we use an efficient stream graph partitioning algorithm to handle larger applications and achieve the best performance under the given on-chip memory constraints. Lastly, the framework maps complex applications onto multiple GPUs using a highly effective pipeline execution scheme. Our comprehensive experiments demonstrate the framework's scalability and a significant speedup over a state-of-the-art solution.
