Paper Information - Abstract: Mapping Streaming Applications onto GPU Systems

Abstract: Mapping Streaming Applications onto GPU Systems

We describe an efficient and scalable code generation framework that automatically maps general purpose streaming applications onto GPU systems. This architecture-driven framework takes into account the idiosyncrasies of the GPU pipeline and the unique memory hierarchy. The framework has been implemented as a back-end to the StreamIt programming language compiler. Several key features in this framework ensure maximized performance and scalability. First, the generated code increases the effectiveness of the on-chip memory hierarchy by employing a heterogeneous mix of compute and memory access threads. Our scheme goes against the conventional wisdom of GPU programming, which is to use a large number of homogeneous threads. Second, we utilise an efficient stream graph partitioning algorithm to handle larger applications and achieve the best performance under the given on-chip memory constraints. Lastly, the framework maps complex applications onto multiple GPUs using a highly effective pipeline execution scheme. Our comprehensive experiments show its scalability and significant speedup compared to a state-of-the-art solution.
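To make the idea of partitioning under an on-chip memory constraint concrete, the sketch below greedily groups consecutive filters of a linear pipeline so that each group's buffers fit a fixed shared-memory budget. It is only an illustration and not the partitioning algorithm used by the framework, which the abstract does not specify. The function name `partition_pipeline`, the filter names, the buffer sizes, and the 48 KB budget are all hypothetical.

```python
from typing import List, Tuple


def partition_pipeline(
    filters: List[Tuple[str, int]], smem_budget: int
) -> List[List[str]]:
    """Greedily group consecutive filters so each group's total
    buffer size (in bytes) stays within smem_budget.

    Assumption: the stream graph is a simple linear pipeline and each
    filter's on-chip buffer requirement is known in advance.
    """
    partitions: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, buf_bytes in filters:
        if buf_bytes > smem_budget:
            # A single filter that cannot fit makes the mapping infeasible.
            raise ValueError(
                f"filter {name!r} needs {buf_bytes} B, budget is {smem_budget} B"
            )
        if current and used + buf_bytes > smem_budget:
            # Close the current partition and start a new one.
            partitions.append(current)
            current, used = [], 0
        current.append(name)
        used += buf_bytes
    if current:
        partitions.append(current)
    return partitions


# Hypothetical four-filter pipeline with made-up buffer sizes in bytes.
pipeline = [("source", 4096), ("fir", 20480), ("fft", 32768), ("sink", 8192)]
parts = partition_pipeline(pipeline, smem_budget=48 * 1024)
print(parts)
```

A smaller budget yields more partitions, and each partition boundary implies that data crosses it through off-chip memory. A real partitioner would weigh that trade-off against compute balance, which this greedy pass ignores.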

Weng-Fai Wong | Rick Siow Mong Goh | Huynh Phung Huynh | Abhishek Ray | Andrei Hagiescu

[1] Pat Hanrahan, et al. Brook for GPUs: stream computing on graphics hardware, 2004, ACM Trans. Graph.

[2] Brucek Khailany, et al. CudaDMA: Optimizing GPU memory bandwidth via warp specialization, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[3] David R. Kaeli, et al. Exploring the multiple-GPU design space, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[4] Scott A. Mahlke, et al. Sponge: portable stream programming on graphics engines, 2011, ASPLOS XVI.

[5] George Karypis, et al. Multilevel k-way Partitioning Scheme for Irregular Graphs, 1998, J. Parallel Distributed Comput.

[6] Weng-Fai Wong, et al. Automated Architecture-Aware Mapping of Streaming Applications Onto GPUs, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[7] Yun Liang, et al. Efficient custom instructions generation for system-level design, 2010, 2010 International Conference on Field-Programmable Technology.

[8] Jens H. Krüger, et al. A Survey of General‐Purpose Computation on Graphics Hardware, 2007, Eurographics.

[9] John D. Owens, et al. Multi-GPU MapReduce on GPU Clusters, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[10] William J. Dally, et al. The GPU Computing Era, 2010, IEEE Micro.

[11] Henry Hoffmann, et al. A stream compiler for communication-exposed architectures, 2002, ASPLOS X.

[12] Abhishek Udupa, et al. Software Pipelined Execution of Stream Programs on GPUs, 2009, 2009 International Symposium on Code Generation and Optimization.

[13] Sudhakar Yalamanchili, et al. Speculative execution on multi-GPU systems, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[14] Long Chen, et al. Dynamic load balancing on single- and multi-GPU systems, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[15] Weng-Fai Wong, et al. Scalable framework for mapping streaming applications onto multi-GPU systems, 2012, PPoPP '12.