Paper information - GPU-Chariot: A Programming Framework for Stream Applications Running on Multi-GPU Systems

SUMMARY This paper presents a stream programming framework, named GPU-chariot, for accelerating stream applications running on graphics processing units (GPUs). The main contribution of our framework is that it realizes efficient software pipelines on multi-GPU systems by enabling out-of-order execution of CPU functions, kernels, and data transfers. To achieve this out-of-order execution, we apply a runtime scheduler that not only maximizes the utilization of system resources but also encapsulates the number of GPUs available in the system. In addition, we implement a load-balancing capability to flow data efficiently through multiple GPUs. Furthermore, a callback interface enables overlapping execution of functions in third-party libraries. By using kernels with different performance bottlenecks, we show that our out-of-order execution is up to 20% faster than in-order execution. Finally, we conduct several case studies on a 4-GPU system and demonstrate the advantages of GPU-chariot over a manually pipelined code. We conclude that GPU-chariot can be useful when developing stream applications with software pipelines on multiple
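The benefit of out-of-order execution described in the summary can be illustrated with a toy scheduling model. The sketch below is not GPU-chariot's scheduler or API. The names `Task`, `make_pipeline`, and `simulate`, the two resources (`copy` for a single copy engine, `gpu` for kernel execution), and all durations are hypothetical and chosen only for illustration. The in-order baseline here is a strict FIFO issue, which is deliberately pessimistic and is not necessarily the baseline measured in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    resource: str        # hypothetical resource label: "copy" or "gpu"
    duration: int        # arbitrary time units, must be positive
    deps: tuple = ()     # names of tasks that must finish first


def make_pipeline(n_chunks, copy_time=2, kernel_time=3):
    """Build tasks in program order: upload, kernel, download per chunk."""
    tasks = []
    for i in range(n_chunks):
        tasks.append(Task(f"h2d{i}", "copy", copy_time))
        tasks.append(Task(f"kernel{i}", "gpu", kernel_time, (f"h2d{i}",)))
        tasks.append(Task(f"d2h{i}", "copy", copy_time, (f"kernel{i}",)))
    return tasks


def simulate(tasks, in_order):
    """Return the makespan of a greedy schedule.

    in_order=True:  only the oldest unstarted task may start (FIFO issue).
    in_order=False: any task whose dependencies are done and whose
                    resource is free may start, oldest first.
    """
    finish = {}      # task name -> finish time (recorded when it starts)
    free_at = {}     # resource -> time at which it becomes free
    pending = list(tasks)
    t = 0
    while pending:
        started = False
        for task in pending:
            ready = all(finish.get(d, float("inf")) <= t for d in task.deps)
            if ready and free_at.get(task.resource, 0) <= t:
                finish[task.name] = t + task.duration
                free_at[task.resource] = t + task.duration
                pending.remove(task)
                started = True
                break
            if in_order:
                break  # head-of-line blocking: later tasks must wait
        if not started:
            t += 1     # nothing could start; advance the clock
    return max(finish.values())


if __name__ == "__main__":
    pipeline = make_pipeline(3)
    print("in-order makespan:    ", simulate(pipeline, in_order=True))
    print("out-of-order makespan:", simulate(pipeline, in_order=False))
```

In this model the out-of-order schedule is shorter because the upload of the next chunk overlaps with the kernel of the current one. The size of the gap depends entirely on the assumed durations and says nothing about the speedups reported in the paper.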

Fumihiko Ino | Kenichi Hagihara | Shinta Nakagawa
