GPU-Chariot: A Programming Framework for Stream Applications Running on Multi-GPU Systems

SUMMARY This paper presents GPU-Chariot, a stream programming framework for accelerating stream applications running on graphics processing units (GPUs). The main contribution of the framework is that it realizes efficient software pipelines on multi-GPU systems by enabling out-of-order execution of CPU functions, kernels, and data transfers. To achieve this out-of-order execution, we apply a runtime scheduler that not only maximizes the utilization of system resources but also encapsulates the number of GPUs available in the system. In addition, we implement a load-balancing capability so that data flows efficiently through multiple GPUs. Furthermore, a callback interface enables overlapping execution of functions in third-party libraries. Using kernels with different performance bottlenecks, we show that our out-of-order execution is up to 20% faster than in-order execution. Finally, we conduct several case studies on a 4-GPU system and demonstrate the advantages of GPU-Chariot over manually pipelined code. We conclude that GPU-Chariot can be useful when developing stream applications with software pipelines on multiple
