Design space exploration of GPU Accelerated cluster systems for optimal data transfer using PCIe bus

Use of accelerators such as GPUs is increasing, but using GPUs efficiently requires good design choices. Such choices include the type of memory allocation and whether data transfer is overlapped with parallel computation. Performance varies with the application, the hardware version (such as the GPU generation), and the software version, including the programming language and drivers. This large number of design decisions makes it nearly impossible to reach the optimal performance point by directly porting an application. This emphasizes the need for high-level design guidelines for GPU-accelerated cluster systems that apply to a broad class of applications rather than to one specific application. This paper proposes novel design guidelines for GPU-accelerated cluster systems that optimize data transfer from host (CPU) to device (GPU) over the PCIe bus. In particular, we consider the design choices offered by NVIDIA GPUs. Our main contribution is a set of design guidelines applicable to a broad class of applications. We design 27 different versions of the same microbenchmark, each making a unique combination of design choices. We observe that a speedup of 2.6x can be obtained just by making good design choices.
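
To illustrate why overlapping host-to-device transfer with computation is one of the design choices that matters, the following is a minimal timing sketch in Python. It is a toy model, not the paper's microbenchmark: the function names, the single copy engine, the equal-sized chunks, and the one-second durations are all assumptions made for illustration, and it ignores per-transfer overheads and the effect of the memory allocation type.

```python
def serial_time(copy_s: float, compute_s: float) -> float:
    """Total time when the whole transfer finishes before computation starts."""
    return copy_s + compute_s


def pipelined_time(copy_s: float, compute_s: float, chunks: int) -> float:
    """Total time when the work is split into equal chunks and the
    computation on one chunk overlaps the transfer of the next.

    Assumes (hypothetically) one copy engine and one compute queue,
    each processing its chunks strictly in order.
    """
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    copy_chunk = copy_s / chunks
    compute_chunk = compute_s / chunks
    copy_done = 0.0      # time at which the latest chunk has arrived on the device
    compute_done = 0.0   # time at which the latest chunk has been processed
    for _ in range(chunks):
        copy_done += copy_chunk
        # A chunk can be processed only after it has arrived and the
        # previous chunk's computation has finished.
        compute_done = max(compute_done, copy_done) + compute_chunk
    return compute_done


if __name__ == "__main__":
    copy_s, compute_s = 1.0, 1.0  # hypothetical durations in seconds
    for n in (1, 2, 4, 8):
        t = pipelined_time(copy_s, compute_s, n)
        print(f"chunks={n}: {t:.3f} s, speedup {serial_time(copy_s, compute_s) / t:.2f}x")
```

In this model the gain from overlap is bounded by the longer of the two phases, so the benefit depends on how transfer time compares with compute time for a given application. The model makes no attempt to reproduce the speedup reported above.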
