Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE

We are witnessing the consolidation of the heterogeneous computing in parallel computing with architectures such as Cell Broadband Engine (Cell BE) or Graphics Processing Units (GPUs) which are present in a myriad of developments for high performance computing. These platforms provide a Software Development Kit (SDK) to maximize performance at the expense of dealing with complex and low-level architectural details which makes the software development a daunting task. This paper explores stencil computations in several heterogeneous programming models like Cell SDK, CellSs, ALF and CUDA to optimize the Jacobi method for solving Laplace’s differential equation. We describe the programming techniques to extract the maximum performance on the Cell BE and the GPU, and compare their computing paradigms. Experimental results are shown on two Nvidia Teslas and one IBM BladeCenter QS20 blade which incorporates two 3.2 GHz Cell BEs v 5.1. The speed-up factor for our set of GPU optimizations reaches 3–4×, and the execution times defeat those of the Cell BE by an order of magnitude, also showing great scalability when moving towards newer GPU generations and/or more demanding problem sizes.

[1]  John D. Owens,et al.  GPU Computing , 2008, Proceedings of the IEEE.

[2]  Bruce P. Lester The art of parallel programming , 1993 .

[3]  Patricia J. Teller,et al.  Proceedings of the 2008 ACM/IEEE conference on Supercomputing , 2008, HiPC 2008.

[4]  Scott Lathrop,et al.  Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis , 2011, International Conference on High Performance Computing.

[5]  Jens H. Krüger,et al.  A Survey of General‐Purpose Computation on Graphics Hardware , 2007, Eurographics.

[6]  Sanjay V. Rajopadhye,et al.  Towards Optimal Multi-level Tiling for Stencil Computations , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[7]  H. Peter Hofstee,et al.  Introduction to the Cell multiprocessor , 2005, IBM J. Res. Dev..

[8]  Scott B. Baden,et al.  Mint: realizing CUDA performance in 3D stencil methods with annotated C , 2011, ICS '11.

[9]  Rodrigo Weber dos Santos,et al.  Comparing CUDA and OpenGL implementations for a Jacobi iteration , 2009, 2009 International Conference on High Performance Computing & Simulation.

[10]  Samuel Williams,et al.  The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .

[11]  José L. Abellán,et al.  Characterizing the Basic Synchronization and Communication Operations in Dual Cell-Based Blades , 2008, ICCS.

[12]  Samuel Williams,et al.  Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[13]  M.D. McCool,et al.  Scalable Programming Models for Massively Multicore Processors , 2008, Proceedings of the IEEE.

[14]  James Demmel,et al.  Applied Numerical Linear Algebra , 1997 .

[15]  Wang Gui-bin,et al.  Optimizing stencil application on multi-thread GPU architecture using stream programming model , 2010, ARCS 2010.

[16]  John E. Stone,et al.  OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems , 2010, Computing in Science & Engineering.

[17]  Manuel E. Acacio,et al.  Fast and Efficient Synchronization and Communication Collective Primitives for Dual Cell-Based Blades , 2009, Euro-Par.

[18]  Peter Messmer,et al.  Parallel data-locality aware stencil computations on modern micro-architectures , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[19]  Richard W. Vuduc,et al.  Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems , 2009, ICS.

[20]  Satoshi Matsuoka,et al.  Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[21]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.