A High-productivity Framework for Multi-GPU computation of Mesh-based applications

The paper proposes a high-productivity framework for multiGPU computation of mesh-based applications. In order to achieve high performance on these applications, we have to introduce complicated optimized techniques for GPU computing, which requires relatively-high cost of implementation. Our framework automatically translates user-written functions that update a grid point and generates both GPU and CPU code. In order to execute user code on multiple GPUs, the framework parallelizes this code by using MPI and OpenMP. The framework also provides C++ classes to write GPU-GPU communication effectively. The programmers write user code just in the C++ language and can develop program code optimized for GPU supercomputers without introducing complicated optimizations for GPU computation and GPU-GPU communication. As an experiment evaluation, we have implemented multi-GPU computation of a diffusion equation by using this framework and achieved good weak scaling results. By using peer-to-peer access between GPUs in this framework, the frameworkbased diffusion computation using two NVIDIA Tesla K20X GPUs is 1.4 times faster than manual implementation code. We also show computational results of the Rayleigh-Taylor instability obtained by 3D compressible flow computation written by this framework.

[1]  Matthias Christen,et al.  Patus for convenient high-performance stencils: Evaluation in earthquake simulations , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[2]  Scott B. Baden,et al.  Mint: realizing CUDA performance in 3D stencil methods with annotated C , 2011, ICS '11.

[3]  Helmar Burkhart,et al.  PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[4]  Ulrich Rüde,et al.  A flexible Patch-based lattice Boltzmann parallelization approach for heterogeneous GPU-CPU clusters , 2010, Parallel Comput..

[5]  Takashi Shimokawabe,et al.  145 TFlops Performance on 3990 GPUs of TSUBAME 2.0 Supercomputer for an Operational Weather Prediction , 2011, ICCS.

[6]  Manish Vachharajani,et al.  GPU acceleration of numerical weather prediction , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[7]  Jack Dongarra,et al.  Proceedings of the International Conference on Computational Science, ICCS 2011 , 2011 .

[8]  Satoshi Matsuoka,et al.  An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[9]  Satoshi Matsuoka,et al.  Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[10]  Adrian Sandu,et al.  Multi-core acceleration of chemical kinetics for simulation and prediction , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[11]  Satoshi Matsuoka,et al.  Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).