Acceleration of a Python-Based Tsunami Modelling Application via CUDA and OpenHMPP

Modern graphics processing units (GPUs) have become powerful and cost-effective computing platforms. Parallel programming standards (e.g. CUDA) and directive-based standards (such as OpenHMPP and OpenACC) are available to harness this tremendous computing power for large-scale modelling and simulation in scientific areas. ANUGA is a tsunami modelling application based on unstructured triangular meshes and implemented in Python/C. This paper explores issues in porting and optimizing a Python/C-based unstructured mesh application to GPUs. Two paradigms are compared: CUDA via the PyCUDA API, which involves writing GPU kernels, and OpenHMPP, which involves adding directives to C code. In either case, the 'naive' approach of transferring unstructured mesh data to the GPU for each kernel resulted in an actual slowdown relative to single-core CPU performance. Profiling confirmed that this is due to host-device data transfer times, even though every individual kernel achieved a good speedup. This necessitated an advanced approach in which all key data structures are mirrored on the host and the device. For both paradigms, this in turn required converting all code that updates these data structures to CUDA (or directive-augmented C, in the case of OpenHMPP). Furthermore, in the case of CUDA, the porting can no longer be done incrementally: all changes must be made in a single step. For debugging, this makes identifying which kernel(s) introduced bugs very difficult. To alleviate this, we adapted the relative debugging technique to the host-device context: in debugging mode, the mirrored data structures are updated at each step on both the host (using the original serial code) and the device, and any discrepancy is detected immediately. We present a generic Python-based implementation of this technique. With this approach, the CUDA version achieved a 2x speedup, and the OpenHMPP version achieved 1.6x.
The main optimization, rearranging the unstructured mesh to achieve coalesced memory access patterns, contributed 10% of the former. In terms of productivity, however, OpenHMPP achieved significantly better speedup per hour of programming effort.
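The relative debugging scheme described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: a NumPy function stands in for the device kernel so the sketch is self-contained, and the function names (`relative_debug_step`, `host_update`, `device_update`) are hypothetical. The pattern is the essential one, though: run the reference (serial host) update and the ported (device) update on mirrored copies of the same state, and raise on the first discrepancy so the faulty kernel is pinpointed immediately.

```python
import numpy as np

def relative_debug_step(host_update, device_update, state, rtol=1e-6):
    """Advance one step with both the reference and the ported update
    on mirrored copies of `state`; flag any divergence immediately.

    host_update / device_update: functions mapping an array to an
    updated array (the latter would wrap a GPU kernel in practice).
    """
    host_state = host_update(state.copy())
    device_state = device_update(state.copy())
    if not np.allclose(host_state, device_state, rtol=rtol):
        bad = np.flatnonzero(~np.isclose(host_state, device_state, rtol=rtol))
        raise RuntimeError(
            f"host/device divergence at {bad.size} elements, "
            f"first at index {bad[0]}"
        )
    return device_state

# A correct port agrees with the serial reference ...
serial = lambda a: a * 0.5 + 1.0
ported = lambda a: a / 2.0 + 1.0   # algebraically identical
state = relative_debug_step(serial, ported, np.arange(8, dtype=np.float64))

# ... while a buggy kernel is caught at the step that introduced it.
try:
    relative_debug_step(serial, lambda a: a * 0.6, np.ones(4))
except RuntimeError as e:
    print("caught:", e)
```

Because the comparison happens per step rather than at the end of the run, a bug introduced by any one kernel surfaces at exactly that kernel, which is what makes the non-incremental CUDA port debuggable.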
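The mesh-rearrangement optimization rests on a standard GPU idea: when one thread handles one triangle, an array-of-structures layout (all six vertex coordinates stored contiguously per triangle) makes consecutive threads issue strided reads, whereas a structure-of-arrays layout lets thread t read element t of each coordinate array, which coalesces. The sketch below is a hypothetical illustration of that layout change in NumPy, not ANUGA's actual code.

```python
import numpy as np

def to_soa(aos):
    """Convert per-triangle vertex data from array-of-structures
    (n_triangles, 6) to six contiguous structure-of-arrays columns,
    so consecutive GPU threads read consecutive memory addresses."""
    return tuple(np.ascontiguousarray(aos[:, k]) for k in range(6))

# Four triangles, each stored as [x0, y0, x1, y1, x2, y2]:
aos = np.arange(24, dtype=np.float64).reshape(4, 6)
x0, y0, x1, y1, x2, y2 = to_soa(aos)
# x0 is now the contiguous array [0., 6., 12., 18.]
```

On the host this is a one-time reshuffle; on the device it turns each warp's six strided loads per triangle into coalesced loads, which is consistent with the roughly 10% contribution to the CUDA speedup reported above.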
