Paper information - Toward efficient distribution of MPDATA stencil computation on Intel MIC architecture

The multidimensional positive definite advection transport algorithm (MPDATA) belongs to the group of nonoscillatory forward-in-time algorithms and performs a sequence of stencil computations. MPDATA is one of the major parts of the dynamic core of the EULAG geophysical model. The Intel Xeon Phi coprocessor is the first product based on the Intel Many Integrated Core (Intel MIC) architecture. This architecture offers notable performance advantages over traditional processors and supports practically the same traditional parallel programming model. In this work, we outline an approach to adapting the 3D MPDATA algorithm to the Intel MIC architecture. This approach is based on a combination of temporal and space blocking techniques; it allows us to ease memory and communication bounds and to better exploit the theoretical floating-point efficiency of the target computing platforms. To utilize the computing resources available in the Intel Xeon Phi, the proposed approach employs two main levels of parallelism: (i) task parallelism, which allows for utilization of more than 200 logical cores, and (ii) data parallelism, which makes efficient use of the 512-bit vector processing units. An important method of improving the efficiency of the block decomposition is partitioning the available cores/threads into teams. This partitioning reduces inter-cache communication overheads and increases opportunities for the efficient distribution of MPDATA computation onto the available resources. The purpose is to provide a trade-off between two coupled criteria: load balancing and intra-cache communication. We discuss performance results obtained on two platforms: one with two Intel Xeon E5-2643 CPUs and an Intel Xeon Phi 3120A, and one with two Intel Xeon E5-2697 v2 CPUs and an Intel Xeon Phi 7120P. The top-of-the-line Intel Xeon Phi 7120P gives the best performance results for all tests. The achieved performance results provide a basis for fur-

Lukasz Szustak Czestochowa | Krzysztof Rojek Czestochowa

[1] Lukasz Szustak, et al. Using Blue Gene/P and GPUs to Accelerate Computations in the EULAG Model, 2011, LSSC.

[2] Lukasz Szustak, et al. Performance Analysis for Stencil-Based 3D MPDATA Algorithm on GPU Architecture, 2013, PPAM.

[3] Lukasz Szustak, et al. Model-driven adaptation of double-precision matrix multiplication to the Cell processor architecture, 2012, Parallel Comput.

[4] Piotr K. Smolarkiewicz, et al. Multidimensional positive definite advection transport algorithm: an overview, 2006.

[5] Gerhard Wellein, et al. Leveraging Shared Caches for Parallel Temporal Blocking of Stencil Codes on Multicore Processors and Clusters, 2010, Parallel Process. Lett.

[6] Lukasz Szustak, et al. Parallelization of EULAG Model on Multicore Architectures with GPU Accelerators, 2011, PPAM.

[7] Piotr K. Smolarkiewicz, et al. Towards petascale simulation of atmospheric circulations with soundproof equations, 2011.

[8] Gerhard Wellein, et al. Efficient multicore-aware parallelization strategies for iterative stencil computations, 2010, J. Comput. Sci.