A Multilevel Parallelization Framework for High-Order Stencil Computations

Stencil-based computation on structured grids is a common kernel in a broad range of scientific applications. The order of the stencil increases with the required precision, and optimizing such high-order stencils on multicore architectures is a challenge. Here, we propose a multilevel parallelization framework that combines: (1) inter-node parallelism by spatial decomposition; (2) intra-chip parallelism through multithreading; and (3) data-level parallelism via single-instruction multiple-data (SIMD) techniques. The framework is applied to a 6th-order stencil-based seismic wave propagation code on a suite of multicore architectures. Strong-scaling tests exhibit superlinear speedup, due to increasing aggregate cache capacity, on Intel Harpertown and AMD Barcelona based clusters, whereas the weak-scaling parallel efficiency is 0.92 on 65,536 BlueGene/P processors. Multithreading+SIMD optimizations achieve a 7.85-fold speedup on a dual quad-core Intel Clovertown, and the data-level parallel efficiency is found to depend on the stencil order.
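To make the terms concrete, the following is a minimal one-dimensional sketch of a 6th-order stencil combined with spatial decomposition. It is not the paper's seismic code. The coefficients are the standard central-difference weights for a second derivative, and the function names (`laplacian_1d`, `decomposed_laplacian_1d`), the serial loop over subdomains, and the grid setup are assumptions made for illustration. In a real implementation each subdomain would live on a separate node and its halo cells would be filled by message passing, while the innermost loop is the part that multithreading and SIMD would target.

```python
# Standard 6th-order central-difference weights for the second derivative.
# The stencil reaches 3 grid points to each side, so the halo width is 3.
COEFFS = [1 / 90, -3 / 20, 3 / 2, -49 / 18, 3 / 2, -3 / 20, 1 / 90]
HALO = 3


def laplacian_1d(u, h):
    """Apply the stencil to every point of u that has HALO neighbours on
    both sides. Returns len(u) - 2 * HALO values."""
    out = []
    for i in range(HALO, len(u) - HALO):
        acc = 0.0
        # Innermost loop over stencil points: the target of SIMD vectorization.
        for k, c in enumerate(COEFFS):
            acc += c * u[i + k - HALO]
        out.append(acc / (h * h))
    return out


def decomposed_laplacian_1d(u, h, nparts):
    """Hypothetical stand-in for inter-node spatial decomposition.

    The interior points are split into nparts contiguous subdomains. Each
    subdomain receives a copy of its owned points plus HALO ghost points on
    each side (here a list slice; on a cluster, a halo exchange).
    """
    n_interior = len(u) - 2 * HALO
    chunk = -(-n_interior // nparts)  # ceiling division
    result = []
    for p in range(nparts):
        start = p * chunk
        stop = min(start + chunk, n_interior)
        if start >= stop:
            break
        # Owned points are u[start + HALO : stop + HALO]; add ghosts around them.
        local = u[start : stop + 2 * HALO]
        result.extend(laplacian_1d(local, h))
    return result


if __name__ == "__main__":
    h = 0.1
    u = [(i * h) ** 2 for i in range(46)]  # u(x) = x^2, so u'' = 2 everywhere
    full = laplacian_1d(u, h)
    split = decomposed_laplacian_1d(u, h, 4)
    print(max(abs(a - b) for a, b in zip(full, split)))
```

The decomposed result matches the single-domain result because every subdomain sees exactly the neighbour values it needs. The halo width equals the stencil half-width, so the amount of data to exchange grows with the stencil order.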
