Paper information - A Multilevel Parallelization Framework for High-Order Stencil Computations


Stencil-based computation on structured grids is a common kernel in a broad range of scientific applications. The order of the stencil increases with the required precision, and optimizing such high-order stencils on multicore architectures is a challenge. Here, we propose a multilevel parallelization framework that combines: (1) inter-node parallelism by spatial decomposition; (2) intra-chip parallelism through multithreading; and (3) data-level parallelism via single-instruction multiple-data (SIMD) techniques. The framework is applied to a 6th-order stencil-based seismic wave propagation code on a suite of multicore architectures. Strong-scaling tests exhibit superlinear speedup due to increasing cache capacity on Intel Harpertown and AMD Barcelona based clusters, whereas the weak-scaling parallel efficiency is 0.92 on 65,536 BlueGene/P processors. Multithreading+SIMD optimizations achieve a 7.85-fold speedup on a dual quad-core Intel Clovertown, and the data-level parallel efficiency is found to depend on the stencil order.
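To make the first level of the framework (spatial decomposition) concrete, the following is a minimal sketch and not the paper's implementation. It assumes a 1D grid and the standard sixth-order central-difference weights for the second derivative; the abstract specifies neither the dimensionality nor the exact stencil. The names `apply_stencil`, `apply_decomposed`, `COEFFS`, and `RADIUS` are hypothetical. Each subdomain is padded with a halo whose width equals the stencil radius, which is why a higher stencil order means more data to exchange between neighbors.

```python
# Sixth-order central-difference weights for the second derivative
# (assumed stencil; the center weight comes first, then offsets 1..3).
COEFFS = (-49.0 / 18.0, 3.0 / 2.0, -3.0 / 20.0, 1.0 / 90.0)
RADIUS = len(COEFFS) - 1  # halo width equals the stencil radius (3)


def apply_stencil(u, h):
    """Apply the stencil to every point of u that has RADIUS neighbors
    on both sides; returns len(u) - 2 * RADIUS values."""
    out = []
    for i in range(RADIUS, len(u) - RADIUS):
        acc = COEFFS[0] * u[i]
        for k in range(1, RADIUS + 1):
            acc += COEFFS[k] * (u[i - k] + u[i + k])
        out.append(acc / (h * h))
    return out


def apply_decomposed(u, h, nparts):
    """Split the interior points into nparts contiguous subdomains.
    Each subdomain works on its own points plus RADIUS halo points per
    side, mimicking what a halo exchange would deliver on a cluster."""
    n_interior = len(u) - 2 * RADIUS
    out = []
    for p in range(nparts):
        lo = p * n_interior // nparts
        hi = (p + 1) * n_interior // nparts
        # Owned interior points lo..hi-1 sit at global indices
        # lo+RADIUS..hi-1+RADIUS, so the padded slice is u[lo:hi+2*RADIUS].
        local = u[lo:hi + 2 * RADIUS]
        out.extend(apply_stencil(local, h))
    return out
```

With `u = [float(i * i) for i in range(20)]` and `h = 1.0`, both functions return 14 values close to 2.0 (the second derivative of x squared), and the decomposed result matches the single-domain result. In a real code the outer loop over subdomains would map to processes exchanging halos, and the inner loops would be the targets for multithreading and SIMD.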
