Automatically Optimizing Stencil Computations on Many-Core NUMA Architectures

This paper presents a system that automatically optimizes stencil kernels on emerging many-core Non-Uniform Memory Access (NUMA) architectures through a combined compiler and runtime approach. A pragma-driven compiler recognizes the special structure and optimization needs of stencil computations and automatically generates low-level code that efficiently uses the data placement and management support of a C++ runtime. The runtime is built on top of the NUMA API, a programming interface to the NUMA policy supported by the Linux kernel. Our results show that, through automated specialization of code generation, this approach offers developers a combined benefit in performance, portability, and productivity.
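As a rough illustration of the kind of low-level code such a compiler might generate, the sketch below splits the interior rows of a 2D grid into one contiguous block per NUMA node and applies a 5-point Jacobi stencil block by block. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `RowRange`, `partition_rows`, `sweep_block`, and `jacobi_step` are hypothetical, the choice of a 5-point stencil is ours, and actual placement (binding each block's memory and threads to its node) is only indicated in comments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type: the rows [begin, end) assigned to one NUMA node.
struct RowRange {
    std::size_t begin;
    std::size_t end;
};

// Split the interior rows 1..rows-2 into `nodes` contiguous blocks whose
// sizes differ by at most one row. Returns an empty list if there is no
// interior or no node.
std::vector<RowRange> partition_rows(std::size_t rows, std::size_t nodes) {
    std::vector<RowRange> parts;
    if (rows < 3 || nodes == 0) return parts;
    const std::size_t interior = rows - 2;
    const std::size_t base = interior / nodes;
    const std::size_t extra = interior % nodes;
    std::size_t start = 1;
    for (std::size_t n = 0; n < nodes; ++n) {
        const std::size_t len = base + (n < extra ? 1 : 0);
        parts.push_back({start, start + len});
        start += len;
    }
    return parts;
}

// Apply one 5-point Jacobi update to the interior columns of the rows in r.
void sweep_block(const std::vector<double>& in, std::vector<double>& out,
                 std::size_t cols, RowRange r) {
    for (std::size_t i = r.begin; i < r.end; ++i) {
        for (std::size_t j = 1; j + 1 < cols; ++j) {
            out[i * cols + j] = 0.25 * (in[(i - 1) * cols + j] +
                                        in[(i + 1) * cols + j] +
                                        in[i * cols + j - 1] +
                                        in[i * cols + j + 1]);
        }
    }
}

// One time step over the whole grid. Boundary cells of `out` are left
// untouched, so the caller should initialize `out` as a copy of `in`.
// Assumption: in a NUMA-aware version, each block would be processed by
// threads running on its own node, with that block's rows allocated in
// node-local memory (for example through libnuma). Here the blocks are
// simply visited in sequence.
void jacobi_step(const std::vector<double>& in, std::vector<double>& out,
                 std::size_t rows, std::size_t cols, std::size_t nodes) {
    assert(in.size() == rows * cols && out.size() == rows * cols);
    for (const RowRange& r : partition_rows(rows, nodes)) {
        sweep_block(in, out, cols, r);
    }
}
```

Each block reads only one halo row above and one below its own range, so most accesses stay within the memory that would be placed on the owning node; only the halo rows cross node boundaries.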
