Paper Information - Automatically Optimizing Stencil Computations on Many-Core NUMA Architectures

This paper presents a system for automatically supporting the optimization of stencil kernels on emerging Non-Uniform Memory Access (NUMA) many-core architectures, through a combined compiler and runtime approach. In particular, we use a pragma-driven compiler to recognize the special structures and optimization needs of stencil computations and thereby to automatically generate low-level code that efficiently utilizes the data placement and management support of a C++ runtime on top of the NUMA API, a programming interface to the NUMA policy supported by the Linux kernel. Our results show that through automated specialization of code generation, this approach provides a combined benefit of performance, portability, and productivity for developers.
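As a rough illustration of the kind of kernel the abstract refers to, the sketch below shows a 5-point Jacobi stencil whose interior rows are split into contiguous blocks, the sort of partition a NUMA-aware runtime could bind to individual memory nodes. It is not the paper's generated code or runtime: the names `partition_rows`, `jacobi_block`, and `jacobi_sweep` are hypothetical, the blocks run sequentially here, and no NUMA API calls are made.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split the interior rows [1, n-1) of an n x n grid into
// `parts` contiguous blocks of near-equal size. In a NUMA-aware setting each
// block could be placed on, and computed by threads of, one memory node.
std::vector<std::pair<std::size_t, std::size_t>>
partition_rows(std::size_t n, std::size_t parts) {
    std::vector<std::pair<std::size_t, std::size_t>> blocks;
    if (n < 3 || parts == 0) return blocks;  // no interior rows to split
    const std::size_t interior = n - 2;
    std::size_t begin = 1;
    for (std::size_t p = 0; p < parts; ++p) {
        // The first (interior % parts) blocks get one extra row.
        const std::size_t len = interior / parts + (p < interior % parts ? 1 : 0);
        blocks.emplace_back(begin, begin + len);
        begin += len;
    }
    return blocks;
}

// One Jacobi update of a 5-point stencil over rows [row_begin, row_end).
// Boundary rows and columns are left untouched.
void jacobi_block(const std::vector<double>& in, std::vector<double>& out,
                  std::size_t n, std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = row_begin; i < row_end; ++i) {
        for (std::size_t j = 1; j + 1 < n; ++j) {
            out[i * n + j] = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j] +
                                     in[i * n + j - 1] + in[i * n + j + 1]);
        }
    }
}

// Sequential stand-in for a parallel sweep: one call per row block.
// A real runtime would run each block on a thread near that block's memory.
void jacobi_sweep(const std::vector<double>& in, std::vector<double>& out,
                  std::size_t n, std::size_t parts) {
    for (const auto& blk : partition_rows(n, parts)) {
        jacobi_block(in, out, n, blk.first, blk.second);
    }
}
```

Calling `jacobi_sweep(in, out, n, parts)` with two `n * n` row-major grids performs one sweep; swapping `in` and `out` between calls gives the usual iterative scheme. Keeping the partition separate from the update makes it easy to substitute a different placement strategy without touching the stencil itself.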

Qing Yi | Daniel J. Quinlan | Chunhua Liao | Pei-Hung Lin | Yongqing Yan

[1] Barbara M. Chapman, et al. Performance Oriented Programming for NUMA Architectures, 2001, WOMPAT.

[2] Samuel Williams, et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[3] Chris Johnson, et al. Data Distribution, Migration and Replication on a cc-NUMA Architecture, 2002.

[4] Barbara M. Chapman, et al. Enabling locality-aware computations in OpenMP, 2010, Sci. Program.

[5] Uday Bondhugula, et al. A practical automatic polyhedral parallelizer and locality optimizer, 2008, PLDI '08.

[6] Zhiyuan Li, et al. New tiling techniques to improve cache temporal locality, 1999, PLDI '99.

[7] Robert J. Fowler, et al. NUMA policies and their relation to memory architecture, 1991, ASPLOS IV.

[8] J. Shalf, et al. Auto-Tuning the 27-point Stencil for Multicore, 2009.

[9] Torsten Hoefler, et al. NUMA-aware shared-memory collective communication for MPI, 2013, HPDC.

[10] Chau-Wen Tseng, et al. Tiling Optimizations for 3D Scientific Computations, 2000, ACM/IEEE SC 2000 Conference (SC'00).

[11] David A. Padua, et al. Compiler Techniques for the Distribution of Data and Computation, 2003, IEEE Trans. Parallel Distributed Syst.

[12] Lixia Liu, et al. Improving parallelism and locality with asynchronous algorithms, 2010, PPoPP '10.

[13] Cheng Wang, et al. Data locality enhancement by memory reduction, 2001, ICS '01.

[14] Joseph Antony, et al. Exploring Thread and Memory Placement on NUMA Architectures: Solaris and Linux, UltraSPARC/FirePlane and Opteron/HyperTransport, 2006, HiPC.

[15] Peter Messmer, et al. Parallel data-locality aware stencil computations on modern micro-architectures, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[16] Uday Bondhugula, et al. Effective automatic parallelization of stencil computations, 2007, PLDI '07.

[17] Eduard Ayguadé, et al. Is Data Distribution Necessary in OpenMP?, 2000, ACM/IEEE SC 2000 Conference (SC'00).

[18] Robert Strzodka, et al. NUMA Aware Iterative Stencil Computations on Many-Core Systems, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[19] Qing Yi, et al. POET: a scripting language for applying parameterized source‐to‐source program transformations, 2012, Softw. Pract. Exp.

[20] Jonathan Harris, et al. Extending OpenMP For NUMA Machines, 2000, ACM/IEEE SC 2000 Conference (SC'00).