Paper information - Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor

Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the CPU. We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM® Blue Gene®/P supercomputer’s PowerPC® 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU’s instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7× speedup over the best previously published results.
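
For readers unfamiliar with the term, the sketch below shows what a three-dimensional 27-point stencil computes: each interior grid point is replaced by a weighted sum of itself and its 26 nearest neighbors. This is only a plain-Python reference definition, not the paper's method, which synthesizes and schedules PowerPC 450 assembly. The function names `stencil27` and `symmetric_weights`, the nested-list grid layout, the copy-through boundary handling, and the grouping of coefficients into center, face, edge, and corner classes are all assumptions made for illustration.

```python
from itertools import product

# All 27 offsets (di, dj, dk) with each component in {-1, 0, 1}.
OFFSETS = list(product((-1, 0, 1), repeat=3))


def symmetric_weights(center, face, edge, corner):
    """Build a weight table keyed by offset (hypothetical helper).

    An offset's class is the number of nonzero components:
    0 = center, 1 = face, 2 = edge, 3 = corner.
    """
    by_class = (center, face, edge, corner)
    return {o: by_class[sum(1 for d in o if d != 0)] for o in OFFSETS}


def stencil27(grid, weights):
    """Apply a 27-point stencil to the interior of a 3D nested-list grid.

    Boundary cells are copied unchanged (an assumption of this sketch).
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    out = [[row[:] for row in plane] for plane in grid]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                out[i][j][k] = sum(
                    weights[(di, dj, dk)] * grid[i + di][j + dj][k + dk]
                    for di, dj, dk in OFFSETS
                )
    return out


# Usage: a 3x3x3 grid of ones has a single interior point.
ones = [[[1] * 3 for _ in range(3)] for _ in range(3)]
result = stencil27(ones, symmetric_weights(1, 10, 100, 1000))
```

With the weights above, the single interior value counts the members of each class: 1 center, 6 faces, 12 edges, and 8 corners. Every output point reads 27 inputs but each input is reused by its neighbors, which is why such kernels are treated as streaming workloads.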
