Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor

Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the CPU. We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM® Blue Gene®/P supercomputer's PowerPC® 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU's instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7× speedup over the best previously published results.
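For readers unfamiliar with the term, the following is a minimal reference sketch of what a 27-point stencil computes on a three-dimensional grid. It is plain Python written for clarity, not the synthesized PowerPC 450 assembly described above. The function name `stencil27`, the symmetric weighting by neighbour class, and the weight values are all assumptions made for illustration.

```python
from itertools import product

def stencil27(a, w):
    """Apply a 27-point stencil to the interior of a 3D nested list `a`.

    `w` maps the number of nonzero offsets (0 = center, 1 = face,
    2 = edge, 3 = corner) to a weight. This symmetric weighting is an
    assumption for illustration. Boundary points are left at 0.0.
    """
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    out = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    offsets = list(product((-1, 0, 1), repeat=3))  # the 27 neighbour offsets
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                acc = 0.0
                for di, dj, dk in offsets:
                    cls = abs(di) + abs(dj) + abs(dk)  # neighbour class 0..3
                    acc += w[cls] * a[i + di][j + dj][k + dk]
                out[i][j][k] = acc
    return out

# Hypothetical weights spread over 1 center, 6 faces, 12 edges, 8 corners.
weights = {0: 0.5, 1: 0.25 / 6, 2: 0.125 / 12, 3: 0.125 / 8}
grid = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
result = stencil27(grid, weights)
```

Because the hypothetical weights sum to one, applying the stencil to a constant field leaves interior values unchanged up to rounding, which is a convenient sanity check. Each output point reads 27 inputs with unit-stride access along the innermost dimension, which is the regular streaming access pattern the abstract refers to.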
