Domain-Specific Optimization of Two Jacobi Smoother Kernels and Their Evaluation in the ECM Performance Model

Our aim is to apply program transformations to stencil codes in order to yield the highest possible performance. We recognize memory bandwidth as a major limitation in stencil code performance. We conducted a study in which we applied optimizing transformations to two Jacobi smoother kernels: one 3D 1st-order 7-point stencil and one 3D 3rd-order 19-point stencil. To obtain high performance, the optimizations have to be customized for the execution platform at hand. We illustrate this by experiments on two consumer and two server architectures. We also veried the need for complex optimizations with the help of the Execution-Cache-Memory performance model. A code generator with knowledge about stencil codes and execution platforms should be able to apply our transformations automatically. We are working towards such a generator in project ExaStencils.

[1]  David G. Wonnacott,et al.  Using time skewing to eliminate idle time due to memory bandwidth and network limitations , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.

[2]  Christian Lengauer,et al.  Optimization of two Jacobi Smoother Kernels by Domain-Specific Program Transformation , 2014 .

[3]  Samuel Williams,et al.  Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures , 2008 .

[4]  Gerhard Wellein,et al.  Introduction to High Performance Computing for Scientists and Engineers , 2010, Chapman and Hall / CRC computational science series.

[5]  Samuel Williams,et al.  An auto-tuning framework for parallel multicore stencil computations , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[6]  Gerhard Wellein,et al.  Leveraging Shared Caches for Parallel Temporal Blocking of Stencil Codes on Multicore Processors and Clusters , 2010, Parallel Process. Lett..

[7]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, & Tools with Gradiance , 2007 .

[8]  Katherine Yelick,et al.  Auto-tuning stencil codes for cache-based multicore platforms , 2009 .

[9]  Gerhard Wellein,et al.  Exploring performance and power properties of modern multi‐core chips via simple machine models , 2012, Concurr. Comput. Pract. Exp..

[10]  Matthias Bolten,et al.  Multigrid methods for structured grids and their application in particle simulation , 2008 .

[11]  Samuel Williams,et al.  Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors , 2007, SIAM Rev..

[12]  Pradeep Dubey,et al.  3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[13]  Uday Bondhugula,et al.  Effective automatic parallelization of stencil computations , 2007, PLDI '07.

[14]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.

[15]  Gerhard Wellein,et al.  Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization , 2009, 2009 33rd Annual IEEE International Computer Software and Applications Conference.

[16]  Volker Strumpen,et al.  Cache oblivious stencil computations , 2005, ICS '05.

[17]  Gerhard Wellein,et al.  Efficient multicore-aware parallelization strategies for iterative stencil computations , 2010, J. Comput. Sci..

[18]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[19]  Georg Hager,et al.  Introducing a Performance Model for Bandwidth-Limited Loop Kernels , 2009, PPAM.

[20]  Weiqiang Wang,et al.  In-Core Optimization of High-Order Stencil Computations , 2009, PDPTA.

[21]  Samuel Williams,et al.  Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.