Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates

The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multicore wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated in the variable-coefficient case by its high number of bytes per lattice update. Our thread-group concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemp...
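As a rough illustration of the diamond tiling idea mentioned above, the following sketch applies it to a 1D three-point Jacobi stencil. It is not the authors' implementation: it is single-threaded, omits the wavefront and thread-group parts, and all names (`jacobi_naive`, `diamond_schedule`, `jacobi_diamond`, `width`) are hypothetical. It assumes fixed boundary values and groups space-time points `(t, x)` into diamonds via the skewed coordinates `t + x` and `t - x`; every dependency of a point has coordinates that are no larger in both, so tiles can be run in order of increasing tile-index sum.

```python
def jacobi_naive(u0, steps):
    """Reference version: sweep the whole grid once per time step."""
    old = list(u0)
    for _ in range(steps):
        new = list(old)  # boundary values are carried over unchanged
        for x in range(1, len(old) - 1):
            new[x] = (old[x - 1] + old[x] + old[x + 1]) / 3.0
        old = new
    return old


def diamond_schedule(n, steps, width):
    """Group all updates (t, x) into diamond tiles (hypothetical helper).

    A point belongs to tile (A, B) with A = (t + x) // width and
    B = (t - x) // width. Tiles with equal A + B do not depend on
    each other, so each such row could be processed concurrently.
    """
    tiles = {}
    for t in range(1, steps + 1):          # outer loop keeps each tile sorted by t
        for x in range(1, n - 1):          # interior points only
            key = ((t + x) // width, (t - x) // width)
            tiles.setdefault(key, []).append((t, x))
    order = sorted(tiles, key=lambda k: (k[0] + k[1], k[0]))
    return [tiles[k] for k in order]


def jacobi_diamond(u0, steps, width):
    """Same result as jacobi_naive, but visiting updates tile by tile."""
    if width < 1:
        raise ValueError("width must be at least 1")
    # Two buffers indexed by time-step parity; both hold the fixed boundaries.
    buf = [list(u0), list(u0)]
    for tile in diamond_schedule(len(u0), steps, width):
        for t, x in tile:
            src = buf[(t - 1) % 2]
            buf[t % 2][x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0
    return buf[steps % 2]


if __name__ == "__main__":
    u0 = [0.0] * 32
    u0[16] = 1.0
    # The tiled traversal performs the same arithmetic in a different order.
    assert jacobi_diamond(u0, 10, 4) == jacobi_naive(u0, 10)
```

The point of the reordering is that each diamond advances a small region over several time steps before moving on, so in a real implementation the working set of a tile could stay in cache. The explicit schedule built here only makes the traversal order visible; a production code would generate the loop bounds directly instead of storing every point.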
