Effective Use of Large High-Bandwidth Memory Caches in HPC Stencil Computation via Temporal Wave-Front Tiling

Stencil computation is an important class of algorithms used in a wide variety of scientific-simulation applications. The performance of stencil calculations is often bounded by memory bandwidth. High-bandwidth memory (HBM) on devices such as those in the Intel® Xeon Phi™ x200 processor family (code-named Knights Landing) can thus provide additional performance. In a traditional sequential time-step approach, the additional bandwidth is best utilized when the stencil data fits into the HBM, which restricts the problem sizes that can be undertaken and under-utilizes the larger DDR memory on the platform. As problem sizes become significantly larger than the HBM, the effective bandwidth approaches that of the DDR, degrading performance. This paper explores the use of temporal wave-front tiling to add a further layer of cache-blocking, allowing efficient use of both the HBM bandwidth and the DDR capacity. Details of the cache-blocking and wave-front tiling algorithms are given, and results on a Xeon Phi processor are presented, comparing performance across problem sizes and among four experimental configurations. Analyses of the bandwidth utilization and HBM-cache hit rates are also provided, illustrating the correlation between these metrics and performance. It is demonstrated that, for large problem sizes, temporal wave-front tiling can provide a 2.4x speedup compared to using HBM cache without temporal tiling and a 3.3x speedup compared to using only DDR memory.
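
To make the contrast between the sequential time-step approach and temporal wave-front tiling concrete, the following is a minimal one-dimensional sketch, not the paper's implementation. It assumes a 3-point averaging stencil with fixed boundary cells and two ping-pong buffers; the names `step_range`, `sweep_naive`, `sweep_wavefront`, `tile_w`, and `t_block` are hypothetical and chosen only for illustration. Each spatial tile is advanced through several time steps before the next tile is touched, and the tile is shifted left by one cell (the stencil radius) per time step so that every value it reads has already been computed.

```python
def step_range(src, dst, lo, hi):
    """Apply a 3-point averaging stencil to cells lo..hi-1 (reads src, writes dst)."""
    for x in range(lo, hi):
        dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0


def sweep_naive(u0, steps):
    """Sequential time-step approach: one full pass over the grid per step."""
    bufs = [list(u0), list(u0)]  # ping-pong buffers; boundary cells stay fixed
    n = len(u0)
    for t in range(steps):
        step_range(bufs[t % 2], bufs[(t + 1) % 2], 1, n - 1)
    return bufs[steps % 2]


def sweep_wavefront(u0, steps, tile_w, t_block):
    """Temporal wave-front tiling (assumed 1D form, stencil radius 1).

    Time is split into blocks of up to t_block steps. Within a block, tiles
    are visited left to right, and each tile runs all steps of the block
    while its data is still "hot". The tile window moves left by one cell
    per step so it never reads a value that has not been produced yet.
    """
    bufs = [list(u0), list(u0)]
    n = len(u0)
    for t0 in range(0, steps, t_block):
        depth = min(t_block, steps - t0)  # the last block may be shorter
        # Tile starts run past the right edge so the shifted windows still
        # cover the last interior cell at the final step of the block.
        for start in range(1, n - 1 + depth - 1, tile_w):
            for dt in range(depth):
                t = t0 + dt
                lo = max(1, start - dt)               # clip to the interior
                hi = min(n - 1, start + tile_w - dt)
                step_range(bufs[t % 2], bufs[(t + 1) % 2], lo, hi)
    return bufs[steps % 2]


if __name__ == "__main__":
    grid = [float((i * 7) % 11) for i in range(64)]
    tiled = sweep_wavefront(grid, steps=10, tile_w=8, t_block=4)
    print(tiled == sweep_naive(grid, steps=10))
```

Both functions perform the same arithmetic on every cell, so their results should agree exactly; only the order of memory accesses differs. In the setting the abstract describes, the tile and time-block sizes would be chosen so that a tile's working set fits in the HBM cache, which is how the reuse across time steps reduces traffic to DDR.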
