Accelerating HPCG on Tianhe-2: A hybrid CPU-MIC algorithm

In this paper, we propose a hybrid algorithm to enable and accelerate the High Performance Conjugate Gradient (HPCG) benchmark on a heterogeneous node with an arbitrary number of accelerators. In the hybrid algorithm, each subdomain is assigned to a node after a three-dimensional domain decomposition. The subdomain is further divided to several regular inner blocks and an outer part with a flexible inner-outer partitioning strategy. Each inner task is assigned to a MIC device and the size is adjustable to adapt the accelerator's computational power. The only outer part is assigned to CPU and the thickness of boundary size is also adjustable to maintain load balance between CPU and MICs. By properly fusing the computational kernels with preceding ones, we present an asynchronous data transfer scheme to better overlap local computation with the PCI-express data transfer. All basic HPCG kernels, especially the time-consuming sparse matrix-vector multiplication (SpMV) and the symmetric Gauss-Seidel relaxation (SymGS), are extensively optimized for both CPU and MIC, on both algorithmic and architectural levels. On a single node of Tianhe-2 which is composed of an Intel Xeon processor and three Intel Xeon Phi coprocessors, we successfully obtain an aggregated performance of 50.2 Gflops, which is around 1.5% of the peak performance.

[1]  Sandia Report,et al.  Toward a New Metric for Ranking High Performance Computing Systems , 2013 .

[2]  Pradeep Dubey,et al.  Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver , 2014, ISC.

[3]  Katherine Yelick,et al.  OSKI: A library of automatically tuned sparse matrix kernels , 2005 .

[4]  Xing Liu,et al.  Efficient sparse matrix-vector multiplication on x86-based many-core processors , 2013, ICS '13.

[5]  Samuel Williams,et al.  Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[6]  Jérémie Allard,et al.  Parallel Dense Gauss-Seidel Algorithm on Many-Core Processors , 2009, 2009 11th IEEE International Conference on High Performance Computing and Communications.

[7]  Gerhard Wellein,et al.  Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization , 2009, 2009 33rd Annual IEEE International Computer Software and Applications Conference.

[8]  Arutyun Avetisyan,et al.  Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures , 2010, HiPEAC.

[9]  Sandia Report,et al.  HPCG Technical Specification , 2013 .

[10]  Francisco Tirado,et al.  Vectorization of multigrid codes using SIMD ISA extensions , 2003, Proceedings International Parallel and Distributed Processing Symposium.

[11]  David G. Wonnacott,et al.  Using time skewing to eliminate idle time due to memory bandwidth and network limitations , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.

[12]  Samuel Williams,et al.  Optimization of geometric multigrid for emerging multi- and manycore processors , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[13]  Ravi Narayanaswamy,et al.  Offload Compiler Runtime for the Intel® Xeon Phi Coprocessor , 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.

[14]  Samuel Williams,et al.  Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2009, Parallel Comput..

[15]  Ulrich Rüde,et al.  Cache-Aware Multigrid Methods for Solving Poisson's Equation in Two Dimensions , 2000, Computing.

[16]  Chao Yang,et al.  Optimizing and Scaling HPCG on Tianhe-2: Early Experience , 2014, ICA3PP.

[17]  Ravi Narayanaswamy,et al.  Offload Compiler Runtime for the Intel® Xeon Phi Coprocessor , 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.

[18]  Erik Hagersten,et al.  Multigrid and Gauss-Seidel smoothers revisited: parallelization on chip multiprocessors , 2006, ICS '06.