623 Tflop/s HPCG run on Tianhe-2: Leveraging millions of hybrid cores

In this article, we present a new hybrid algorithm to enable and scale the high-performance conjugate gradients (HPCG) benchmark on large-scale heterogeneous systems such as Tianhe-2. An inner–outer subdomain partitioning strategy allows the data distribution between host and device to be balanced adaptively. Carefully rearranging and fusing operations significantly reduces the data-movement overhead of both MPI communication and PCI-E transfers. We exploit and analyze a variety of parallelization and optimization techniques for the performance-critical kernels to maximize the performance gain on both host and device. We carry out experiments on a small heterogeneous computer and on the world’s largest one, Tianhe-2. On the small system, we present a thorough comparison and analysis to select among the optimization choices. On Tianhe-2, the optimized implementation scales to the full system of 3.12 million heterogeneous cores, with an aggregate performance of 623 Tflop/s and a parallel efficiency of 81.2%.
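The abstract does not spell out how the inner–outer split is computed, so the following is only a minimal sketch under stated assumptions: each process owns a local nx × ny × nz grid, the outer shell (which touches the MPI halo) stays on the host, the inner box goes to the device, and balancing means choosing a uniform shell thickness. The function name `split_inner_outer` and the parameter `device_share` are hypothetical and do not come from the paper.

```python
def split_inner_outer(nx, ny, nz, device_share):
    """Choose a uniform shell thickness w for a local nx*ny*nz grid.

    Assumption (not from the paper): the inner box of size
    (nx-2w)*(ny-2w)*(nz-2w) is assigned to the device and the
    surrounding shell of thickness w to the host. We pick the w whose
    inner fraction is closest to `device_share` (a value in 0..1).

    Returns (w, inner_points, outer_points).
    """
    if not 0.0 < device_share < 1.0:
        raise ValueError("device_share must lie strictly between 0 and 1")

    total = nx * ny * nz
    # The inner box must keep at least one point per dimension.
    max_w = (min(nx, ny, nz) - 1) // 2
    if max_w < 1:
        raise ValueError("grid too small to carve out an inner box")

    best = None  # (gap, w, inner_points)
    for w in range(1, max_w + 1):
        inner = (nx - 2 * w) * (ny - 2 * w) * (nz - 2 * w)
        gap = abs(inner / total - device_share)
        if best is None or gap < best[0]:
            best = (gap, w, inner)

    _, w, inner = best
    return w, inner, total - inner


if __name__ == "__main__":
    # Hypothetical example: aim to give the device about half of the points.
    w, inner, outer = split_inner_outer(16, 16, 16, 0.5)
    print(f"shell thickness={w}, device points={inner}, host points={outer}")
```

In an adaptive scheme, `device_share` would presumably be tuned from measured host and device throughput rather than fixed in advance; a uniform shell is the simplest choice and a real implementation may use different thicknesses per dimension.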
