Paper Information - Reducing Communication Overhead in the High Performance Conjugate Gradient Benchmark on Tianhe-2

The High Performance Conjugate Gradient (HPCG) benchmark, proposed in 2013, has drawn increasing attention from both academia and industry. Unlike the High Performance Linpack (HPL) benchmark, which has a very high computation-to-communication ratio, HPCG contains both neighboring and global communication, either of which may severely degrade parallel performance. To reduce the overhead of neighboring communication, we overlap halo updates with halo-independent computations. To hide the cost of the global reductions in vector dot products, we use two reformulated CG algorithms: Gropp's asynchronous CG and the pipelined CG. Further optimizations reduce the extra overhead that the reformulated CG algorithms introduce. Experiments on Tianhe-2, the world's largest heterogeneous system, show that the optimized HPCG code scales to 256 nodes (49,920 cores) with a nearly ideal weak-scaling efficiency of over 90% and an aggregate performance of 10.51 Tflops.
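As an illustration of the reformulation idea mentioned in the abstract, the following is a minimal serial sketch of the unpreconditioned pipelined CG recurrence from reference [2]. It is not the authors' code. The 1D Laplacian operator and the helper names (`matvec`, `dot`, `axpy`, `pipelined_cg`) are assumptions chosen for brevity, and HPCG itself uses a 27-point stencil with a multigrid preconditioner. The point to notice is that both dot products of an iteration are computed before the matrix-vector product `q = A w`, and that product does not depend on their results. In a distributed code the two reductions could therefore be started as one non-blocking all-reduce and completed after the product; this serial version only shows the data dependencies and overlaps nothing.

```python
def matvec(x):
    """Apply the 1D Laplacian stencil [-1, 2, -1] (illustrative stand-in operator)."""
    n = len(x)
    return [
        2.0 * x[i]
        - (x[i - 1] if i > 0 else 0.0)
        - (x[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


def dot(u, v):
    """Local dot product; a global reduction in a distributed setting."""
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Return alpha * x + y as a new list."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def pipelined_cg(b, tol=1e-8, max_iter=500):
    """Solve A x = b with unpreconditioned pipelined CG, starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    r = list(b)          # r = b - A x with x = 0
    w = matvec(r)        # w = A r
    z = [0.0] * n
    s = [0.0] * n
    p = [0.0] * n
    gamma_old = alpha_old = None

    for it in range(max_iter):
        # Both reductions are issued together ...
        gamma = dot(r, r)
        delta = dot(w, r)
        if gamma ** 0.5 <= tol:
            return x, it
        # ... and this product needs neither result, so it could overlap them.
        q = matvec(w)

        if it == 0:
            beta = 0.0
            alpha = gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha_old)

        z = axpy(beta, z, q)    # z = A s
        s = axpy(beta, s, w)    # s = A p
        p = axpy(beta, p, r)    # search direction
        x = axpy(alpha, p, x)
        r = axpy(-alpha, s, r)
        w = axpy(-alpha, z, w)  # keeps w = A r without an extra product
        gamma_old, alpha_old = gamma, alpha

    return x, max_iter
```

Calling `pipelined_cg([1.0] * 32)` returns the approximate solution and the iteration count. Compared with standard CG, the recurrence carries three extra vectors (`z`, `s`, `w`) and their updates, which is the kind of added overhead the abstract says further optimizations are meant to reduce.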

Yiqun Liu | Fangfang Liu | Xianyi Zhang | Yutong Lu | Chao Yang

[1] Chao Yang, et al. Optimizing and Scaling HPCG on Tianhe-2: Early Experience, 2014, ICA3PP.

[2] Wim Vanroose, et al. Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm, 2014, Parallel Comput.

[3] Sandia Report, et al. Toward a New Metric for Ranking High Performance Computing Systems, 2013.

[4] Chao Yang, et al. Accelerating HPCG on Tianhe-2: A hybrid CPU-MIC algorithm, 2014, 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).

[5] Sandia Report, et al. HPCG Technical Specification, 2013.