Reducing Communication Overhead in the High Performance Conjugate Gradient Benchmark on Tianhe-2

The High Performance Conjugate Gradient (HPCG) benchmark, proposed in 2013, has drawn increasing attention from both academia and industry. Unlike the High Performance Linpack (HPL) benchmark, which has a very high computation-to-communication ratio, HPCG contains both neighboring and global communication that may severely degrade parallel performance. To reduce the overhead of neighboring communication, we overlap halo updates with halo-independent computations. To hide the cost of the global reductions in vector dot products, we make use of two reformulated CG algorithms, namely Gropp's asynchronous CG and the pipelined CG. Further optimizations are applied to decrease the extra overhead introduced by the reformulated CG algorithms. We show by experiments on Tianhe-2, the world's largest heterogeneous system, that the optimized HPCG code scales to 256 nodes (49,920 cores) with a nearly ideal weak-scaling efficiency of over 90% and an aggregate performance of 10.51 Tflops.
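To illustrate how a reformulated CG can hide the global reductions, the following is a minimal serial sketch of the unpreconditioned pipelined CG recurrence as it is commonly stated in the literature. It is not the optimized Tianhe-2 code: the 1D stencil operator, the function names (laplacian_matvec, dot, pipelined_cg), and the tolerance are assumptions chosen for illustration, and HPCG itself uses a preconditioned solver on a 3D 27-point operator. The point of the sketch is that the two dot products in each iteration do not depend on the matrix-vector product issued right after them, so in a distributed setting a non-blocking reduction could run while that product is computed.

```python
import math

def laplacian_matvec(v):
    """Apply the 1D stencil [-1, 2, -1] (assumed stand-in for HPCG's operator)."""
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * v[i] - left - right)
    return out

def dot(u, v):
    """Local dot product; in parallel this is where a global reduction occurs."""
    return sum(a * b for a, b in zip(u, v))

def pipelined_cg(matvec, b, tol=1e-10, max_iter=200):
    """Unpreconditioned pipelined CG; returns (solution, iterations used)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)              # r0 = b - A*x0 with x0 = 0
    w = matvec(r)            # w0 = A*r0
    p = [0.0] * n
    s = [0.0] * n
    z = [0.0] * n
    gamma_old = alpha_old = 0.0   # not used in the first iteration
    b_norm = math.sqrt(dot(b, b))
    for i in range(max_iter):
        # Both reductions are started here ...
        gamma = dot(r, r)
        delta = dot(w, r)
        # ... and this matvec does not depend on them, so it can overlap
        # with the reductions when they are non-blocking.
        q = matvec(w)
        if math.sqrt(gamma) <= tol * b_norm:
            return x, i
        if i == 0:
            beta = 0.0
            alpha = gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha_old)
        # Extra vector recurrences: the added cost of the reformulation.
        z = [qi + beta * zi for qi, zi in zip(q, z)]
        s = [wi + beta * si for wi, si in zip(w, s)]
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        w = [wi - alpha * zi for wi, zi in zip(w, z)]
        gamma_old, alpha_old = gamma, alpha
    return x, max_iter

if __name__ == "__main__":
    rhs = [1.0] * 20
    solution, iterations = pipelined_cg(laplacian_matvec, rhs)
    residual = [bi - ai for bi, ai in zip(rhs, laplacian_matvec(solution))]
    print(iterations, math.sqrt(dot(residual, residual)))
```

Compared with standard CG, this variant carries three additional vector updates per iteration (for z, s, and w), which is the kind of extra overhead the further optimizations mentioned above aim to reduce.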