Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing

In this paper, we describe our experience developing an implementation of the Linpack benchmark for TianHe-1, a petascale CPU/GPU supercomputer and the largest GPU-accelerated system attempted to date. We present an adaptive optimization framework that balances the workload across the GPUs and CPUs with negligible runtime overhead, yielding better performance than static or training-based partitioning methods. CPU-GPU communication overhead is effectively hidden by a software pipelining technique, which is particularly useful for large memory-bound applications. Combined with other traditional optimizations, our Linpack implementation, optimized with the adaptive framework, achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained with the vendor’s library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list released in November 2009.
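To illustrate the general idea of adaptive workload balancing, the following is a minimal sketch and not the framework described in this paper. It assumes a simple rule in which the GPU's share of the next step is set to its measured fraction of the combined CPU and GPU throughput. The function names (`update_gpu_fraction`, `simulate`) and the fixed device rates are hypothetical, chosen only for this example.

```python
def update_gpu_fraction(gpu_work, gpu_time, cpu_work, cpu_time):
    """Return the GPU's share of the next step from measured throughput.

    Assumption for this sketch: both devices did some work in the last
    step, so both times are positive.
    """
    gpu_rate = gpu_work / gpu_time
    cpu_rate = cpu_work / cpu_time
    return gpu_rate / (gpu_rate + cpu_rate)


def simulate(total_work, true_gpu_rate, true_cpu_rate, steps=5, gpu_fraction=0.5):
    """Simulate repeated steps with hypothetical constant device rates.

    Returns a list of (gpu_fraction, step_time) pairs, where step_time is
    the time of the slower device, since both must finish before the next step.
    """
    history = []
    for _ in range(steps):
        # Split the work according to the current ratio.
        gpu_work = total_work * gpu_fraction
        cpu_work = total_work - gpu_work
        # "Measure" how long each device takes (simulated here).
        gpu_time = gpu_work / true_gpu_rate
        cpu_time = cpu_work / true_cpu_rate
        history.append((gpu_fraction, max(gpu_time, cpu_time)))
        # Adapt the split for the next step from the measurements.
        gpu_fraction = update_gpu_fraction(gpu_work, gpu_time, cpu_work, cpu_time)
    return history


if __name__ == "__main__":
    for fraction, step_time in simulate(1000, true_gpu_rate=4.0, true_cpu_rate=1.0):
        print(f"gpu_fraction={fraction:.2f} step_time={step_time:.1f}")
```

With constant rates the split settles after a single measurement. On real hardware the rates vary with problem size, which is why a ratio updated at runtime can outperform a fixed one.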
