Efficient heterogeneous execution on large multicore and accelerator platforms: Case study using a block tridiagonal solver

We explore the algorithmic and implementation principles for gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluate them in an implementation of a scalable block tridiagonal solver. Within the overall distributed solver algorithm, the accelerator of each compute node is used in combination with the multicore processors of that node to perform block-level linear algebra operations. The optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to balance load with the multicore processors, and (3) an automatic memory management system that uses accelerator memory efficiently when sub-matrices exceed the capacity of device memory. Results are reported from our novel implementation, which uses the MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on processors. Overall, using 940 NVIDIA Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to a parallel multicore-only baseline implementation that is already highly optimized with intra-node and inter-node concurrency and computation-communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.
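To make optimization (3) concrete, the following is a minimal CPU-only sketch of processing a block-level matrix product in tiles so that the working set stays under a fixed memory budget. It is an illustration under stated assumptions, not the memory manager described in the paper: the names `pick_tile`, `tiled_matmul`, and `budget_elems` are hypothetical, the "device" is simulated with ordinary Python lists, and a real implementation would issue host-to-device copies and GPU BLAS calls where the comments indicate.

```python
import math

def pick_tile(budget_elems, n):
    """Largest tile edge t (at most n) such that three t-by-t tiles fit in the budget."""
    t = math.isqrt(budget_elems // 3)
    if t < 1:
        raise ValueError("budget too small for even a 1x1 tile")
    return min(t, n)

def tiled_matmul(A, B, budget_elems):
    """Compute C = A * B for square n-by-n lists of lists, one tile triple at a time.

    Returns (C, peak), where peak is the largest number of matrix elements
    that were "resident" simultaneously (one tile each of A, B, and C).
    """
    n = len(A)
    t = pick_tile(budget_elems, n)
    C = [[0.0] * n for _ in range(n)]
    peak = 0
    for i0 in range(0, n, t):
        for j0 in range(0, n, t):
            for k0 in range(0, n, t):
                i1, j1, k1 = min(i0 + t, n), min(j0 + t, n), min(k0 + t, n)
                # Stand-in for copying one tile of A and one tile of B to the device.
                a = [row[k0:k1] for row in A[i0:i1]]
                b = [row[j0:j1] for row in B[k0:k1]]
                resident = (i1 - i0) * (k1 - k0) + (k1 - k0) * (j1 - j0) + (i1 - i0) * (j1 - j0)
                peak = max(peak, resident)
                # Stand-in for a device GEMM on the tiles, accumulating into C.
                for i in range(i1 - i0):
                    for j in range(j1 - j0):
                        s = 0.0
                        for k in range(k1 - k0):
                            s += a[i][k] * b[k][j]
                        C[i0 + i][j0 + j] += s
    return C, peak

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
C, peak = tiled_matmul(A, B, budget_elems=12)  # budget admits 2x2 tiles
```

With a budget of 12 elements the sketch selects 2-by-2 tiles, so no more than 12 elements are resident at once even though the full operands hold 27. A larger budget simply yields larger tiles and fewer transfers.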

[1] David A. Bader, et al. A Waterfall Model to Achieve Energy Efficient Tasks Mapping for Large Scale GPU Clusters. 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, 2011.

[2] Kunle Olukotun, et al. Efficient Parallel Graph Exploration on Multi-Core CPU and GPU. 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011.

[3] Sudip K. Seal, et al. Efficient simulation of agent-based models on multi-GPU and multi-core clusters. SimuTools, 2010.

[4] Przemyslaw Stpiczynski, et al. Solving a kind of BVP for ODEs on heterogeneous CPU + CUDA-enabled GPU systems. Proceedings of the International Multiconference on Computer Science and Information Technology, 2010.

[5] Yao Zhang, et al. Fast tridiagonal solvers on the GPU. PPoPP '10, 2010.

[6] Jean-François Méhaut, et al. Density functional theory calculation on many-cores hybrid central processing unit-graphic processing unit architectures. The Journal of Chemical Physics, 2009.

[7] Ying Gao, et al. Optimizing a shared virtual memory system for a heterogeneous CPU-accelerator platform. OPSR, 2011.

[8] J. Xu. OpenCL – The Open Standard for Parallel Programming of Heterogeneous Systems. 2009.

[9] Avi Mendelson, et al. Programming model for a heterogeneous x86 platform. PLDI '09, 2009.

[10] Roger L. Davis, et al. GPGPU parallel algorithms for structured-grid CFD codes. 2011.

[11] Jack Dongarra, et al. A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs. 2012.

[12] Robert A. van de Geijn, et al. High-performance implementation of the level-3 BLAS. TOMS, 2008.

[13] Hamid Laga, et al. CUDA (Computer Unified Device Architecture). 2009.

[14] John E. Stone, et al. An asymmetric distributed shared memory model for heterogeneous parallel systems. ASPLOS XV, 2010.

[15] Hee-Seok Kim, et al. A Scalable Tridiagonal Solver for GPUs. 2011 International Conference on Parallel Processing, 2011.

[16] Eduardo F. D'Azevedo, et al. Parallel LU Factorization on GPU Cluster. ICCS, 2012.

[17] S. Hirshman, et al. SIESTA: A scalable iterative equilibrium solver for toroidal applications. 2011.

[18] Diego Rossinelli, et al. Multicore/Multi-GPU Accelerated Simulations of Multiphase Compressible Flows Using Wavelet Adapted Grids. SIAM J. Sci. Comput., 2011.

[19] Vickie E. Lynch, et al. BCYCLIC: A parallel block tridiagonal matrix cyclic solver. J. Comput. Phys., 2010.

[20] Jack Dongarra, et al. A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures. 2011 Symposium on Application Accelerators in High-Performance Computing, 2011.

[21] Leonel Sousa, et al. Hierarchical Partitioning Algorithm for Scientific Computing on Highly Heterogeneous CPU + GPU Clusters. Euro-Par, 2012.

[22] Jack Dongarra, et al. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. 2009.

[23] Robert Strzodka, et al. Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed-Precision Multigrid. IEEE Transactions on Parallel and Distributed Systems, 2011.

[24] Tetsuro Nishino, et al. Realtime 3D profilometer using GPU and multicore CPU. 2011.

[25] Jeff R. Hammond, et al. Quantum Chemical Many-Body Theory on Heterogeneous Nodes. 2011 Symposium on Application Accelerators in High-Performance Computing, 2011.

[26] Yao Zhang, et al. An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU. 2011 IEEE International Parallel & Distributed Processing Symposium, 2011.

[27] Emmanuel Agullo, et al. LU factorization for accelerator-based systems. 2011 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), 2011.

[28] Bo Li, et al. memCUDA: Map Device Memory to Host Memory on GPGPU Platform. NPC, 2010.

[29] Héctor Migallón Gomis, et al. GPU-based parallel algorithms for sparse nonlinear systems. J. Parallel Distributed Comput., 2012.

[30] David M. Nicol, et al. Acceleration of wireless channel simulation using GPUs. 2010 European Wireless Conference (EW), 2010.