Paper information - Efficient heterogeneous execution on large multicore and accelerator platforms: Case study using a block tridiagonal solver

This paper explores the algorithmic and implementation principles for gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluates them in an implementation of a scalable block tridiagonal solver. Within each compute node, the accelerator is used in combination with that node's multicore processors to perform the block-level linear algebra operations of the overall distributed solver algorithm. The optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to balance load with the multicore processors, and (3) an automatic memory management system that uses accelerator memory efficiently when sub-matrices exceed the limits of device memory. Results are reported from our novel implementation, which uses the MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on processors. Overall, using 940 NVIDIA Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation-communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.
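To make "block-level linear algebra operations" concrete, the following is a minimal serial sketch of solving a block tridiagonal system with the block Thomas algorithm in NumPy. It is an illustration only and is not the distributed, accelerator-based solver described in the abstract; the function name `block_thomas_solve`, its argument layout, and the choice of the block Thomas algorithm are assumptions made for this example.

```python
import numpy as np

def block_thomas_solve(lower, diag, upper, rhs):
    """Solve a block tridiagonal system (hypothetical helper, serial sketch).

    Block row i reads: lower[i] @ x[i-1] + diag[i] @ x[i] + upper[i] @ x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored.
    """
    n = len(diag)
    d = [diag[0].copy()]   # modified diagonal blocks
    b = [rhs[0].copy()]    # modified right-hand sides
    # Forward elimination: remove the sub-diagonal blocks.
    for i in range(1, n):
        # w = lower[i] @ inv(d[i-1]), obtained from a transposed solve
        w = np.linalg.solve(d[i - 1].T, lower[i].T).T
        d.append(diag[i] - w @ upper[i - 1])
        b.append(rhs[i] - w @ b[i - 1])
    # Back substitution, from the last block row upward.
    x = [None] * n
    x[-1] = np.linalg.solve(d[-1], b[-1])
    for i in range(n - 2, -1, -1):
        x[i] = np.linalg.solve(d[i], b[i] - upper[i] @ x[i + 1])
    return x
```

Each step reduces to dense matrix products and solves on individual blocks. In a heterogeneous setting, operations of this kind are the ones that would be dispatched to either an accelerator BLAS library or a multithreaded CPU library.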

Kalyan S. Perumalla | Alfred Park

[1] David A. Bader, et al. A Waterfall Model to Achieve Energy Efficient Tasks Mapping for Large Scale GPU Clusters, 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[2] Kunle Olukotun, et al. Efficient Parallel Graph Exploration on Multi-Core CPU and GPU, 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[3] Sudip K. Seal, et al. Efficient simulation of agent-based models on multi-GPU and multi-core clusters, 2010, SimuTools.

[4] Przemyslaw Stpiczynski, et al. Solving a kind of BVP for ODEs on heterogeneous CPU + CUDA-enabled GPU systems, 2010, Proceedings of the International Multiconference on Computer Science and Information Technology.

[5] Yao Zhang, et al. Fast tridiagonal solvers on the GPU, 2010, PPoPP '10.

[6] Jean-François Méhaut, et al. Density functional theory calculation on many-cores hybrid central processing unit-graphic processing unit architectures, 2009, The Journal of chemical physics.

[7] Ying Gao, et al. Optimizing a shared virtual memory system for a heterogeneous CPU-accelerator platform, 2011, OPSR.

[8] J. Xu. OpenCL – The Open Standard for Parallel Programming of Heterogeneous Systems, 2009.

[9] Avi Mendelson, et al. Programming model for a heterogeneous x86 platform, 2009, PLDI '09.

[10] Roger L. Davis, et al. GPGPU parallel algorithms for structured-grid CFD codes, 2011.

[11] Jack Dongarra, et al. A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs, 2012.

[12] Robert A. van de Geijn, et al. High-performance implementation of the level-3 BLAS, 2008, TOMS.

[13] Hamid Laga, et al. CUDA (Computer Unified Device Architecture), 2009.

[14] John E. Stone, et al. An asymmetric distributed shared memory model for heterogeneous parallel systems, 2010, ASPLOS XV.

[15] Hee-Seok Kim, et al. A Scalable Tridiagonal Solver for GPUs, 2011, 2011 International Conference on Parallel Processing.

[16] Eduardo F. D'Azevedo, et al. Parallel LU Factorization on GPU Cluster, 2012, ICCS.

[17] S. Hirshman, et al. SIESTA: A scalable iterative equilibrium solver for toroidal applications, 2011.

[18] Diego Rossinelli, et al. Multicore/Multi-GPU Accelerated Simulations of Multiphase Compressible Flows Using Wavelet Adapted Grids, 2011, SIAM J. Sci. Comput.

[19] Vickie E. Lynch, et al. BCYCLIC: A parallel block tridiagonal matrix cyclic solver, 2010, J. Comput. Phys.

[20] Jack Dongarra, et al. A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures, 2011, 2011 Symposium on Application Accelerators in High-Performance Computing.

[21] Leonel Sousa, et al. Hierarchical Partitioning Algorithm for Scientific Computing on Highly Heterogeneous CPU + GPU Clusters, 2012, Euro-Par.

[22] Jack Dongarra, et al. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects, 2009.

[23] Robert Strzodka, et al. Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed-Precision Multigrid, 2011, IEEE Transactions on Parallel and Distributed Systems.

[24] Tetsuro Nishino, et al. Realtime 3D profilometer using GPU and multicore CPU, 2011.

[25] Jeff R. Hammond, et al. Quantum Chemical Many-Body Theory on Heterogeneous Nodes, 2011, 2011 Symposium on Application Accelerators in High-Performance Computing.

[26] Yao Zhang, et al. An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[27] Emmanuel Agullo, et al. LU factorization for accelerator-based systems, 2011, 2011 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA).

[28] Bo Li, et al. memCUDA: Map Device Memory to Host Memory on GPGPU Platform, 2010, NPC.

[29] Héctor Migallón Gomis, et al. GPU-based parallel algorithms for sparse nonlinear systems, 2012, J. Parallel Distributed Comput.

[30] David M. Nicol, et al. Acceleration of wireless channel simulation using GPUs, 2010, 2010 European Wireless Conference (EW).