Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures

Fengguang Song
University of Tennessee, EECS Department, Knoxville, TN, USA
song@eecs.utk.edu

Stanimire Tomov
University of Tennessee, EECS Department, Knoxville, TN, USA
tomov@eecs.utk.edu

Jack Dongarra
University of Tennessee; Oak Ridge National Laboratory; University of Manchester
dongarra@eecs.utk.edu

ABSTRACT

We present a new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations efficiently. Our approach is able to achieve the objectives of a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our main idea is to treat the heterogeneous system as a distributed-memory machine, and to use a heterogeneous 1-D block cyclic distribution to allocate data to the host system and GPUs to minimize communication. We have designed heterogeneous algorithms with two different tile sizes (one for CPU cores and the other for GPUs) to cope with processor heterogeneity. We propose an auto-tuning method to determine the best tile sizes to attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our experiments on a compute node with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs demonstrate good weak scalability, strong scalability, load balance, and efficiency of our approach.

INTRODUCTION

As the performance of both multicore CPU and GPU continues to scale at a Moore's law rate, it is becoming pervasive to use heterogeneous multicore and multi-GPU architectures to attain the highest performance possible from a single compute node. Before making parallel programs run efficiently on a distributed-memory system, it is critical to achieve high performance on a single node first.
However, the heterogeneity in the multi-core and multi-GPU architecture has introduced new challenges to algorithm design and system software.

Over the last few years, our colleagues at the University of Tennessee have developed the PLASMA library [2] to solve linear algebra problems on multicore architectures. In parallel with PLASMA, we have also developed another library called MAGMA [27] to solve linear algebra problems on GPUs. While PLASMA and MAGMA aim to provide the same routines as LAPACK [4], the former targets multicore CPUs and the latter targets a single core with an attached GPU. Our goal is to utilize all cores and all GPUs efficiently on a single multicore and multi-GPU system to support matrix computations.

∗ This material is based upon work supported by the NSF grants CCF-0811642, OCI-0910735, by the DOE grant DE-FC02-06ER25761, and by Microsoft Research.

Figure 1: An example of a heterogeneous multi-core and multi-GPU system. The host system is connected to four GPUs via two PCI Express connections. The host system and the GPUs have separate memory spaces. (Diagram labels: Multicore Host System, Host Memory, PCIe Interface, GPU Switch, GPU, Device Memory.)

Figure 1 shows the architecture of the heterogeneous multicore and multi-GPU system we are considering. The multicore host system is connected to four GPUs via two PCI Express connections, and each pair of GPUs shares a GPU switch.
To design new software for this type of heterogeneous architecture, we must consider the following special features: (1) the host and the GPUs have different memory spaces, and an explicit memory copy is required to transfer data between the host and a GPU; (2) the system also differs from a distributed-memory machine, since each GPU is actually controlled by a thread running on the host (more like pthreads on a shared-memory machine); (3) the processors are heterogeneous, since CPUs and GPUs differ; (4) GPUs are optimized for throughput and expect a larger input size than CPUs, which are optimized for latency [24]; (5) as the performance gap between a GPU and its PCI Express interconnection to the host becomes larger, the network eventually becomes the bottleneck for the entire system. In this work, we take all these factors into account and strive to meet the following objectives in order to obtain high performance: a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. We propose to design new heterogeneous algorithms and to use a simple but practical static data distribution to achieve these objectives simultaneously.

This paper describes heterogeneous rectangular tile algorithms with hybrid tile sizes, a heterogeneous 1-D block cyclic data distribution, a new runtime system, and an auto-tuning method to determine the hybrid tile sizes. The rectangular tile algorithms build upon the previous tile algorithms, which divide a matrix into square tiles and exhibit a high degree of parallelism and minimized synchronizations [13, 14].
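
To make the idea of a static heterogeneous 1-D block cyclic distribution with two tile widths more concrete, the following Python sketch shows one possible way such a layout could be computed. It is an illustration under stated assumptions, not the scheme used in this paper: the function names (`hetero_block_cyclic`, `columns_per_owner`), the parameters (`big`, `small`, `host_panels`), and the rule that each block of columns contributes a few narrow panels to the host and one wide panel to a GPU chosen round-robin are all hypothetical.

```python
from typing import Dict, List, Tuple

Panel = Tuple[int, int, str]  # (first column, width, owner)


def hetero_block_cyclic(n_cols: int, big: int, small: int,
                        host_panels: int, num_gpus: int) -> List[Panel]:
    """Split n_cols matrix columns into panels and assign an owner to each.

    Hypothetical scheme (an assumption, not taken from the paper): every
    block of `big` columns is cut into `host_panels` narrow panels of width
    `small` for the host, and the remaining columns of the block form one
    wide panel for a GPU. GPUs are chosen cyclically, block by block.
    """
    if host_panels * small >= big:
        raise ValueError("host panels must leave room for a GPU panel")
    if num_gpus < 1:
        raise ValueError("need at least one GPU")

    panels: List[Panel] = []
    col = 0
    block = 0
    while col < n_cols:
        block_end = min(col + big, n_cols)
        # Narrow panels at the front of the block stay in host memory.
        for _ in range(host_panels):
            if col >= block_end:
                break
            width = min(small, block_end - col)
            panels.append((col, width, "host"))
            col += width
        # The rest of the block goes to one GPU, picked round-robin.
        if col < block_end:
            panels.append((col, block_end - col, f"gpu{block % num_gpus}"))
            col = block_end
        block += 1
    return panels


def columns_per_owner(panels: List[Panel]) -> Dict[str, int]:
    """Total number of columns held by each owner (a crude load measure)."""
    totals: Dict[str, int] = {}
    for _, width, owner in panels:
        totals[owner] = totals.get(owner, 0) + width
    return totals


if __name__ == "__main__":
    layout = hetero_block_cyclic(n_cols=48, big=12, small=2,
                                 host_panels=2, num_gpus=3)
    for panel in layout:
        print(panel)
    print(columns_per_owner(layout))
```

Because the layout is fixed up front, each panel has a single owner and no data needs to migrate during the computation. In this sketch the ratio between `host_panels * small` and `big` controls how much work the host receives relative to the GPUs, which is the kind of parameter an auto-tuner would be expected to choose.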

[2] Francisco D. Igual, et al. Retargeting PLAPACK to Clusters with Hardware Accelerators, FLAME Working Note #42, 2010.

[3] Jack Dongarra, et al. ScaLAPACK Users' Guide, 1987.

[4] Yves Robert, et al. A Proposal for a Heterogeneous Cluster ScaLAPACK (Dense Linear Solvers), 2001, IEEE Trans. Computers.

[5] Jack J. Dongarra, et al. Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[6] James Demmel, et al. Communication-optimal Parallel and Sequential Cholesky Decomposition, 2009, SIAM J. Sci. Comput.

[7] Julien Langou, et al. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures, 2007, Parallel Comput.

[8] Zhiling Lan, et al. A novel dynamic load balancing scheme for parallel systems, 2002, J. Parallel Distributed Comput.

[9] Emmanuel Agullo, et al. Comparative study of one-sided factorizations with multiple software packages on multi-core hardware, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[10] Eric J. Kelmelis, et al. CULA: hybrid GPU accelerated linear algebra routines, 2010, Defense + Commercial Sensing.

[11] James Demmel, et al. Communication-Avoiding QR Decomposition for GPUs, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[12] Laxmikant V. Kalé, et al. Scaling Hierarchical N-body Simulations on GPU Clusters, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[13] Dror Irony, et al. Communication lower bounds for distributed-memory matrix multiplication, 2004, J. Parallel Distributed Comput.

[14] Jack Dongarra, et al. The Design and Implementation of the Parallel Out-of-core ScaLAPACK LU, QR, and Cholesky Factorization Routines, 1997.

[15] James Demmel, et al. Communication Avoiding Gaussian elimination, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[16] James Demmel, et al. Communication-optimal Parallel and Sequential QR and LU Factorizations, 2008, SIAM J. Sci. Comput.

[17] Julien Langou, et al. Parallel tiled QR factorization for multicore architectures, 2007, Concurr. Comput. Pract. Exp.

[18] Emmanuel Agullo, et al. QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[19] Julien Langou, et al. The Impact of Multicore on Math Software, 2006, PARA.

[20] Jack J. Dongarra, et al. The design and implementation of the parallel out-of-core ScaLAPACK LU, QR, and Cholesky factorization routines, 2000, Concurr. Pract. Exp.

[21] Eduard Ayguadé, et al. An Extension of the StarSs Programming Model for Platforms with Multiple GPUs, 2009, Euro-Par.

[22] Yves Robert, et al. Static tiling for heterogeneous computing platforms, 1999, Parallel Comput.

[23] Ed Anderson, et al. LAPACK Users' Guide, 1995.

[24] Samuel Williams, et al. The Landscape of Parallel Computing Research: A View from Berkeley, 2006.

[25] William J. Dally, et al. The GPU Computing Era, 2010, IEEE Micro.

[26] Alexey L. Lastovetsky, et al. Data distribution for dense factorization on computers with memory heterogeneity, 2007, Parallel Comput.

[27] Robert A. van de Geijn, et al. Retargeting PLAPACK to clusters with hardware accelerators, 2010, 2010 International Conference on High Performance Computing & Simulation.