Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs

Graphics processing units (GPUs) have brought large performance improvements to scientific and numerical computing. We present an efficient hybrid CPU/GPU approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids the data transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before execution, and then dynamically balances the workload during execution. We then present a theoretical model to guide the choice of the initial amount of work for the CPUs. Validating this model allows our approach to self-adapt to any architecture using the manufacturer's characteristics of the underlying machine. We illustrate our method on the LU factorization. For this case, we show that our approach, combined with a communication-avoiding LU algorithm, is efficient. For example, our experiments on a 24-core AMD Opteron 6172 show that adding one GPU (Tesla S2050) accelerates LU by up to 2.4× compared to the corresponding routine in MKL using 24 cores. Comparisons with MAGMA also show significant improvements.
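
The combination of a static initial split with dynamic balancing can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the model or scheduler from the paper: the proportional rule in `initial_cpu_share`, the unit-cost tasks, and the steal-from-the-tail policy in `simulate` are all hypothetical choices made for illustration, and the rates stand in for nominal manufacturer throughput figures.

```python
from collections import deque


def initial_cpu_share(n_tasks, cpu_rate, gpu_rate):
    """Tasks to give the CPUs up front (hypothetical proportional rule).

    Assumes both devices run at their nominal rates, so the split that
    makes them finish together is proportional to those rates.
    """
    if cpu_rate <= 0 or gpu_rate <= 0:
        raise ValueError("rates must be positive")
    return round(n_tasks * cpu_rate / (cpu_rate + gpu_rate))


def simulate(n_tasks, cpu_rate, gpu_rate, cpu_initial):
    """Simulate unit-cost tasks with a static split plus work stealing.

    Each device drains its own queue from the front. A device whose
    queue is empty takes a task from the tail of the other queue.
    Returns (makespan, tasks completed per device).
    """
    queues = {
        "cpu": deque(range(cpu_initial)),
        "gpu": deque(range(cpu_initial, n_tasks)),
    }
    rate = {"cpu": cpu_rate, "gpu": gpu_rate}
    clock = {"cpu": 0.0, "gpu": 0.0}
    done = {"cpu": 0, "gpu": 0}
    while queues["cpu"] or queues["gpu"]:
        dev = min(clock, key=clock.get)      # device that becomes idle first
        other = "gpu" if dev == "cpu" else "cpu"
        if queues[dev]:
            queues[dev].popleft()            # own work first
        else:
            queues[other].pop()              # otherwise steal from the tail
        clock[dev] += 1.0 / rate[dev]
        done[dev] += 1
    return max(clock.values()), done


if __name__ == "__main__":
    n, cpu, gpu = 100, 1.0, 4.0              # illustrative rates, not measurements
    good = initial_cpu_share(n, cpu, gpu)
    print("model-guided split:", simulate(n, cpu, gpu, good))
    print("poor split, rebalanced:", simulate(n, cpu, gpu, 50))
```

In this toy setting a well-chosen initial split needs no rebalancing at all, while a poor split is corrected at run time at the cost of extra task migration, which is why choosing the initial CPU share carefully still matters when data movement is expensive.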
