LU factorization on heterogeneous systems: an energy-efficient approach towards high performance

Dense lower–upper (LU) factorization (hereafter referred to as LU) is a critical kernel that is widely used to solve dense linear algebra problems. Hybrid LU algorithms have been carefully designed to exploit the full capacity of heterogeneous systems. However, existing heterogeneous implementations are typically CPU-centric: they rely heavily on CPU cores and incur a large volume of data transfers over the PCIe bus, which reduces the overall energy efficiency of the entire computer system. In this paper, we present a coprocessor-resident implementation of LU for a heterogeneous platform that improves energy efficiency by relieving the CPUs of heavy computational loads and avoiding excessive data transfers over PCIe. To maintain performance, we apply optimizations that pipeline the CPU computation, coprocessor computation, MPI communication, and PCIe transfers between the CPUs and coprocessors. Experiments on the Tianhe-2 supercomputer show that our LU implementation matches the performance of the highly optimized Intel MKL implementation while overcoming its energy-efficiency limitations.
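The pipelining idea described above can be illustrated, in a much-simplified form, by a blocked LU factorization with one-step look-ahead: the next panel is kept on the critical path and factored in a worker thread while the rest of the trailing matrix is updated concurrently. This NumPy sketch is illustrative only; the function names and thread-based overlap are assumptions of this example, not the authors' implementation, which additionally overlaps MPI communication and PCIe transfers and uses partial pivoting.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def unblocked_lu(A):
    """In-place unblocked LU (no pivoting) of an m x b panel."""
    m, n = A.shape
    for j in range(min(m, n)):
        A[j + 1:, j] /= A[j, j]                       # scale the L column
        A[j + 1:, j + 1:] -= np.outer(A[j + 1:, j], A[j, j + 1:])

def blocked_lu_lookahead(A, b=64):
    """Blocked right-looking LU with one-step look-ahead: the next
    panel is updated and factored while the remainder of the trailing
    matrix is updated concurrently (a simplified stand-in for the
    CPU/coprocessor/transfer pipelining described in the abstract)."""
    n = A.shape[0]
    with ThreadPoolExecutor(max_workers=1) as pool:
        for k in range(0, n, b):
            e = min(k + b, n)
            if k == 0:
                unblocked_lu(A[:, :e])                # first panel, inline
            if e == n:
                break                                 # last panel already factored
            # Triangular solve for the U block row: U_k = L_kk^{-1} A_k
            Lkk = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(Lkk, A[k:e, e:])
            e2 = min(e + b, n)
            # Look-ahead: update and factor the NEXT panel first ...
            A[e:, e:e2] -= A[e:, k:e] @ A[k:e, e:e2]
            fut = pool.submit(unblocked_lu, A[e:, e:e2])
            # ... overlapped with the rest of the trailing update
            # (NumPy's BLAS calls release the GIL, so these can overlap)
            if e2 < n:
                A[e:, e2:] -= A[e:, k:e] @ A[k:e, e2:]
            fut.result()                              # join before the next step
```

Because pivoting is omitted, the sketch is only numerically safe for matrices such as diagonally dominant ones; a production implementation would factor the panel with partial pivoting and apply the row swaps to the trailing matrix.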
