Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems

There is a widening performance gap between GPUs and the other components (CPU, PCIe bus, and communication network) of heterogeneous parallel systems. This gap forces us to orchestrate cooperative execution among these components far more carefully than before. Taking the LINPACK benchmark as a case study, this article proposes a fine-grained pipelining algorithm for large-scale CPU-GPU heterogeneous cluster systems. First, we build an algorithmic model that reveals a new approach to GPU-centric, fine-grained pipelining algorithm design. Then, we present four model-driven pipelining algorithms that incrementally squeeze bubbles out of the pipeline so that it is occupied by more useful floating-point calculations. The algorithms are implemented on both AMD and NVIDIA GPU platforms. The final optimized LINPACK program achieves 107 PFlops on 25,600 GPUs (70 percent floating-point efficiency). We draw several insights that suggest trade-offs in algorithm design, programming support, and architecture design.
