A Sparse Direct Solver for Distributed Memory Xeon Phi-Accelerated Systems

This paper presents the first sparse direct solver for distributed memory systems comprising hybrid multicore CPU and Intel Xeon Phi co-processors. It builds on the algorithmic approach of SuperLU_DIST, which is right-looking and statically pivoted. Our contribution is a novel algorithm called HALO. The name is shorthand for highly asynchronous lazy offload; it refers to the way the algorithm combines highly aggressive use of asynchrony with accelerated offload, lazy updates, and data shadowing (à la halo or ghost zones), all of which serve to hide and reduce communication, whether to local memory, across the network, or over PCIe. We further augment HALO with a model-driven autotuning heuristic that chooses the intra-node division of labor among CPU and Xeon Phi co-processor components. When integrated into SuperLU_DIST and evaluated on a variety of realistic test problems in both single-node and multi-node configurations, the resulting implementation achieves speedups of up to 2.5× over an already efficient multicore CPU implementation, and achieves up to 83% of a machine-specific upper bound that we have estimated. Our analysis quantifies how well our implementation performs and allows us to speculate on the potential speedups that might come from a variety of future improvements to the algorithm and system.
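
To make the idea of a model-driven division of labor concrete, the following is a minimal sketch, not the heuristic used in the paper. It assumes a simple cost model in which the CPU and the co-processor each sustain a fixed flop rate, offloaded work must first cross PCIe at a fixed bandwidth, and transfer does not overlap with co-processor compute. All names and parameters (`offload_fraction`, `predicted_time`, `cpu_rate`, `phi_rate`, `pcie_bw`) are hypothetical.

```python
def offload_fraction(flops, bytes_moved, cpu_rate, phi_rate, pcie_bw):
    """Fraction of work to offload so both sides finish at the same time.

    Assumed model (illustrative only):
      CPU time          = (1 - f) * flops / cpu_rate
      co-processor time = f * (flops / phi_rate + bytes_moved / pcie_bw)
    Setting the two equal and solving for f gives the expression below.
    """
    t_cpu_all = flops / cpu_rate                           # CPU does everything
    t_phi_all = flops / phi_rate + bytes_moved / pcie_bw   # co-processor does everything
    return t_cpu_all / (t_cpu_all + t_phi_all)


def predicted_time(f, flops, bytes_moved, cpu_rate, phi_rate, pcie_bw):
    """Predicted step time under the same model: the slower side dominates."""
    t_cpu = (1.0 - f) * flops / cpu_rate
    t_phi = f * (flops / phi_rate + bytes_moved / pcie_bw)
    return max(t_cpu, t_phi)


if __name__ == "__main__":
    # Made-up numbers chosen only to keep the arithmetic easy to follow.
    f = offload_fraction(flops=1200, bytes_moved=4, cpu_rate=100,
                         phi_rate=300, pcie_bw=1)
    t = predicted_time(f, flops=1200, bytes_moved=4, cpu_rate=100,
                       phi_rate=300, pcie_bw=1)
    print(f"offload fraction = {f:.2f}, predicted time = {t:.2f}")
```

Under this model, raising the transfer cost lowers the offloaded fraction, which is the qualitative behavior one would expect from any heuristic that accounts for PCIe traffic. A real tuner would also need to account for overlap of transfer and compute, which this sketch deliberately ignores.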
