Accelerating Sparse Cholesky Factorization on Sunway Manycore Architecture

To improve the performance of sparse Cholesky factorization, existing research groups adjacent columns of the sparse matrix that share the same nonzero pattern into supernodes for parallelization. However, because sparse matrices vary widely in structure, the computational workload of the generated supernodes varies significantly, which makes them hard to optimize when they are computed by dense matrix kernels. Therefore, how to efficiently map sparse Cholesky factorization onto emerging architectures, such as the Sunway manycore processor, remains an active research direction. In this article, we propose swCholesky, a highly optimized implementation of sparse Cholesky factorization on the Sunway processor. Specifically, we design three kernel task queues and a dense matrix library to dynamically adapt to the kernel characteristics and architecture features. In addition, we propose an auto-tuning mechanism to search for the optimal settings of the important parameters in swCholesky. Our experiments show that swCholesky achieves better performance than state-of-the-art implementations.
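The supernode grouping mentioned above can be illustrated with a minimal sketch. It is not the swCholesky implementation: the function name `find_supernodes` and the input layout (one sorted list of row indices per column of the factor, diagonal included) are assumptions made for illustration. A column joins the current supernode when its nonzero pattern equals the previous column's pattern with that column's diagonal entry removed.

```python
from typing import List, Tuple


def find_supernodes(col_patterns: List[List[int]]) -> List[Tuple[int, int]]:
    """Group adjacent columns with matching nonzero patterns.

    col_patterns[j] is assumed to hold the sorted row indices of the
    nonzeros in column j of the Cholesky factor L, diagonal included.
    Returns inclusive (first_column, last_column) ranges.
    """
    n = len(col_patterns)
    if n == 0:
        return []

    supernodes = []
    start = 0
    for j in range(1, n):
        # Pattern of the previous column strictly below its diagonal.
        prev_below = col_patterns[j - 1][1:]
        # A mismatch closes the current supernode and opens a new one.
        if col_patterns[j] != prev_below:
            supernodes.append((start, j - 1))
            start = j
    supernodes.append((start, n - 1))
    return supernodes


# Hypothetical 5x5 factor pattern, one list per column.
patterns = [
    [0, 1, 2, 4],
    [1, 2, 4],
    [2, 4],
    [3, 4],
    [4],
]
print(find_supernodes(patterns))
```

In this example, columns 0 to 2 form one supernode and columns 3 to 4 form another. Each supernode can be stored as a dense trapezoidal block, and the differing block sizes are one source of the workload imbalance the abstract describes.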
