Multi-role SpTRSV on Sunway Many-Core Architecture

Sparse triangular solver (SpTRSV) is an important and indispensable building block for many scientific applications. The parallelism of SpTRSV is exploited using Level-Set method in literature, however this method still suffers from high synchronization cost and irregular global memory access especially on many-core architecture such as Sunway. In this paper, we propose an efficient implementation of SpTRSV using the massive computing resources on Sunway architecture. Specifically, we divide the 64 CPEs in a core group into three different roles, worker, router and storer. We also build a logical shared memory by carefully manipulating the scratchpad memory located in each storer and allow synchronization using the unique register communication on Sunway architecture. We partition the sparse matrix into multiple bands and replace the irregular global memory accesses with shared memory accesses, which significantly improves the data locality during the calculation of a band. Our experiments with 12 representative datasets demonstrate that our approach achieves up to 5.14x (2.65x on average) speedup.

[1]  Li Kenli,et al.  Implementing Molecular Dynamics Simulation on Sunway TaihuLight System , 2016 .

[2]  Yves Robert,et al.  STS-k: a multilevel sparse triangular solution scheme for NUMA multicores , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[3]  Chi Xue-bin,et al.  Extreme-Scale Phase Field Simulations of Coarsening Dynamics on the Sunway TaihuLight Supercomputer , 2016 .

[4]  Mahmut T. Kandemir,et al.  Optimizing Data Layouts for Parallel Computation on Multicores , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[5]  Guangwen Yang,et al.  swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[6]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[7]  Weiguo Liu,et al.  18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[8]  Wenguang Chen,et al.  Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[9]  Chao Yang,et al.  26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[10]  Hongbo Rong,et al.  Automating Wavefront Parallelization for Sparse Matrix Computations , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[11]  Pradeep Dubey,et al.  Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver , 2014, ISC.

[12]  Thomas R. Gross,et al.  Synchronized-by-Default Concurrency for Shared-Memory Systems , 2017, PPOPP.

[13]  Yousef Saad,et al.  Solving Sparse Triangular Linear Systems on Parallel Computers , 1989, Int. J. High Speed Comput..

[14]  Chau-Wen Tseng,et al.  Exploiting locality for irregular scientific codes , 2006, IEEE Transactions on Parallel and Distributed Systems.

[15]  Idit Keidar,et al.  SALSA: scalable and low synchronization NUMA-aware algorithm for producer-consumer pools , 2012, SPAA '12.

[16]  Brian Vinter,et al.  A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves , 2016, Euro-Par.

[17]  David A. Padua,et al.  Hydra: Automatic algorithm exploration from linear algebra equations , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[18]  Jan Mayer,et al.  Parallel algorithms for solving linear systems with sparse triangular matrices , 2009, Computing.

[19]  Maurice Herlihy,et al.  Concurrent Data Structures for Near-Memory Computing , 2017, SPAA.

[20]  Joel H. Saltz,et al.  Aggregation Methods for Solving Sparse Triangular Systems on Multiprocessors , 1990, SIAM J. Sci. Comput..

[21]  Edmond Chow,et al.  Iterative Sparse Triangular Solves for Preconditioning , 2015, Euro-Par.

[22]  Mary W. Hall,et al.  Loop and data transformations for sparse matrix code , 2015, PLDI.

[23]  Erik G. Boman,et al.  Factors Impacting Performance of Multithreaded Sparse Triangular Solve , 2010, VECPAR.

[24]  Edmond Chow,et al.  Fine-Grained Parallel Incomplete LU Factorization , 2015, SIAM J. Sci. Comput..