Parallel Transposition of Sparse Data Structures

Many applications in the computational and social sciences exploit the sparsity and connectivity of acquired data. Although many parallel sparse primitives, such as sparse matrix-vector multiplication (SpMV), have been studied extensively, other important building blocks, e.g., parallel transposition of sparse matrices and graphs, have not received the attention they deserve. In this paper, we first show that the transposition operation can be a bottleneck in some fundamental sparse matrix and graph algorithms. We then revisit the performance and scalability of parallel transposition approaches on x86-based multi-core and many-core processors. Based on the insights obtained, we propose two new parallel transposition algorithms: ScanTrans and MergeTrans. The experimental results show that ScanTrans achieves an average 2.8-fold (up to 6.2-fold) speedup over the parallel transposition in the latest vendor-supplied library on an Intel multi-core CPU platform, and that MergeTrans achieves an average 3.4-fold (up to 11.7-fold) speedup on an Intel Xeon Phi many-core processor.
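For readers unfamiliar with the operation, the following is a minimal serial sketch of sparse transposition in compressed sparse row (CSR) form, using the classic count, prefix-sum, and scatter scheme. It is not the ScanTrans or MergeTrans algorithm proposed here, which the abstract does not describe in detail. The function name `csr_transpose` and its parameter names are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def csr_transpose(
    n_rows: int,
    n_cols: int,
    row_ptr: Sequence[int],
    col_idx: Sequence[int],
    vals: Sequence[float],
) -> Tuple[List[int], List[int], List[float]]:
    """Serial CSR transposition (illustrative baseline, not the paper's method).

    Returns (t_ptr, t_idx, t_vals), the CSR arrays of the transposed matrix,
    which has n_cols rows and n_rows columns.
    """
    nnz = row_ptr[n_rows]

    # Step 1: histogram. Count the nonzeros in each column of the input;
    # these become the row lengths of the transpose.
    t_ptr = [0] * (n_cols + 1)
    for k in range(nnz):
        t_ptr[col_idx[k] + 1] += 1

    # Step 2: prefix sum. Turn the counts into row start offsets.
    for j in range(n_cols):
        t_ptr[j + 1] += t_ptr[j]

    # Step 3: scatter. Walk the input row by row and drop each nonzero
    # into the next free slot of its destination row.
    next_slot = list(t_ptr[:n_cols])
    t_idx = [0] * nnz
    t_vals = [0.0] * nnz
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            c = col_idx[k]
            dst = next_slot[c]
            t_idx[dst] = i
            t_vals[dst] = vals[k]
            next_slot[c] += 1

    return t_ptr, t_idx, t_vals


if __name__ == "__main__":
    # 3 x 4 example matrix:
    # [[1, 0, 2, 0],
    #  [0, 3, 0, 0],
    #  [4, 0, 5, 6]]
    ptr, idx, data = csr_transpose(
        3, 4, [0, 2, 3, 6], [0, 2, 1, 0, 2, 3], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    )
    print(ptr, idx, data)
```

The scatter step writes to locations determined by the column indices, so a naive parallel version would have threads contending for the same `next_slot` entries. That contention is one reason parallel transposition is harder than this serial sketch suggests.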
