Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling

Tensor algebra involving multiple sparse operands is severely memory bound, making it a challenging target for acceleration. Furthermore, irregular sparsity complicates traditional techniques, such as tiling, for ameliorating memory bottlenecks. Prior sparse tiling schemes are sparsity-unaware: they carve tensors into uniform coordinate-space shapes, which leads to low-occupancy tiles and thus less exploitable reuse. To address these challenges, this paper proposes dynamic reflexive tiling (DRT), a novel tiling method that improves data reuse over prior art for sparse tensor kernels, unlocking significant performance improvement opportunities. DRT’s key idea is dynamic sparsity-aware tiling: DRT continuously re-tiles sparse tensors at runtime based on the current sparsity of the active regions of all input tensors, maximizing accelerator buffer utilization while retaining the ability to co-iterate through tiles of distinct tensors. Through an extensive evaluation over a set of SuiteSparse matrices, we show how DRT can be applied to multiple prior accelerators with different dataflows (ExTensor, OuterSPACE, and MatRaptor), improving their performance by 3.3×, 5.1×, and 1.6×, respectively, while adding negligible area overhead. We apply DRT to higher-order tensor kernels, reducing DRAM traffic by 3.9× and 16.9× relative to a CPU implementation and a prior-art tiling scheme, respectively. Finally, we show that the technique is portable to software, reducing memory overhead by 7.29× and 2.94× compared to untiled sparse-sparse matrix multiplication (SpMSpM).
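To make the sparsity-aware tiling idea concrete, below is a minimal Python sketch that contrasts occupancy-based tile boundaries with prior art's uniform coordinate-space tiles along one dimension of a CSR matrix. This is an illustration under stated assumptions, not the paper's hardware algorithm: the names `dynamic_row_tiles` and `nnz_budget` are ours, and the sketch omits DRT's reflexive aspect, in which tile boundaries are also co-determined by the occupancy of the co-iterated operand's active region.

```python
# Minimal sketch of occupancy-based (sparsity-aware) tiling along one
# dimension of a CSR matrix. Illustrative only: function and parameter
# names are not from the paper, and the full DRT scheme also sizes tiles
# against the occupancy of the co-iterated operand, which this omits.

def dynamic_row_tiles(row_ptr, nnz_budget):
    """Split rows into half-open (start, end) tiles of <= nnz_budget nonzeros.

    row_ptr: CSR row-pointer array of length num_rows + 1.
    A single row denser than the budget becomes its own singleton tile.
    """
    tiles, start = [], 0
    num_rows = len(row_ptr) - 1
    for r in range(num_rows):
        # Nonzeros the tile would hold if extended through row r.
        occupancy = row_ptr[r + 1] - row_ptr[start]
        if occupancy > nnz_budget and r > start:
            tiles.append((start, r))  # close the tile before it overflows
            start = r
    tiles.append((start, num_rows))
    return tiles

def uniform_row_tiles(num_rows, rows_per_tile):
    """Prior-art coordinate-space tiling: fixed shape, sparsity-oblivious."""
    return [(r, min(r + rows_per_tile, num_rows))
            for r in range(0, num_rows, rows_per_tile)]

if __name__ == "__main__":
    # Toy 8-row matrix with skewed sparsity; per-row nonzero counts:
    # [1, 1, 1, 1, 20, 1, 1, 1]
    row_ptr = [0, 1, 2, 3, 4, 24, 25, 26, 27]
    print(dynamic_row_tiles(row_ptr, nnz_budget=8))
    # -> [(0, 4), (4, 5), (5, 8)]: the dense row is isolated
    print(uniform_row_tiles(num_rows=8, rows_per_tile=2))
    # -> [(0, 2), (2, 4), (4, 6), (6, 8)]: tile (4, 6) holds 21 nonzeros
```

On the skewed toy matrix, the occupancy-based scheme isolates the dense row (which exceeds the budget on its own, so it becomes a singleton tile) and packs the remaining sparse rows together, whereas the uniform scheme produces one overfull tile and three nearly empty ones, mirroring the low-occupancy-tile problem described above.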

[1] Christopher W. Fletcher et al., Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (Extended Abstract), HOPC@SPAA, 2023.

[2] Ümit V. Çatalyürek et al., On Symmetric Rectilinear Partitioning, ACM J. Exp. Algorithmics, 2022.

[3] Jaehyuk Huh et al., InnerSP: A Memory Efficient Sparse Matrix Multiplication Accelerator with Locality-Aware Inner Product Processing, 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2021.

[4] J. Emer et al., Gamma: leveraging Gustavson’s algorithm to accelerate sparse matrix multiplication, International Conference on Architectural Support for Programming Languages and Operating Systems, 2021.

[5] Süreyya Emre Kurt et al., Efficient Tiled Sparse Matrix Multiplication through Matrix Signatures, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.

[6] Nitish Srivastava et al., MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product, 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020.

[7] Shao-Yi Chien et al., GrateTile: Efficient Sparse Tensor Tiling for CNN Processing, IEEE Workshop on Signal Processing Systems (SiPS), 2020.

[8] Erich Elsen et al., Sparse GPU Kernels for Deep Learning, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.

[9] E. Boman et al., On Optimal Partitioning For Sparse Matrices In Variable Block Row Format, arXiv, 2020.

[10] Bahar Asgari et al., ALRESCHA: A Lightweight Reconfigurable Sparse-Computation Accelerator, IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020.

[11] Song Han et al., SpArch: Efficient Architecture for Sparse Matrix Multiplication, IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020.

[12] Dipankar Das et al., SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training, IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020.

[13] Nitish Srivastava et al., Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations, IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020.

[14] Ariful Azad et al., Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors, Parallel Comput., 2019.

[15] Donghyuk Lee et al., Near-memory data transformation for efficient sparse matrix multi-vector multiplication, SC, 2019.

[16] Gunnar Rätsch et al., Communication-Efficient Jaccard similarity for High-Performance Distributed Genome Comparisons, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020.

[17] Vivienne Sze et al., Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2019.

[18] Aamer Jaleel et al., ExTensor: An Accelerator for Sparse Tensor Algebra, MICRO, 2019.

[19] Nathan Beckmann et al., PHI: Architectural Support for Synchronization- and Bandwidth-Efficient Commutative Scatter Updates, MICRO, 2019.

[20] Tze Meng Low et al., Efficient SpMV Operation for Large and Highly Sparse Matrices using Scalable Multi-way Merge Parallelization, MICRO, 2019.

[21] T. N. Vijaykumar et al., SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks, MICRO, 2019.

[22] Xiaolan Liu et al., A Sequentially Truncated Higher Order Singular Value Decomposition-Based Algorithm for Tensor Completion, IEEE Transactions on Cybernetics, 2019.

[23] Jason Clemons et al., Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration, ASPLOS, 2019.

[24] P. Sadayappan et al., Adaptive sparse tiling for sparse matrix multiplication, PPoPP, 2019.

[25] Katherine Yelick et al., BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper, bioRxiv, 2018.

[26] Christopher W. Fletcher et al., Morph: Flexible Acceleration for 3D CNN-Based Video Understanding, 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018.

[27] Saman P. Amarasinghe et al., Format abstraction for sparse tensor algebra compilers, Proc. ACM Program. Lang., 2018.

[28] Mengjia Yan et al., UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition, ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018.

[29] David Blaauw et al., OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator, IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018.

[30] John D. Owens et al., Design Principles for Sparse Matrix Multiplication on the GPU, Euro-Par, 2018.

[31] Georgios A. Pavlopoulos et al., HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks, Nucleic Acids Research, 2018.

[32] Shoaib Kamil et al., The tensor algebra compiler, Proc. ACM Program. Lang., 2017.

[33] Hans-Peter Seidel et al., Globally homogeneous, locally adaptive sparse matrix-vector multiplication on the GPU, ICS, 2017.

[34] Vivienne Sze et al., Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Proceedings of the IEEE, 2017.

[35] Torsten Hoefler et al., Scaling Betweenness Centrality using Communication-Efficient Sparse Matrix Multiplication, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis, 2017.

[36] Vivienne Sze et al., Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016.

[37] Michael Garland et al., Merge-based sparse matrix-vector multiplication (SpMV) using the CSR storage format, PPoPP, 2016.

[38] Torsten Hoefler et al., Sparse Tensor Algebra as a Parallel Programming Model, arXiv, 2015.

[39] Tamara G. Kolda et al., Parallel Tensor Compression for Large-Scale Scientific Data, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.

[40] John R. Gilbert et al., Parallel Triangle Counting and Enumeration Using Matrix Algebra, IEEE International Parallel and Distributed Processing Symposium Workshop, 2015.

[41] Jia Wang et al., DaDianNao: A Machine-Learning Supercomputer, 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.

[42] Austin R. Benson et al., A framework for practical parallel fast matrix multiplication, PPoPP, 2014.

[43] Michael Stonebraker et al., Standards for graph algorithm primitives, IEEE High Performance Extreme Computing Conference (HPEC), 2013.

[44] Jure Leskovec et al., SNAP Datasets: Stanford Large Network Dataset Collection, 2014.

[45] Samuel Williams et al., Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations, IEEE 28th International Parallel and Distributed Processing Symposium, 2014.

[46] Daniel Kats et al., Sparse tensor framework for implementation of general local correlation methods, The Journal of Chemical Physics, 2013.

[47] Timothy A. Davis et al., The University of Florida sparse matrix collection, TOMS, 2011.

[48] John R. Gilbert et al., Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments, SIAM J. Sci. Comput., 2011.

[49] Samuel Williams et al., Optimization of sparse matrix-vector multiplication on emerging multicore platforms, ACM/IEEE Conference on Supercomputing (SC '07), 2007.

[50] Hyun Jin Moon et al., Fast Sparse Matrix-Vector Multiplication by Exploiting Variable Block Structure, HPCC, 2005.

[51] David E. Bernholdt et al., Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models, Proceedings of the IEEE, 2005.

[52] Sebastiano Vigna et al., The webgraph framework I: compression techniques, WWW '04, 2004.

[53] Larry Carter et al., Sparse Tiling for Stationary Iterative Methods, Int. J. High Perform. Comput. Appl., 2004.

[54] S. van Dongen, Graph clustering by flow simulation, 2000.

[55] A. Einstein, The Foundation of the General Theory of Relativity, 1916.

[56] A. Einstein, Die Grundlage der allgemeinen Relativitätstheorie [The Foundation of the General Theory of Relativity], 1916.