Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling

Tensor algebra involving multiple sparse operands is severely memory bound, making it a challenging target for acceleration. Furthermore, irregular sparsity complicates traditional techniques, such as tiling, for ameliorating memory bottlenecks. Prior sparse tiling schemes are sparsity-unaware: they carve tensors into uniform coordinate-space shapes, which leads to low-occupancy tiles and thus less exploitable reuse. To address these challenges, this paper proposes dynamic reflexive tiling (DRT), a novel tiling method that improves data reuse over prior art for sparse tensor kernels, unlocking significant performance improvement opportunities. DRT’s key idea is dynamic sparsity-aware tiling: DRT continuously re-tiles sparse tensors at runtime based on the current sparsity of the active regions of all input tensors, maximizing accelerator buffer utilization while retaining the ability to co-iterate through tiles of distinct tensors. Through an extensive evaluation over a set of SuiteSparse matrices, we show how DRT can be applied to multiple prior accelerators with different dataflows (ExTensor, OuterSPACE, and MatRaptor), improving their performance by 3.3×, 5.1×, and 1.6×, respectively, while adding negligible area overhead. We apply DRT to higher-order tensor kernels, reducing DRAM traffic by 3.9× and 16.9× relative to a CPU implementation and a prior-art tiling scheme, respectively. Finally, we show that the technique is portable to software, reducing memory overhead by 7.29× and 2.94× compared to untiled sparse-sparse matrix multiplication (SpMSpM).
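To make the sparsity-aware tiling idea concrete, below is a minimal Python sketch that contrasts occupancy-based tile boundaries with prior art's uniform coordinate-space tiles along one dimension of a CSR matrix. This is an illustration under stated assumptions, not the paper's hardware algorithm: the names `dynamic_row_tiles` and `nnz_budget` are ours, and the sketch omits DRT's reflexive aspect, in which tile boundaries are also co-determined by the occupancy of the co-iterated operand's active region.

```python
# Minimal sketch of occupancy-based (sparsity-aware) tiling along one
# dimension of a CSR matrix. Illustrative only: function and parameter
# names are not from the paper, and the full DRT scheme also sizes tiles
# against the occupancy of the co-iterated operand, which this omits.

def dynamic_row_tiles(row_ptr, nnz_budget):
    """Split rows into half-open (start, end) tiles of <= nnz_budget nonzeros.

    row_ptr: CSR row-pointer array of length num_rows + 1.
    A single row denser than the budget becomes its own singleton tile.
    """
    tiles, start = [], 0
    num_rows = len(row_ptr) - 1
    for r in range(num_rows):
        # Nonzeros the tile would hold if extended through row r.
        occupancy = row_ptr[r + 1] - row_ptr[start]
        if occupancy > nnz_budget and r > start:
            tiles.append((start, r))  # close the tile before it overflows
            start = r
    tiles.append((start, num_rows))
    return tiles

def uniform_row_tiles(num_rows, rows_per_tile):
    """Prior-art coordinate-space tiling: fixed shape, sparsity-oblivious."""
    return [(r, min(r + rows_per_tile, num_rows))
            for r in range(0, num_rows, rows_per_tile)]

if __name__ == "__main__":
    # Toy 8-row matrix with skewed sparsity; per-row nonzero counts:
    # [1, 1, 1, 1, 20, 1, 1, 1]
    row_ptr = [0, 1, 2, 3, 4, 24, 25, 26, 27]
    print(dynamic_row_tiles(row_ptr, nnz_budget=8))
    # -> [(0, 4), (4, 5), (5, 8)]: the dense row is isolated
    print(uniform_row_tiles(num_rows=8, rows_per_tile=2))
    # -> [(0, 2), (2, 4), (4, 6), (6, 8)]: tile (4, 6) holds 21 nonzeros
```

On the skewed toy matrix, the occupancy-based scheme isolates the dense row (which exceeds the budget on its own, so it becomes a singleton tile) and packs the remaining sparse rows together, whereas the uniform scheme produces one overfull tile and three nearly empty ones, mirroring the low-occupancy-tile problem described above.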

[1] Christopher W. Fletcher et al., Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (Extended Abstract), HOPC@SPAA, 2023.

[2] Ümit V. Çatalyürek et al., On Symmetric Rectilinear Partitioning, ACM J. Exp. Algorithmics, 2022.

[3] Jaehyuk Huh et al., InnerSP: A Memory Efficient Sparse Matrix Multiplication Accelerator with Locality-Aware Inner Product Processing, 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2021.

[4] J. Emer et al., Gamma: leveraging Gustavson’s algorithm to accelerate sparse matrix multiplication, International Conference on Architectural Support for Programming Languages and Operating Systems, 2021.

[5] Süreyya Emre Kurt et al., Efficient Tiled Sparse Matrix Multiplication through Matrix Signatures, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.

[6] Nitish Srivastava et al., MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product, 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020.

[7] Shao-Yi Chien et al., GrateTile: Efficient Sparse Tensor Tiling for CNN Processing, IEEE Workshop on Signal Processing Systems (SiPS), 2020.

[8] Erich Elsen et al., Sparse GPU Kernels for Deep Learning, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.

[9] E. Boman et al., On Optimal Partitioning For Sparse Matrices In Variable Block Row Format, arXiv, 2020.

[10] Bahar Asgari et al., ALRESCHA: A Lightweight Reconfigurable Sparse-Computation Accelerator, IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020.

[11] Song Han et al., SpArch: Efficient Architecture for Sparse Matrix Multiplication, IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020.

[12] Dipankar Das et al., SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training, IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020.

[13] Nitish Srivastava et al., Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations, IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020.

[14] Ariful Azad et al., Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors, Parallel Comput., 2019.

[15] Donghyuk Lee et al., Near-memory data transformation for efficient sparse matrix multi-vector multiplication, SC, 2019.

[16] Gunnar Rätsch et al., Communication-Efficient Jaccard similarity for High-Performance Distributed Genome Comparisons, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020.

[17] Vivienne Sze et al., Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2019.

[18] Aamer Jaleel et al., ExTensor: An Accelerator for Sparse Tensor Algebra, MICRO, 2019.

[19] Nathan Beckmann et al., PHI: Architectural Support for Synchronization- and Bandwidth-Efficient Commutative Scatter Updates, MICRO, 2019.

[20] Tze Meng Low et al., Efficient SpMV Operation for Large and Highly Sparse Matrices using Scalable Multi-way Merge Parallelization, MICRO, 2019.

[21] T. N. Vijaykumar et al., SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks, MICRO, 2019.

[22] Xiaolan Liu et al., A Sequentially Truncated Higher Order Singular Value Decomposition-Based Algorithm for Tensor Completion, IEEE Transactions on Cybernetics, 2019.

[23] Jason Clemons et al., Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration, ASPLOS, 2019.

[24] P. Sadayappan et al., Adaptive sparse tiling for sparse matrix multiplication, PPoPP, 2019.

[25] Katherine Yelick et al., BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper, bioRxiv, 2018.

[26] Christopher W. Fletcher et al., Morph: Flexible Acceleration for 3D CNN-Based Video Understanding, 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018.

[27] Saman P. Amarasinghe et al., Format abstraction for sparse tensor algebra compilers, Proc. ACM Program. Lang., 2018.

[28] Mengjia Yan et al., UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition, ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018.

[29] David Blaauw et al., OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator, IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018.

[30] John D. Owens et al., Design Principles for Sparse Matrix Multiplication on the GPU, Euro-Par, 2018.

[31] Georgios A. Pavlopoulos et al., HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks, Nucleic Acids Research, 2018.

[32] Shoaib Kamil et al., The tensor algebra compiler, Proc. ACM Program. Lang., 2017.

[33] Hans-Peter Seidel et al., Globally homogeneous, locally adaptive sparse matrix-vector multiplication on the GPU, ICS, 2017.

[34] Vivienne Sze et al., Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Proceedings of the IEEE, 2017.

[35] Torsten Hoefler et al., Scaling Betweenness Centrality using Communication-Efficient Sparse Matrix Multiplication, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis, 2017.

[36] Vivienne Sze et al., Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016.

[37] Michael Garland et al., Merge-based sparse matrix-vector multiplication (SpMV) using the CSR storage format, PPoPP, 2016.

[38] Torsten Hoefler et al., Sparse Tensor Algebra as a Parallel Programming Model, arXiv, 2015.

[39] Tamara G. Kolda et al., Parallel Tensor Compression for Large-Scale Scientific Data, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.

[40] John R. Gilbert et al., Parallel Triangle Counting and Enumeration Using Matrix Algebra, IEEE International Parallel and Distributed Processing Symposium Workshop, 2015.

[41] Jia Wang et al., DaDianNao: A Machine-Learning Supercomputer, 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.

[42] Austin R. Benson et al., A framework for practical parallel fast matrix multiplication, PPoPP, 2014.

[43] Michael Stonebraker et al., Standards for graph algorithm primitives, IEEE High Performance Extreme Computing Conference (HPEC), 2013.

[44] Jure Leskovec et al., SNAP Datasets: Stanford Large Network Dataset Collection, 2014.

[45] Samuel Williams et al., Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations, IEEE 28th International Parallel and Distributed Processing Symposium, 2014.

[46] Daniel Kats et al., Sparse tensor framework for implementation of general local correlation methods, The Journal of Chemical Physics, 2013.

[47] Timothy A. Davis et al., The University of Florida sparse matrix collection, TOMS, 2011.

[48] John R. Gilbert et al., Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments, SIAM J. Sci. Comput., 2011.

[49] Samuel Williams et al., Optimization of sparse matrix-vector multiplication on emerging multicore platforms, ACM/IEEE Conference on Supercomputing (SC '07), 2007.

[50] Hyun Jin Moon et al., Fast Sparse Matrix-Vector Multiplication by Exploiting Variable Block Structure, HPCC, 2005.

[51] David E. Bernholdt et al., Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models, Proceedings of the IEEE, 2005.

[52] Sebastiano Vigna et al., The webgraph framework I: compression techniques, WWW '04, 2004.

[53] Larry Carter et al., Sparse Tiling for Stationary Iterative Methods, Int. J. High Perform. Comput. Appl., 2004.

[54] S. van Dongen, Graph clustering by flow simulation, 2000.

[55] A. Einstein, The Foundation of the General Theory of Relativity, 1916.

[56] A. Einstein, Die Grundlage der allgemeinen Relativitätstheorie [The Foundation of the General Theory of Relativity], 1916.