TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators

Over the past few years, the explosion in sparse tensor algebra workloads has led to a corresponding rise in domain-specific accelerators to serve them. Due to the irregularity present in sparse tensors, these accelerators employ a wide variety of novel solutions to achieve good performance. At the same time, prior work on design-flexible sparse accelerator modeling does not express this full range of design features, making it difficult to understand the impact of each design choice and to compare against or extend the state of the art. To address this, we propose TeAAL: a language and compiler for the concise and precise specification and evaluation of sparse tensor algebra architectures. We use TeAAL to represent and evaluate four disparate state-of-the-art accelerators (ExTensor, Gamma, OuterSPACE, and SIGMA) and verify that it reproduces their performance with high accuracy. Finally, we demonstrate the potential of TeAAL as a tool for designing new accelerators by showing how it can be used to speed up Graphicionado, by $38\times$ on BFS and $4.3\times$ on SSSP.
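To make this concrete: the four accelerators above all target (among other kernels, in ExTensor's case) the matrix multiplication Einsum $Z_{m,n} = \sum_k A_{m,k} \cdot B_{k,n}$, and differ largely in how they order and parallelize iteration over the ranks $m$, $n$, and $k$. The following sketch (plain Python, not TeAAL's actual syntax; the dict-of-dicts sparse layout and the function names are hypothetical stand-ins) illustrates how reordering the loop nest of this one Einsum yields two of the distinct dataflows named above: the row-wise (Gustavson) dataflow that Gamma builds on and the outer-product dataflow used by OuterSPACE.

    # A minimal sketch, assuming a dict-of-dicts encoding of sparse matrices
    # ({row: {col: val}}); only nonzeros are stored and iterated.
    from collections import defaultdict

    def gustavson(A, B):
        """Row-wise product (the dataflow Gamma builds on): for each
        nonzero A[m][k], scale row k of B and accumulate into row m of Z."""
        Z = defaultdict(float)
        for m, row in A.items():              # nonzero rows of A
            for k, a in row.items():          # nonzeros within row m
                for n, b in B.get(k, {}).items():
                    Z[(m, n)] += a * b
        return dict(Z)

    def outer_product(A_cols, B):
        """Outer-product dataflow (as in OuterSPACE): the contracted rank k
        is outermost; partial outer products are merged into Z."""
        Z = defaultdict(float)
        for k, col in A_cols.items():         # A stored column-major here
            for m, a in col.items():
                for n, b in B.get(k, {}).items():
                    Z[(m, n)] += a * b
        return dict(Z)

    # Same A in two layouts: row-major for Gustavson, column-major for
    # the outer-product dataflow. Both loop orders compute the same Einsum.
    A      = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
    A_cols = {0: {0: 1.0}, 1: {1: 3.0}, 2: {0: 2.0}}
    B      = {0: {1: 4.0}, 1: {0: 5.0}, 2: {2: 6.0}}
    assert gustavson(A, B) == outer_product(A_cols, B)

A declarative framework lets a compiler derive any such loop nest, together with the matching sparse-format traversals, from the Einsum plus a mapping specification, rather than requiring each dataflow variant to be hand-written and hand-modeled.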

[1] Christopher W. Fletcher, et al. Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling, 2023, ASPLOS.

[2] José L. Abellán, et al. Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing, 2023, ASPLOS.

[3] K. Olukotun, et al. The Sparse Abstract Machine, 2022, ASPLOS.

[4] Saman P. Amarasinghe, et al. Autoscheduling for sparse tensor algebra with an asymptotic cost model, 2022, PLDI.

[5] Yannan Nellie Wu, et al. Sparseloop: An Analytical Approach To Sparse Tensor Accelerator Modeling, 2022, MICRO.

[6] Rajgopal Kannan, et al. Reconfigurable Low-latency Memory System for Sparse Matricized Tensor Times Khatri-Rao Product on FPGA, 2021, HPEC.

[7] José L. Abellán, et al. STONNE: Enabling Cycle-Level Microarchitectural Simulation for DNN Inference Accelerators, 2021, IEEE Computer Architecture Letters.

[8] James Demmel, et al. CoSA: Scheduling by Constrained Optimization for Spatial Accelerators, 2021, ISCA.

[9] J. Emer, et al. Gamma: leveraging Gustavson's algorithm to accelerate sparse matrix multiplication, 2021, ASPLOS.

[10] Vikas Chandra, et al. Mind mappings: enabling efficient algorithm-accelerator mapping space search, 2021, ASPLOS.

[11] Vivienne Sze, et al. Sparseloop: An Analytical, Energy-Focused Design Space Exploration Methodology for Sparse Tensor Accelerators, 2021, ISPASS.

[12] Shoaib Kamil, et al. A sparse iteration space transformation framework for sparse tensor algebra, 2020, Proc. ACM Program. Lang.

[13] Nitish Srivastava, et al. MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product, 2020, MICRO.

[14] Andreas Moshovos, et al. TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training, 2020, MICRO.

[15] Marian Verhelst, et al. ZigZag: A Memory-Centric Rapid DNN Accelerator Design Space Exploration Framework, 2020, arXiv.

[16] V. Sze, et al. Efficient Processing of Deep Neural Networks, 2020, Synthesis Lectures on Computer Architecture.

[17] Dipankar Das, et al. SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training, 2020, HPCA.

[18] Nitish Srivastava, et al. Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations, 2020, HPCA.

[19] Song Han, et al. SpArch: Efficient Architecture for Sparse Matrix Multiplication, 2020, HPCA.

[20] Ariful Azad, et al. Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors, 2019, Parallel Comput.

[21] Vivienne Sze, et al. Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs, 2019, ICCAD.

[22] Aamer Jaleel, et al. ExTensor: An Accelerator for Sparse Tensor Algebra, 2019, MICRO.

[23] Jason Clemons, et al. Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration, 2019, ASPLOS.

[24] Brucek Khailany, et al. Timeloop: A Systematic Approach to DNN Accelerator Evaluation, 2019, ISPASS.

[25] Mingyu Gao, et al. Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators, 2018, ASPLOS.

[26] Xing Liu, et al. Blocking Optimization Techniques for Sparse Tensor Computation, 2018, IPDPS.

[27] Hyoukjun Kwon, et al. MAESTRO: An Open-source Infrastructure for Modeling Dataflows within Deep Learning Accelerators, 2018, arXiv.

[28] Saman P. Amarasinghe, et al. Format abstraction for sparse tensor algebra compilers, 2018, Proc. ACM Program. Lang.

[29] David Blaauw, et al. OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator, 2018, HPCA.

[30] Hyoukjun Kwon, et al. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects, 2018, ASPLOS.

[31] Shoaib Kamil, et al. The tensor algebra compiler, 2017, Proc. ACM Program. Lang.

[32] William J. Dally, et al. SCNN: An accelerator for compressed-sparse convolutional neural networks, 2017, ISCA.

[33] Xiaowei Li, et al. FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks, 2017, HPCA.

[34] Patrick Seewald, et al. Large-Scale Cubic-Scaling Random Phase Approximation Correlation Energy Calculations Using a Gaussian Basis, 2016, Journal of Chemical Theory and Computation.

[35] Margaret Martonosi, et al. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics, 2016, MICRO.

[36] Torsten Hoefler, et al. Scaling Betweenness Centrality using Communication-Efficient Sparse Matrix Multiplication, 2017, SC.

[37] Natalie D. Enright Jerger, et al. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing, 2016, ISCA.

[38] Vivienne Sze, et al. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, 2016, ISCA.

[39] Song Han, et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, 2016, ISCA.

[40] John R. Gilbert, et al. Parallel Triangle Counting and Enumeration Using Matrix Algebra, 2015, IPDPSW.

[41] Nikos D. Sidiropoulos, et al. SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication, 2015, IPDPS.

[42] Pradeep Dubey, et al. GraphMat: High performance graph analytics made productive, 2015, Proc. VLDB Endow.

[43] Michael Stonebraker, et al. Standards for graph algorithm primitives, 2013, HPEC.

[44] Jure Leskovec, et al. SNAP Datasets: Stanford Large Network Dataset Collection, 2014.

[45] Samuel Williams, et al. Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations, 2014, IPDPS.

[46] Rasmus Pagh, et al. Fast and scalable polynomial kernels via explicit feature maps, 2013, KDD.

[47] Frédo Durand, et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, 2013, PLDI.

[48] Joost VandeVondele, et al. Linear Scaling Self-Consistent Field Calculations with Millions of Atoms in the Condensed Phase, 2012, Journal of Chemical Theory and Computation.

[49] Timothy A. Davis, et al. The University of Florida sparse matrix collection, 2011, TOMS.

[50] S. van Dongen. Graph clustering by flow simulation, 2000, PhD thesis, University of Utrecht.

[51] Rajeev Motwani, et al. The PageRank Citation Ranking: Bringing Order to the Web, 1999, WWW.

[52] Haichen Shen, et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, 2018, OSDI.

[53] Joost VandeVondele, et al. cp2k: atomistic simulations of condensed matter systems, 2014.

[54] Trevor Mudge, et al. Notes on Calculating Computer Performance, 1995.

[55] A. Einstein. The Foundation of the General Theory of Relativity, 1916, Annalen der Physik.