TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators

Over the past few years, the explosion in sparse tensor algebra workloads has led to a corresponding rise in domain-specific accelerators to serve them. Due to the irregularity present in sparse tensors, these accelerators employ a wide variety of novel solutions to achieve good performance. At the same time, prior work on design-flexible sparse accelerator modeling does not express this full range of design features, making it difficult to understand the impact of each design choice and to compare against or extend the state of the art. To address this, we propose TeAAL: a language and compiler for the concise and precise specification and evaluation of sparse tensor algebra architectures. We use TeAAL to represent and evaluate four disparate state-of-the-art accelerators (ExTensor, Gamma, OuterSPACE, and SIGMA) and verify that it reproduces their performance with high accuracy. Finally, we demonstrate the potential of TeAAL as a tool for designing new accelerators by showing how it can be used to speed up Graphicionado, by $38\times$ on BFS and $4.3\times$ on SSSP.
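To make this concrete: the four accelerators above all target (among other kernels, in ExTensor's case) the matrix multiplication Einsum $Z_{m,n} = \sum_k A_{m,k} \cdot B_{k,n}$, and differ largely in how they order and parallelize iteration over the ranks $m$, $n$, and $k$. The following sketch (plain Python, not TeAAL's actual syntax; the dict-of-dicts sparse layout and the function names are hypothetical stand-ins) illustrates how reordering the loop nest of this one Einsum yields two of the distinct dataflows named above: the row-wise (Gustavson) dataflow that Gamma builds on and the outer-product dataflow used by OuterSPACE.

    # A minimal sketch, assuming a dict-of-dicts encoding of sparse matrices
    # ({row: {col: val}}); only nonzeros are stored and iterated.
    from collections import defaultdict

    def gustavson(A, B):
        """Row-wise product (the dataflow Gamma builds on): for each
        nonzero A[m][k], scale row k of B and accumulate into row m of Z."""
        Z = defaultdict(float)
        for m, row in A.items():              # nonzero rows of A
            for k, a in row.items():          # nonzeros within row m
                for n, b in B.get(k, {}).items():
                    Z[(m, n)] += a * b
        return dict(Z)

    def outer_product(A_cols, B):
        """Outer-product dataflow (as in OuterSPACE): the contracted rank k
        is outermost; partial outer products are merged into Z."""
        Z = defaultdict(float)
        for k, col in A_cols.items():         # A stored column-major here
            for m, a in col.items():
                for n, b in B.get(k, {}).items():
                    Z[(m, n)] += a * b
        return dict(Z)

    # Same A in two layouts: row-major for Gustavson, column-major for
    # the outer-product dataflow. Both loop orders compute the same Einsum.
    A      = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
    A_cols = {0: {0: 1.0}, 1: {1: 3.0}, 2: {0: 2.0}}
    B      = {0: {1: 4.0}, 1: {0: 5.0}, 2: {2: 6.0}}
    assert gustavson(A, B) == outer_product(A_cols, B)

A declarative framework lets a compiler derive any such loop nest, together with the matching sparse-format traversals, from the Einsum plus a mapping specification, rather than requiring each dataflow variant to be hand-written and hand-modeled.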

[1] Christopher W. Fletcher, et al. Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling, 2023, ASPLOS.

[2] José L. Abellán, et al. Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing, 2023, ASPLOS.

[3] K. Olukotun, et al. The Sparse Abstract Machine, 2022, ASPLOS.

[4] Saman P. Amarasinghe, et al. Autoscheduling for sparse tensor algebra with an asymptotic cost model, 2022, PLDI.

[5] Yannan Nellie Wu, et al. Sparseloop: An Analytical Approach To Sparse Tensor Accelerator Modeling, 2022, MICRO.

[6] Rajgopal Kannan, et al. Reconfigurable Low-latency Memory System for Sparse Matricized Tensor Times Khatri-Rao Product on FPGA, 2021, HPEC.

[7] José L. Abellán, et al. STONNE: Enabling Cycle-Level Microarchitectural Simulation for DNN Inference Accelerators, 2021, IEEE Computer Architecture Letters.

[8] James Demmel, et al. CoSA: Scheduling by Constrained Optimization for Spatial Accelerators, 2021, ISCA.

[9] J. Emer, et al. Gamma: leveraging Gustavson's algorithm to accelerate sparse matrix multiplication, 2021, ASPLOS.

[10] Vikas Chandra, et al. Mind mappings: enabling efficient algorithm-accelerator mapping space search, 2021, ASPLOS.

[11] Vivienne Sze, et al. Sparseloop: An Analytical, Energy-Focused Design Space Exploration Methodology for Sparse Tensor Accelerators, 2021, ISPASS.

[12] Shoaib Kamil, et al. A sparse iteration space transformation framework for sparse tensor algebra, 2020, Proc. ACM Program. Lang.

[13] Nitish Srivastava, et al. MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product, 2020, MICRO.

[14] Andreas Moshovos, et al. TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training, 2020, MICRO.

[15] Marian Verhelst, et al. ZigZag: A Memory-Centric Rapid DNN Accelerator Design Space Exploration Framework, 2020, arXiv.

[16] V. Sze, et al. Efficient Processing of Deep Neural Networks, 2020, Synthesis Lectures on Computer Architecture.

[17] Dipankar Das, et al. SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training, 2020, HPCA.

[18] Nitish Srivastava, et al. Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations, 2020, HPCA.

[19] Song Han, et al. SpArch: Efficient Architecture for Sparse Matrix Multiplication, 2020, HPCA.

[20] Ariful Azad, et al. Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors, 2019, Parallel Comput.

[21] Vivienne Sze, et al. Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs, 2019, ICCAD.

[22] Aamer Jaleel, et al. ExTensor: An Accelerator for Sparse Tensor Algebra, 2019, MICRO.

[23] Jason Clemons, et al. Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration, 2019, ASPLOS.

[24] Brucek Khailany, et al. Timeloop: A Systematic Approach to DNN Accelerator Evaluation, 2019, ISPASS.

[25] Mingyu Gao, et al. Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators, 2018, ASPLOS.

[26] Xing Liu, et al. Blocking Optimization Techniques for Sparse Tensor Computation, 2018, IPDPS.

[27] Hyoukjun Kwon, et al. MAESTRO: An Open-source Infrastructure for Modeling Dataflows within Deep Learning Accelerators, 2018, arXiv.

[28] Saman P. Amarasinghe, et al. Format abstraction for sparse tensor algebra compilers, 2018, Proc. ACM Program. Lang.

[29] David Blaauw, et al. OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator, 2018, HPCA.

[30] Hyoukjun Kwon, et al. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects, 2018, ASPLOS.

[31] Shoaib Kamil, et al. The tensor algebra compiler, 2017, Proc. ACM Program. Lang.

[32] William J. Dally, et al. SCNN: An accelerator for compressed-sparse convolutional neural networks, 2017, ISCA.

[33] Xiaowei Li, et al. FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks, 2017, HPCA.

[34] Patrick Seewald, et al. Large-Scale Cubic-Scaling Random Phase Approximation Correlation Energy Calculations Using a Gaussian Basis, 2016, Journal of Chemical Theory and Computation.

[35] Margaret Martonosi, et al. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics, 2016, MICRO.

[36] Torsten Hoefler, et al. Scaling Betweenness Centrality using Communication-Efficient Sparse Matrix Multiplication, 2017, SC.

[37] Natalie D. Enright Jerger, et al. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing, 2016, ISCA.

[38] Vivienne Sze, et al. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, 2016, ISCA.

[39] Song Han, et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, 2016, ISCA.

[40] John R. Gilbert, et al. Parallel Triangle Counting and Enumeration Using Matrix Algebra, 2015, IPDPSW.

[41] Nikos D. Sidiropoulos, et al. SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication, 2015, IPDPS.

[42] Pradeep Dubey, et al. GraphMat: High performance graph analytics made productive, 2015, Proc. VLDB Endow.

[43] Michael Stonebraker, et al. Standards for graph algorithm primitives, 2013, HPEC.

[44] Jure Leskovec, et al. SNAP Datasets: Stanford Large Network Dataset Collection, 2014.

[45] Samuel Williams, et al. Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations, 2014, IPDPS.

[46] Rasmus Pagh, et al. Fast and scalable polynomial kernels via explicit feature maps, 2013, KDD.

[47] Frédo Durand, et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, 2013, PLDI.

[48] Joost VandeVondele, et al. Linear Scaling Self-Consistent Field Calculations with Millions of Atoms in the Condensed Phase, 2012, Journal of Chemical Theory and Computation.

[49] Timothy A. Davis, et al. The University of Florida sparse matrix collection, 2011, TOMS.

[50] S. van Dongen. Graph clustering by flow simulation, 2000, PhD thesis, University of Utrecht.

[51] Rajeev Motwani, et al. The PageRank Citation Ranking: Bringing Order to the Web, 1999, WWW.

[52] Haichen Shen, et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, 2018, OSDI.

[53] Joost VandeVondele, et al. cp2k: atomistic simulations of condensed matter systems, 2014.

[54] Trevor Mudge, et al. Notes on Calculating Computer Performance, 1995.

[55] A. Einstein. The Foundation of the General Theory of Relativity, 1916, Annalen der Physik.