Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture

In recent years, attention-based models have achieved impressive performance in natural language processing and computer vision applications by effectively capturing contextual knowledge from the entire sequence. However, the attention mechanism inherently contains a large number of redundant connections, imposing a heavy computational burden on model deployment. To address this, sparse attention has emerged as an attractive approach to reduce the computation and memory footprint. It involves sampled dense-dense matrix multiplication (SDDMM) and sparse-dense matrix multiplication (SpMM) at the same time, thus requiring the hardware to eliminate zero-valued operations effectively. Existing techniques based on irregular sparse patterns or regular but coarse-grained patterns lead to low hardware efficiency or limited computation savings. This paper proposes Sanger, a framework that harvests sparsity in the attention mechanism through synergistic hardware and software co-design. The software part prunes the attention matrix into a dynamic structured pattern, and the hardware part features a reconfigurable architecture that exploits such patterns. Specifically, we dynamically sparsify vanilla attention based on a quantized prediction of the attention matrix, and the resulting sparse mask is re-arranged into structured blocks that are more amenable to hardware implementation. The hardware design of Sanger features a score-stationary dataflow that keeps sparse scores stationary in the PEs to avoid decoding overhead. Using this dataflow and a reconfigurable systolic array design, we unify the computation of the SDDMM and SpMM operations: the PEs can be configured at runtime to support different data access and partial-sum accumulation schemes. Experiments on BERT show that Sanger can prune the model to 0.08 - 0.27 sparsity without accuracy loss, achieving 4.64X, 22.7X, 2.39X, and 1.47X speedup over an NVIDIA V100 GPU, an AMD Ryzen Threadripper 3970X CPU, and the state-of-the-art attention accelerators A3 and SpAtten, respectively.

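As a rough illustration of the software side described in the abstract, the sketch below (in PyTorch) predicts the attention matrix from low-bit quantized Q and K, thresholds the prediction into a sparse mask, and then evaluates full-precision attention only where the mask is set. On Sanger's hardware this corresponds to an SDDMM over the masked positions followed by an SpMM with V; here both steps are only emulated densely, and the re-arrangement of the mask into structured blocks is omitted. The 4-bit width, the 0.02 threshold, and the helper names (fake_quantize, predict_sparse_mask, sparse_attention) are illustrative assumptions, not values or code from the paper.

```python
# Minimal sketch of mask prediction + masked attention, assuming PyTorch.
# Not the authors' implementation; parameters and helpers are illustrative.
import math
import torch


def fake_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization used only for mask prediction."""
    scale = x.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(x / scale) * scale


def predict_sparse_mask(q, k, threshold=0.02, bits=4):
    """Estimate which attention entries matter using low-bit Q/K products."""
    d = q.size(-1)
    approx_scores = fake_quantize(q, bits) @ fake_quantize(k, bits).transpose(-1, -2)
    approx_probs = torch.softmax(approx_scores / math.sqrt(d), dim=-1)
    mask = approx_probs >= threshold            # boolean mask of predicted-important entries
    keep_top = approx_probs.argmax(dim=-1, keepdim=True)
    mask.scatter_(-1, keep_top, True)           # always keep at least one key per query
    return mask


def sparse_attention(q, k, v, mask):
    """Full-precision attention restricted to the predicted mask.
    Conceptually: SDDMM (masked Q @ K^T), softmax, then SpMM (sparse probs @ V).
    Emulated densely here with masking."""
    d = q.size(-1)
    scores = (q @ k.transpose(-1, -2)) / math.sqrt(d)
    scores = scores.masked_fill(~mask, float("-inf"))
    probs = torch.softmax(scores, dim=-1)
    return probs @ v


if __name__ == "__main__":
    torch.manual_seed(0)
    q = torch.randn(1, 8, 128, 64)  # (batch, heads, seq_len, head_dim)
    k = torch.randn(1, 8, 128, 64)
    v = torch.randn(1, 8, 128, 64)
    mask = predict_sparse_mask(q, k)
    out = sparse_attention(q, k, v, mask)
    print(out.shape, "density:", mask.float().mean().item())
```

In this sketch the measured mask density plays the role of the attention sparsity reported in the abstract; the actual threshold in Sanger is tuned per task so that accuracy is preserved.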
[1] Aurko Roy, et al. Efficient Content-Based Sparse Attention with Routing Transformers, 2021, TACL.

[2] Wencong Xiao, et al. SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity Through Low-Bit Quantization, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.

[4] Jure Leskovec, et al. Hidden factors and hidden topics: understanding rating dimensions with review text, 2013, RecSys.

[5] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.

[6] Minyi Guo, et al. Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.

[7] Song Han, et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[8] Xuehai Qian, et al. HASCO: Towards Agile HArdware and Software CO-design for Tensor Computation, 2021, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA).

[9] Xuehai Zhou, et al. PuDianNao: A Polyvalent Machine Learning Accelerator, 2015, ASPLOS.

[10] Danyang Zhu, et al. A High-Speed and Low-Complexity Architecture for Softmax Function in Deep Learning, 2018, 2018 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS).

[11] Lukasz Kaiser, et al. Reformer: The Efficient Transformer, 2020, ICLR.

[12] Nitish Srivastava, et al. Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations, 2020, 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[13] Tianshi Chen, et al. ShiDianNao: Shifting vision processing closer to the sensor, 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[14] Guokun Lai, et al. Large-scale Cloze Test Dataset Created by Teachers, 2017, EMNLP.

[15] Yun Liang, et al. SpWA: An Efficient Sparse Winograd Convolutional Neural Networks Accelerator on FPGAs, 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).

[16] Onur Mutlu, et al. SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations, 2019, MICRO.

[17] Yun Liang, et al. An Efficient Hardware Design for Accelerating Sparse CNNs with NAS-based Models, 2021.

[18] Song Han, et al. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA, 2016, FPGA.

[19] Vivienne Sze, et al. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks, 2017, IEEE Journal of Solid-State Circuits.

[20] John Wawrzynek, et al. Chisel: Constructing hardware in a Scala embedded language, 2012, DAC Design Automation Conference 2012.

[21] Chunhua Deng, et al. PermDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices, 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[22] Ji Li, et al. FTRANS: energy-efficient acceleration of transformers using FPGA, 2020, ISLPED.

[23] William J. Dally, et al. SCNN: An accelerator for compressed-sparse convolutional neural networks, 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[24] Shaoli Liu, et al. Cambricon-X: An accelerator for sparse neural networks, 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[25] Liqiang Lu, et al. An Efficient Hardware Accelerator for Sparse Convolutional Neural Networks on FPGAs, 2019, 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

[26] Patrick Judd, et al. Bit-Tactical: A Software/Hardware Approach to Exploiting Value and Bit Sparsity in Neural Networks, 2019, ASPLOS.

[27] George Karypis, et al. Tensor-matrix products with a compressed sparse tensor, 2015, IA3@SC.

[28] Jason Cong, et al. TENET: A Framework for Modeling Tensor Dataflow Based on Relation-centric Notation, 2021, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA).

[29] Christoforos E. Kozyrakis, et al. Convolution engine: balancing efficiency & flexibility in specialized computing, 2013, ISCA.

[30] Han Zhang, et al. Self-Attention Generative Adversarial Networks, 2018, ICML.

[31] Yingming Li, et al. Fine-tune BERT with Sparse Self-Attention Mechanism, 2019, EMNLP.

[32] Ilya Sutskever, et al. Generating Long Sequences with Sparse Transformers, 2019, ArXiv.

[33] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[34] Jia Wang, et al. DaDianNao: A Machine-Learning Supercomputer, 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[35] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[36] Patrick Judd, et al. ShapeShifter: Enabling Fine-Grain Data Width Adaptation in Deep Learning, 2019, MICRO.

[37] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[38] Vivienne Sze, et al. Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices, 2018, IEEE Journal on Emerging and Selected Topics in Circuits and Systems.

[39] John Tran, et al. cuDNN: Efficient Primitives for Deep Learning, 2014, ArXiv.

[40] Yuan Xie, et al. Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs, 2019, MICRO.

[41] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[42] Li Yang, et al. Big Bird: Transformers for Longer Sequences, 2020, NeurIPS.

[43] H. T. Kung, et al. Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization, 2018, ASPLOS.

[44] H. T. Kung, et al. Maestro: A Memory-on-Logic Architecture for Coordinated Parallel Use of Many Systolic Arrays, 2019, 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP).

[45] Liu Yang, et al. Long Range Arena: A Benchmark for Efficient Transformers, 2020, ICLR.

[46] Liu Yang, et al. Sparse Sinkhorn Attention, 2020, ICML.

[47] Dipankar Das, et al. SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training, 2020, 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[48] Hyoukjun Kwon, et al. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects, 2018, ASPLOS.

[49] James Bennett, et al. The Netflix Prize, 2007.

[50] Yi Tay, et al. Efficient Transformers: A Survey, 2020, ArXiv.

[51] Zhiru Zhang, et al. Boosting the Performance of CNN Accelerators with Dynamic Fine-Grained Channel Gating, 2019, MICRO.

[52] Yun Liang, et al. OMNI: A Framework for Integrating Hardware and Software Optimizations for Sparse CNNs, 2021, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[53] Tao Li, et al. Prediction Based Execution on Deep Neural Networks, 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[54] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[55] Ninghui Sun, et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning, 2014, ASPLOS.

[56] Lysandre Debut, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, ArXiv.

[57] Hanrui Wang, et al. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning, 2020, 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA).

[58] Abhinav Gupta, et al. Non-local Neural Networks, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59] André F. T. Martins, et al. Adaptively Sparse Transformers, 2019, EMNLP.

[60] David Blaauw, et al. OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator, 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[61] Deog-Kyoon Jeong, et al. A^3: Accelerating Attention Mechanisms in Neural Networks with Approximation, 2020, 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[62] Tianshi Chen, et al. Cambricon-S: Addressing Irregularity in Sparse Neural Networks through A Cooperative Software/Hardware Approach, 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[63] Joel Emer, et al. Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks, 2016, CARN.

[64] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.

[65] Xuancheng Ren, et al. Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection, 2019, ArXiv.

[66] Chia-Lin Yang, et al. Sparse ReRAM Engine: Joint Exploration of Activation and Weight Sparsity in Compressed Neural Networks, 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).

[67] Yoshua Bengio, et al. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation, 2013, ArXiv.