SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training

The advent of Deep Learning (DL) has radically transformed the computing industry across the entire spectrum from algorithms to circuits. As myriad application domains embrace DL, it has become synonymous with a genre of workloads spanning vision, speech, language, recommendations, robotics, and games. The key compute kernel within most DL workloads is the general matrix-matrix multiplication (GEMM), which appears frequently during both the forward pass (inference and training) and the backward pass (training). GEMMs are a natural target for hardware acceleration to speed up training, and have led to 2D systolic architectures such as NVIDIA Tensor Cores and the Google Tensor Processing Unit (TPU). Unfortunately, emerging GEMMs in DL are highly irregular and sparse, leading to poor data mappings on systolic architectures. This paper proposes SIGMA, a flexible and scalable architecture that offers high utilization of all of its processing elements (PEs) regardless of kernel shape and sparsity. SIGMA includes a novel reduction tree microarchitecture named Forwarding Adder Network (FAN). SIGMA performs 5.7x better than systolic array architectures on irregular sparse matrices, and roughly 3x better than state-of-the-art sparse accelerators. We demonstrate an instance of SIGMA operating at 10.8 TFLOPS efficiency across arbitrary levels of sparsity, with a 65.10 mm^2 and 22.33 W footprint on a 28 nm process.
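To make the abstract's central claim concrete, the sketch below (not from the paper; the layer shapes and the 80% sparsity level are illustrative assumptions) shows how training a single fully connected layer reduces to three GEMMs, and how unstructured weight sparsity leaves a dense mapping spending most of its multiply-accumulates on zeros, which is the utilization gap SIGMA targets.

```python
import numpy as np

# Illustrative sketch: forward and backward passes of a fully connected
# layer are GEMMs, so GEMM utilization dominates training throughput.
batch, in_dim, out_dim = 32, 1024, 4096   # assumed, deliberately non-square ("irregular") shapes

X = np.random.randn(batch, in_dim).astype(np.float32)    # activations
W = np.random.randn(in_dim, out_dim).astype(np.float32)  # weights

# Emulate unstructured sparsity (e.g., from pruning): most weights are zero,
# but a dense GEMM still spends MACs multiplying by them.
sparsity = 0.8  # assumed level for illustration
W *= (np.random.rand(*W.shape) > sparsity)

# Forward pass: one GEMM.
Y = X @ W                                           # (batch, out_dim)

# Backward pass: two more GEMMs per layer.
dY = np.random.randn(*Y.shape).astype(np.float32)   # upstream gradient
dW = X.T @ dY                                       # weight gradient (in_dim, out_dim)
dX = dY @ W.T                                       # input gradient  (batch, in_dim)

# Fraction of forward-pass MACs wasted on zero weights in a dense mapping;
# a sparsity-aware design aims to skip this work entirely.
wasted = 1.0 - np.count_nonzero(W) / W.size
print(f"~{wasted:.0%} of forward-pass MACs multiply by zero weights")
```

At the assumed 80% weight sparsity, roughly four out of five multiply-accumulates in the dense forward GEMM contribute nothing, which is why rigid systolic mappings lose utilization on such workloads.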
