Systolic Building Block for Logic-on-Logic 3D-IC Implementations of Convolutional Neural Networks

We present a building block architecture for systolic array 3D-IC implementations of convolutional neural network (CNN) inference. The building block can be part of a library offered by a chip design service provider to support efficient CNN implementations. We describe how the building block can be composed into systolic arrays that deliver low-latency, energy-efficient CNN inference for models of any size, while incorporating advanced packaging features such as “logic-on-logic” 3D-IC (micro-bump/TSV, monolithic 3D, or other 3D technologies). We present delay and power analyses for 2D and 3D implementations, and argue that as systolic arrays scale in size, 3D implementations based on, e.g., micro-bump/TSV offer significant performance improvements over 2D implementations.
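To make the systolic dataflow concrete, the sketch below is a minimal functional model of a weight-stationary systolic array computing the matrix multiplication at the core of CNN inference (e.g., after an im2col lowering of a convolutional layer). It is an illustrative assumption, not the paper's specific building block: the array dimensions, the `systolic_matmul` helper, and the NumPy-based formulation are hypothetical, and inter-cell timing/skew is not modeled.

```python
# Functional (not cycle-accurate) sketch of a weight-stationary systolic
# array computing C = A @ W, the core operation of CNN inference after an
# im2col lowering. Illustrative assumption only, not the paper's design.
import numpy as np


def systolic_matmul(A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Model an R x C weight-stationary systolic array.

    A: (M, R) activation matrix (rows streamed in from the array edge).
    W: (R, C) weight matrix (one weight held stationary in each cell).
    Returns A @ W, with partial sums accumulated as they flow down columns.
    """
    M, R = A.shape
    R2, C = W.shape
    assert R == R2, "inner dimensions must match"

    out = np.zeros((M, C))
    for m in range(M):
        partial = np.zeros(C)             # partial sums moving down the columns
        for r in range(R):                # one array row of cells per step
            partial += A[m, r] * W[r, :]  # cell (r, c) does a multiply-accumulate
        out[m, :] = partial
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 4))       # 8 activation rows, inner dim 4
    W = rng.standard_normal((4, 4))       # 4x4 stationary weight array
    assert np.allclose(systolic_matmul(A, W), A @ W)
    print("systolic functional model matches A @ W")
```

A cycle-accurate model would additionally skew the activation stream so each column receives its inputs one cycle after its neighbor; the functional result is the same, and the 2D-vs-3D question in the paper concerns how the wiring that carries these operands scales as the array grows.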
