Efficient Scheduling of Irregular Network Structures on CNN Accelerators

State-of-the-art convolutional neural network (CNN) structures exhibit growing irregularity in their layer connections, stemming from innovative manual designs and from recently proposed neural architecture search approaches. Such irregular structures improve recognition accuracy but also bring challenges for hardware deployment, especially on CNN accelerators with regular architectures: 1) the complicated data dependencies make it nontrivial to decide the data reuse strategy between layers and 2) since the execution order of each network is not unique, the choice of layer scheduling, memory allocation, and loop tiling strategies greatly impacts hardware performance. These challenges cannot be solved by existing CNN schedulers, which mainly focus on the dataflow of a single layer. In this work, we propose a comprehensive framework to analyze and solve the mapping of an arbitrarily connected CNN onto a specific hardware accelerator. We propose: 1) a DAG partitioning approach based on dynamic programming and node clustering to efficiently exploit interlayer data reuse and 2) a subgraph scheduling and on-chip memory allocation strategy to find the optimal execution order. Building on a model of the CNN accelerator, we also propose a loop tiling approach for fused layers. An automated framework generates binary machine code from CNN models produced by mainstream deep learning frameworks and can process large-scale CNNs with more than 1000 layers in only a few minutes. Experiments on state-of-the-art accelerators (e.g., NVDLA) show that our techniques greatly reduce the external data transfers incurred by interlayer dependencies and bring significant performance improvements over existing approaches.
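As a concrete illustration of the dynamic-programming partitioning idea, the sketch below splits a linear chain of layers into fused groups so that intermediate feature maps stay on-chip and off-chip traffic is minimized. This is a minimal sketch under stated assumptions, not the paper's implementation: the `Layer` fields, the additive working-set model, and the `sram_capacity` check are hypothetical simplifications, and the actual framework handles general DAGs via a prior node-clustering step.

```python
# Minimal sketch of DP-based layer fusion on a linear chain of layers.
# Assumption: fusing layers[j:i] costs the group's external input plus its
# external output (intermediate feature maps stay on-chip), and is legal
# only while the group's combined working set fits in on-chip SRAM.
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    in_bytes: int    # size of the layer's input feature map
    out_bytes: int   # size of the layer's output feature map
    buf_bytes: int   # on-chip buffer needed to keep this layer's tiles resident

def partition_chain(layers: list[Layer], sram_capacity: int):
    """Return (minimal off-chip traffic, fused groups) for the chain.

    dp[i] = minimal external transfer for layers[0:i], considering every
    feasible last fused group layers[j:i].
    """
    n = len(layers)
    dp = [float("inf")] * (n + 1)
    cut = [0] * (n + 1)          # cut[i]: start index of the last fused group
    dp[0] = 0
    for i in range(1, n + 1):
        working_set = 0
        for j in range(i - 1, -1, -1):        # candidate group layers[j:i]
            working_set += layers[j].buf_bytes
            if working_set > sram_capacity:   # group no longer fits on-chip
                break
            cost = layers[j].in_bytes + layers[i - 1].out_bytes
            if dp[j] + cost < dp[i]:
                dp[i], cut[i] = dp[j] + cost, j
    # Recover the fused groups from the recorded cut points.
    groups, i = [], n
    while i > 0:
        groups.append([l.name for l in layers[cut[i]:i]])
        i = cut[i]
    return dp[n], groups[::-1]

if __name__ == "__main__":
    chain = [Layer("conv1", 64, 128, 96), Layer("conv2", 128, 128, 160),
             Layer("conv3", 128, 64, 128)]
    traffic, groups = partition_chain(chain, sram_capacity=300)
    print(traffic, groups)   # 384 [['conv1', 'conv2'], ['conv3']]
```

On this toy chain the DP fuses conv1 and conv2 (their intermediate feature map never leaves the chip) and schedules conv3 as its own group, because fusing all three would exceed the assumed SRAM capacity.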
