Efficient Mapping of CNNs onto Tightly Coupled Processor Arrays

In this work, we show how to systematically map Convolutional Neural Networks (CNNs) onto Tightly Coupled Processor Arrays (TCPAs), a class of massively parallel accelerators for many computationally intensive tasks (e.g., from the digital signal and image processing domain). Contrary to previous approaches and implementations, we propose techniques for the layer-parallel execution of CNNs on processor arrays, including the maximally overlapped processing of consecutive layers. This is achieved through layer fusion and loop unrolling, which exploit the full pipelining potential of such massively parallel architectures for given CNNs. These transformations are also necessary to decrease the number of on-chip/off-chip data transfers. For CNNs, we present a calculus for achievable performance and memory requirements on TCPAs. Based on this calculus, we show how throughput-maximal mappings can be determined for a given architecture or, alternatively, how resource-minimized mappings that sustain a given throughput (e.g., a number of frames per second) can be systematically derived. The approach is evaluated for a CNN model for the MNIST benchmark on a processor array of size 4x4, including a comparison of the performance of the layer-parallel approach against layer-by-layer processing.
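The contrast between layer-by-layer and fused, overlapped processing of consecutive layers can be illustrated with a minimal Python sketch. It is not the TCPA mapping itself: it assumes two one-dimensional convolution layers in place of a real CNN, and the names `conv1d`, `layer_by_layer`, and `fused`, as well as the kernels and input, are hypothetical choices made for illustration only. The sketch shows that streaming the first layer's outputs directly into the second layer yields the same result while buffering only a small window of intermediate values.

```python
from collections import deque


def conv1d(signal, kernel):
    """Valid-mode 1D correlation: one output per full kernel overlap."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def layer_by_layer(signal, k1, k2):
    """Run layer 1 to completion, store its whole output, then run layer 2."""
    intermediate = conv1d(signal, k1)  # the full feature map is buffered
    return conv1d(intermediate, k2), len(intermediate)


def fused(signal, k1, k2):
    """Stream layer-1 outputs into layer 2 through a window of len(k2) values."""
    window = deque(maxlen=len(k2))  # the only buffer between the two layers
    out = []
    for i in range(len(signal) - len(k1) + 1):
        # Produce one layer-1 value ...
        window.append(sum(signal[i + j] * k1[j] for j in range(len(k1))))
        # ... and consume it as soon as layer 2 has a full window.
        if len(window) == len(k2):
            out.append(sum(w * c for w, c in zip(window, k2)))
    return out, window.maxlen


# Hypothetical input and kernels, chosen only to exercise the two variants.
signal = [1, 2, 3, 4, 5, 6, 7, 8]
k1 = [1, 0, -1]
k2 = [1, 2]

ref, ref_buf = layer_by_layer(signal, k1, k2)
fus, fus_buf = fused(signal, k1, k2)
assert ref == fus          # both schedules compute the same result
assert fus_buf < ref_buf   # the fused variant buffers fewer intermediates
```

In this toy setting the intermediate storage drops from the full first-layer output to `len(k2)` values, which is the kind of reduction in buffered and transferred data that layer fusion aims for; the actual savings on a processor array depend on the mapping and are not modeled here.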
