Optimal Tiling Strategy for Memory Bandwidth Reduction for CNNs

Convolutional Neural Networks (CNNs) are now present in many embedded solutions. One of the biggest problems in their execution is the memory bottleneck. In this work we propose an optimal double-buffering tiling strategy to reduce the memory bandwidth required to execute deep CNN architectures, and we test our model on one of the two cores of a Zynq®-7020 embedded platform. An optimal tiling strategy is found for each layer of the network by minimizing the bandwidth between external memory and On-Chip Memory (OCM), written external memory \(\rightleftharpoons \) OCM. Performance tests show a 50% improvement in total execution time with the cache disabled (34% with the cache enabled) compared to a non-double-buffered implementation. Moreover, the external memory \(\rightleftharpoons \) OCM bandwidth under double buffering is 5x lower than with naive tiling settings. Furthermore, it is shown that the tiling settings with the highest OCM usage do not generally yield the lowest bandwidth.
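To make the per-layer search concrete, the following is a minimal sketch of how a bandwidth-minimizing tiling could be found by exhaustive search. It is not the cost model of this work: the traffic formula, the stride-1 unpadded convolution, the assumption that every tile reloads its input and weights, and the names `tile_elems`, `tile_traffic`, and `best_tiling` are all illustrative assumptions. The OCM capacity is expressed in elements and must be supplied by the caller.

```python
from itertools import product
from math import ceil


def tile_elems(c_in, k, th, tw, tc):
    """Elements held on chip for one tile (assumed stride 1, no padding)."""
    in_tile = c_in * (th + k - 1) * (tw + k - 1)  # input patch incl. halo
    w_tile = tc * c_in * k * k                    # weights for tc output maps
    out_tile = tc * th * tw                       # output tile
    return in_tile + w_tile + out_tile


def tile_traffic(c_in, c_out, h_out, w_out, k, th, tw, tc):
    """Assumed traffic model: every tile moves all of its data once."""
    n_tiles = ceil(h_out / th) * ceil(w_out / tw) * ceil(c_out / tc)
    return n_tiles * tile_elems(c_in, k, th, tw, tc)


def best_tiling(c_in, c_out, h_out, w_out, k, ocm_elems):
    """Exhaustively search (th, tw, tc) for the lowest modeled traffic.

    Double buffering is modeled by requiring two copies of each tile
    to fit in OCM at the same time. Returns (traffic, (th, tw, tc)),
    or None if no tiling fits.
    """
    best = None
    sizes = product(range(1, h_out + 1), range(1, w_out + 1), range(1, c_out + 1))
    for th, tw, tc in sizes:
        if 2 * tile_elems(c_in, k, th, tw, tc) > ocm_elems:
            continue  # does not fit with double buffering
        traffic = tile_traffic(c_in, c_out, h_out, w_out, k, th, tw, tc)
        if best is None or traffic < best[0]:
            best = (traffic, (th, tw, tc))
    return best


if __name__ == "__main__":
    # Hypothetical small layer: 3 input maps, 8 output maps, 16x16 output, 3x3 kernel.
    print(best_tiling(3, 8, 16, 16, 3, ocm_elems=1000))
    print(tile_traffic(3, 8, 16, 16, 3, 1, 1, 1))  # naive 1x1x1 tiling
```

Comparing the result of `best_tiling` with the traffic of a naive setting such as `(1, 1, 1)` gives a rough sense of how much the choice of tile shape can matter under this simplified model; a realistic model would also account for data reuse across consecutive tiles and for DMA transfer granularity.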
