Memory-centric accelerator design for Convolutional Neural Networks

In the near future, cameras will be used everywhere as flexible sensors for numerous applications. For mobility and privacy reasons, the required image processing should run locally on embedded compute platforms under strict performance and energy constraints. Dedicated acceleration of Convolutional Neural Networks (CNNs) can meet these targets with enough flexibility to perform multiple vision tasks. A challenging problem in the design of efficient accelerators is the limited external memory bandwidth. We show that the effects of this memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns of CNN workloads. The efficiency of the on-chip memories is maximized by our scheduler, which uses loop tiling to optimize for data locality. Our design flow ensures that the on-chip memory size is minimized, which reduces area and energy usage. The design flow is evaluated with a High-Level Synthesis implementation on a Virtex 6 FPGA board. Compared to accelerators with standard scratchpad memories, FPGA resource usage can be reduced by up to 13× while maintaining the same performance. Alternatively, when the same amount of FPGA resources is used, our accelerators are up to 11× faster.
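The core scheduling idea, tiling the convolution loop nest so that each tile's working set fits in on-chip buffers, can be sketched in plain Python. This is an illustrative sketch only: the tile sizes `Ty`/`Tx` and the single-channel layer are assumptions, not the paper's actual schedule, which the scheduler would derive from the on-chip memory budget.

```python
def conv2d_naive(inp, kernel):
    """Reference single-channel valid convolution (no tiling)."""
    H, W, K = len(inp), len(inp[0]), len(kernel)
    OH, OW = H - K + 1, W - K + 1
    out = [[0.0] * OW for _ in range(OH)]
    for y in range(OH):
        for x in range(OW):
            acc = 0.0
            for ky in range(K):
                for kx in range(K):
                    acc += inp[y + ky][x + kx] * kernel[ky][kx]
            out[y][x] = acc
    return out

def conv2d_tiled(inp, kernel, Ty=4, Tx=4):
    """Same computation with the output loops tiled.

    On an accelerator, each (Ty+K-1) x (Tx+K-1) input patch would be
    staged once into an on-chip buffer and reused for the whole output
    tile, cutting external memory traffic. Ty/Tx are hypothetical
    tile sizes chosen here only for illustration.
    """
    H, W, K = len(inp), len(inp[0]), len(kernel)
    OH, OW = H - K + 1, W - K + 1
    out = [[0.0] * OW for _ in range(OH)]
    for ty in range(0, OH, Ty):          # loop over output tiles
        for tx in range(0, OW, Tx):
            for y in range(ty, min(ty + Ty, OH)):    # intra-tile loops
                for x in range(tx, min(tx + Tx, OW)):
                    acc = 0.0
                    for ky in range(K):
                        for kx in range(K):
                            acc += inp[y + ky][x + kx] * kernel[ky][kx]
                    out[y][x] = acc
    return out
```

Tiling reorders the iterations but computes the same values; the payoff is locality: overlapping input rows are reused within a tile while they are still resident on chip, instead of being refetched from external memory.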
