Origami: A Convolutional Network Accelerator

Today advanced computer vision (CV) systems of ever increasing complexity are being deployed in a growing number of application scenarios with strong real-time and power constraints. Current trends in CV clearly show a rise of neural network-based algorithms, which have recently broken many object detection and localization records. These approaches are very flexible and can be used to tackle many different challenges by only changing their parameters. In this paper, we present the first convolutional network accelerator which is scalable to network sizes that are currently only handled by workstation GPUs, but remains within the power envelope of embedded systems. The architecture has been implemented on 3.09 mm2 core area in UMC 65 nm technology, capable of a throughput of 274 GOp/s at 369 GOp/s/W with an external memory bandwidth of just 525 MB/s full-duplex " a decrease of more than 90% from previous work.

[1]  Farnam Jahanian,et al.  Internet inter-domain traffic , 2010, SIGCOMM '10.

[2]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Clément Farabet,et al.  Torch7: A Matlab-like Environment for Machine Learning , 2011, NIPS 2011.

[4]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[5]  E. Culurciello,et al.  NeuFlow: Dataflow vision processing system-on-a-chip , 2012, 2012 IEEE 55th International Midwest Symposium on Circuits and Systems (MWSCAS).

[6]  LabovitzCraig,et al.  Internet inter-domain traffic , 2010 .

[7]  Berin Martini,et al.  A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[8]  Yann LeCun,et al.  CNP: An FPGA-based processor for Convolutional Networks , 2009, 2009 International Conference on Field Programmable Logic and Applications.

[9]  John Tran,et al.  cuDNN: Efficient Primitives for Deep Learning , 2014, ArXiv.

[10]  Stephen Gould,et al.  Decomposing a scene into geometric and semantically consistent regions , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[11]  Berin Martini,et al.  NeuFlow: A runtime reconfigurable dataflow processor for vision , 2011, CVPR 2011 WORKSHOPS.

[12]  Sharath Pankanti,et al.  Video surveillance: past, present, and now the future [DSP Forum] , 2013, IEEE Signal Processing Magazine.

[13]  Ming Yang,et al.  DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Christophe Bobda,et al.  Distributed Embedded Smart Cameras: Architectures, Design and Applications , 2014 .

[16]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[17]  Luca Benini,et al.  DRAM or no-DRAM? Exploring linear solver architectures for image domain warping in 28 nm CMOS , 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[18]  Michele Magno,et al.  Accelerating real-time embedded scene labeling with convolutional networks , 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[19]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[20]  Luca Benini,et al.  A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters , 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[21]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.