SCALENet: A SCalable Low power AccELerator for Real-time Embedded Deep Neural Networks

As deep learning networks mature and improve classification performance, a significant challenge is their deployment in embedded settings. Modern network topologies, such as convolutional neural networks, can be very deep and impose considerable complexity that is often not feasible in resource-bound, real-time systems. Processing these networks requires high levels of parallelization and data throughput, along with support for different network types, while minimizing power and resource consumption. In response to these requirements, this paper presents a low-power FPGA-based neural network accelerator named SCALENet: a SCalable Low power AccELerator for real-time deep neural Networks. Key features include power optimization through coarse- and fine-grained scheduling, implementation flexibility via hardware-only or hardware/software co-design, and acceleration of both fully connected and convolutional layers. The experimental results evaluate SCALENet on two different neural network applications: image processing and biomedical seizure detection. For image processing, eight networks trained on the CIFAR-10 and ImageNet datasets are implemented on the Arty A7 and ZedBoard FPGA platforms. The largest improvement comes from the Inception network on ImageNet, with a 22x increase in throughput and a 13x decrease in energy consumption compared to an ARM processor implementation. We then implement SCALENet for time-series EEG seizure detection using both a direct convolution and an FFT convolution method to show its design versatility, achieving a 99.7% reduction in execution time and a 97.9% improvement in energy consumption compared to the ARM. Finally, we demonstrate the ability to match or exceed the energy efficiency of NVIDIA GPUs when evaluated against a Jetson TK1 with an embedded GPU System on Chip (SoC), with a 4x power savings within a power envelope of 2.07 W.
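To make the two convolution strategies mentioned in the abstract concrete, the following is a minimal NumPy sketch, not the authors' FPGA implementation, contrasting direct (sliding-window) convolution with FFT-based convolution on a 1-D signal such as a single EEG channel. The function names `direct_conv1d` and `fft_conv1d`, the signal length, and the filter size are illustrative assumptions; only the algorithmic contrast (O(N·K) multiply-accumulates versus O(N log N) frequency-domain products) reflects the paper's Direct vs. FFT comparison.

```python
import numpy as np

def direct_conv1d(signal, kernel):
    """Direct convolution: O(N*K) multiply-accumulates, 'valid' output."""
    n, k = len(signal), len(kernel)
    out = np.empty(n - k + 1)
    for i in range(n - k + 1):
        # Convolution flips the kernel before the dot product.
        out[i] = np.dot(signal[i:i + k], kernel[::-1])
    return out

def fft_conv1d(signal, kernel):
    """FFT convolution: O(N log N) via pointwise product in the frequency domain."""
    full_len = len(signal) + len(kernel) - 1
    size = 1 << (full_len - 1).bit_length()        # pad to next power of two
    spec = np.fft.rfft(signal, size) * np.fft.rfft(kernel, size)
    full = np.fft.irfft(spec, size)[:full_len]     # 'full' linear convolution
    return full[len(kernel) - 1:len(signal)]       # crop to the 'valid' region

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eeg = rng.standard_normal(1024)   # stand-in for one EEG channel window
    taps = rng.standard_normal(64)    # stand-in for a learned 1-D filter
    assert np.allclose(direct_conv1d(eeg, taps), fft_conv1d(eeg, taps))
    print("direct and FFT convolution agree")
```

Both routines produce the same output; the trade-off is that FFT convolution amortizes well for long signals and large filters, while direct convolution avoids transform overhead for short filters, which is why an accelerator benefits from supporting both.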
