NTX: An Energy-efficient Streaming Accelerator for Floating-point Generalized Reduction Workloads in 22 nm FD-SOI

Specialized coprocessors for Multiply-Accumulate (MAC) intensive workloads such as Deep Learning are becoming widespread in SoC platforms, from GPUs to mobile SoCs. In this paper we revisit NTX (an efficient accelerator developed for training Deep Neural Networks at scale) as a generalized MAC and reduction streaming engine. The architecture consists of a set of 32 bit floating-point streaming co-processors that are loosely coupled to a RISC-V core in charge of orchestrating data movement and computation. Post-layout results of a recent silicon implementation in 22 nm FD-SOI technology show the accelerator’s capability to deliver up to 20 Gflop/s at 1.25 GHz and 168 mW. Based on these results we show that a version of NTX scaled down to 14 nm can achieve a 3× energy efficiency improvement over contemporary GPUs at 10.4× less silicon area, and a compute performance of 1.4 Tflop/s for training large state-of-the-art networks with full floating-point precision. An extended evaluation of MAC-intensive kernels shows that NTX can consistently achieve up to 87% of its peak performance across general reduction workloads beyond machine learning. Its modular architecture enables deployment at different scales ranging from high-performance GPU-class to low-power embedded scenarios.

[1]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Michele Magno,et al.  Accelerating real-time embedded scene labeling with convolutional networks , 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[3]  Luca Benini,et al.  Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes , 2017, IEEE Transactions on Parallel and Distributed Systems.

[4]  Tianshi Chen,et al.  ShiDianNao: Shifting vision processing closer to the sensor , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[5]  Luca Benini,et al.  A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets , 2018, IEEE Transactions on Computers.

[6]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Sudhakar Yalamanchili,et al.  Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[8]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[9]  Torsten Hoefler,et al.  MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures , 2015, ICS.

[10]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[11]  Hyesoon Kim,et al.  An integrated GPU power and performance model , 2010, ISCA.

[12]  Christoforos E. Kozyrakis,et al.  TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory , 2017, ASPLOS.

[13]  Luca Benini,et al.  Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices , 2016, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[14]  Vivienne Sze,et al.  Efficient Processing of Deep Neural Networks: A Tutorial and Survey , 2017, Proceedings of the IEEE.

[15]  Eitan Grinspun,et al.  Discrete laplace operators: no free lunch , 2007, Symposium on Geometry Processing.

[16]  Gary L. Miller,et al.  Combinatorial preconditioners and multilevel solvers for problems in computer vision and image processing , 2009, Comput. Vis. Image Underst..

[17]  Richard Szeliski,et al.  Multigrid and multilevel preconditioners for computational photography , 2011, ACM Trans. Graph..

[18]  Tianshi Chen,et al.  DaDianNao: A Neural Network Supercomputer , 2017, IEEE Transactions on Computers.

[19]  Samuel Williams,et al.  Hardware/software co-design for energy-efficient seismic modeling , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[20]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Daniel Cremers,et al.  Dense visual SLAM for RGB-D cameras , 2013, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[22]  Pradeep Dubey,et al.  SCALEDEEP: A scalable compute architecture for learning and evaluating deep networks , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).