DianNao family

Machine learning (ML) tasks are becoming pervasive in a broad range of applications and in a broad range of systems, from embedded devices to data centers. As computer architectures evolve toward heterogeneous multi-cores composed of a mix of cores and hardware accelerators, designing hardware accelerators for ML techniques can simultaneously achieve high efficiency and broad application scope. While efficient computational primitives are important for a hardware accelerator, inefficient memory transfers can negate the throughput, energy, or cost advantages of acceleration, an Amdahl's law effect; memory should therefore be a first-order design concern, just as in processors, rather than an element factored into accelerator design as an afterthought. In this article, we introduce a series of hardware accelerators (the DianNao family) designed for ML, especially neural networks, with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that, on a number of representative neural network layers, a 64-chip DaDianNao system (a member of the DianNao family) achieves an average speedup of 450.65x over a GPU while reducing energy by 150.31x.
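To make the Amdahl's law effect concrete, here is a worked illustration with assumed numbers (not measurements from the paper): suppose an accelerator speeds up the compute portion of a layer by a factor s, while the fraction 1 - p of runtime spent on memory transfers is left untouched. The overall speedup is then bounded as

\[
S_{\text{overall}} \;=\; \frac{1}{(1 - p) + \dfrac{p}{s}},
\qquad
\lim_{s \to \infty} S_{\text{overall}} \;=\; \frac{1}{1 - p}.
\]

For instance, if we assume memory transfers account for 20% of runtime (p = 0.8), even an infinitely fast datapath caps the overall speedup at 5x. This is why the DianNao family treats memory transfers as a first-order concern of accelerator design rather than a secondary optimization.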
