DianNao family

Machine learning (ML) tasks are becoming pervasive in a broad range of applications and in a broad range of systems, from embedded devices to data centers. As computer architectures evolve toward heterogeneous multi-cores composed of a mix of cores and hardware accelerators, designing hardware accelerators for ML techniques can simultaneously achieve high efficiency and broad application scope. While efficient computational primitives are important for a hardware accelerator, inefficient memory transfers can negate the throughput, energy, or cost advantages of acceleration, an Amdahl's law effect; memory should therefore be a first-order design concern, just as in processors, rather than an element factored into accelerator design as an afterthought. In this article, we introduce a series of hardware accelerators (the DianNao family) designed for ML, especially neural networks, with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that, on a number of representative neural network layers, a 64-chip DaDianNao system (a member of the DianNao family) achieves an average speedup of 450.65x over a GPU while reducing energy by 150.31x.
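To make the Amdahl's law effect concrete, here is a worked illustration with assumed numbers (not measurements from the paper): suppose an accelerator speeds up the compute portion of a layer by a factor s, while the fraction 1 - p of runtime spent on memory transfers is left untouched. The overall speedup is then bounded as

\[
S_{\text{overall}} \;=\; \frac{1}{(1 - p) + \dfrac{p}{s}},
\qquad
\lim_{s \to \infty} S_{\text{overall}} \;=\; \frac{1}{1 - p}.
\]

For instance, if we assume memory transfers account for 20% of runtime (p = 0.8), even an infinitely fast datapath caps the overall speedup at 5x. This is why the DianNao family treats memory transfers as a first-order concern of accelerator design rather than a secondary optimization.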
