Paper Information - DaDianNao: A Machine-Learning Supercomputer

Many companies are deploying services, for consumers or for industry, that rely largely on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular of these algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have recently been proposed that offer a high ratio of computational capacity to area, but they remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the memory footprint of CNNs and DNNs, while large, is not beyond the capacity of the on-chip storage of a multi-chip system. This property, combined with the algorithmic characteristics of CNNs and DNNs, can lead to high internal bandwidth and low external communication, which in turn can enable a high degree of parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU and to reduce energy by 150.31x on average for a 64-chip system. We implement the node down to place and route at 28nm; it contains a combination of custom storage and computational units, with industry-grade interconnects.
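The abstract's central observation, that a layer's weights can fit in the combined on-chip storage of several chips, can be illustrated with a back-of-the-envelope check. The sketch below is only an illustration: the layer size, the 2-byte weight width, and the 8 MiB per-chip capacity are hypothetical values chosen for the arithmetic, not figures taken from the paper, and the function names are invented for this example.

```python
MIB = 1024 * 1024


def layer_weight_bytes(n_inputs: int, n_outputs: int, bytes_per_weight: int = 2) -> int:
    """Storage needed for the weights of one fully connected layer.

    Assumes one weight per (input, output) pair and a fixed weight width
    (2 bytes here, a hypothetical choice).
    """
    return n_inputs * n_outputs * bytes_per_weight


def fits_on_chip(weight_bytes: int, n_chips: int, bytes_per_chip: int) -> bool:
    """True if the weights fit in the aggregate on-chip storage of n_chips."""
    return weight_bytes <= n_chips * bytes_per_chip


# Hypothetical layer: 4096 inputs fully connected to 4096 outputs.
layer = layer_weight_bytes(n_inputs=4096, n_outputs=4096)

# Hypothetical per-chip storage of 8 MiB.
per_chip = 8 * MIB

print(layer // MIB)                      # weight footprint in MiB
print(fits_on_chip(layer, 1, per_chip))  # a single chip
print(fits_on_chip(layer, 4, per_chip))  # four chips combined
```

With these assumed numbers the layer needs 32 MiB, which exceeds one chip's storage but fits exactly across four chips, so the weights would not need to be fetched from external memory during computation.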
