DaDianNao: A Machine-Learning Supercomputer

Many companies are deploying services, for both consumers and industry, that are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The current state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have recently been proposed that offer a high computational capacity/area ratio but remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the memory footprint of CNNs and DNNs, while large, is not beyond the capability of the on-chip storage of a multi-chip system. This property, combined with the algorithmic characteristics of CNNs and DNNs, can lead to high internal bandwidth and low external communication, which in turn enables a high degree of parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU and to reduce energy by 150.31x on average for a 64-chip system. We implement the node down to place and route at 28 nm; it contains a combination of custom storage and computational units, with industry-grade interconnects.
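As a rough illustration of the on-chip storage argument, the sketch below compares the synaptic footprint of a hypothetical large fully connected layer with the aggregate eDRAM of a multi-chip node. The weight width, per-chip eDRAM capacity, and layer dimensions are illustrative assumptions, not figures taken from the article; only the 64-chip system size comes from the evaluation described above.

```python
# Back-of-the-envelope check of the key premise: the synapses of even very
# large layers can fit in the aggregate on-chip eDRAM of a multi-chip node.
# All constants below are assumptions for illustration, not paper values.

BYTES_PER_SYNAPSE = 2      # assume 16-bit fixed-point weights
EDRAM_PER_CHIP_MB = 36     # assumed per-chip eDRAM capacity
CHIPS = 64                 # system size evaluated in the article

def fc_layer_footprint_mb(num_inputs: int, num_outputs: int) -> float:
    """Synaptic storage of a fully connected layer, in MB."""
    return num_inputs * num_outputs * BYTES_PER_SYNAPSE / 2**20

# A hypothetical layer on the order of the largest layers reported in the
# deep-learning literature of the time.
footprint = fc_layer_footprint_mb(num_inputs=20_000, num_outputs=20_000)
capacity = EDRAM_PER_CHIP_MB * CHIPS

print(f"layer synapses: {footprint:.0f} MB, on-chip capacity: {capacity} MB")
# ~763 MB of synapses vs. 2304 MB of distributed eDRAM: under these
# assumptions the weights stay on chip, and only neuron activations need to
# cross chip boundaries, keeping external communication low.
```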
