DaDianNao: A Neural Network Supercomputer

Many companies are deploying services largely based on machine-learning algorithms for the sophisticated processing of large amounts of data, for consumers or industry. The most popular and best-performing of these algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be computationally and memory intensive. A number of neural network accelerators have recently been proposed; they offer high computational capacity per unit area but remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the memory footprint of CNNs and DNNs, while large, is not beyond the capacity of the on-chip storage of a multi-chip system. This property, combined with the algorithmic characteristics of CNNs and DNNs, yields high internal bandwidth and low external communication, which in turn enables a high degree of parallelism at reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines and evaluate its performance with electrical and optical inter-chip interconnects separately. We show that, on a subset of the largest known neural network layers, a 64-chip system can achieve a speedup of 656.63× over a GPU and reduce energy by 184.05× on average. We implement the node, which combines custom storage and computational units, down to place and route at 28 nm, with electrical inter-chip interconnects.
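The key enabler above is that synaptic weights stay resident in on-chip storage distributed across the chips, so only neuron values ever cross chip boundaries. The sketch below is a minimal back-of-envelope model of that property for a single fully connected layer; the 4096-neuron layer size, the 16-bit data width, and the layer_traffic helper are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope sketch (illustrative, not from the paper): why keeping synapses
# in distributed on-chip storage leaves only a small amount of inter-chip traffic.

def layer_traffic(num_inputs, num_outputs, num_chips, bytes_per_value=2):
    """Storage and traffic for one fully connected layer split across chips.

    Assumption: each chip holds the synapses for its share of the output
    neurons, so weights never leave the chip; only neuron values move.
    """
    synapse_bytes = num_inputs * num_outputs * bytes_per_value      # stays on chip
    per_chip_synapse_bytes = synapse_bytes / num_chips
    # Input neuron values are broadcast to the other chips; each chip then
    # contributes its slice of the output neurons.
    inter_chip_bytes = (num_inputs * (num_chips - 1) + num_outputs) * bytes_per_value
    return per_chip_synapse_bytes, inter_chip_bytes

if __name__ == "__main__":
    # Hypothetical large layer: 4096 -> 4096 neurons on a 64-chip system.
    per_chip, external = layer_traffic(4096, 4096, 64)
    print(f"synapses per chip: {per_chip / 1e6:.2f} MB (read at on-chip bandwidth)")
    print(f"total inter-chip traffic: {external / 1e6:.2f} MB per inference")
```

Under these assumptions, tens of megabytes of weights are read locally at on-chip bandwidth, while well under a megabyte of neuron values crosses the chip boundaries, which is the high-internal-bandwidth, low-external-communication regime the abstract describes.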
