DaDianNao: A Neural Network Supercomputer

Many companies are deploying services largely based on machine-learning algorithms for the sophisticated processing of large amounts of data, for consumers or industry. The most popular and best-performing of these algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be computationally and memory intensive. A number of neural network accelerators have recently been proposed; they offer high computational capacity per unit area but remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the memory footprint of CNNs and DNNs, while large, is not beyond the capacity of the on-chip storage of a multi-chip system. This property, combined with the algorithmic characteristics of CNNs and DNNs, yields high internal bandwidth and low external communication, which in turn enables a high degree of parallelism at reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines and evaluate its performance with electrical and optical inter-chip interconnects separately. We show that, on a subset of the largest known neural network layers, a 64-chip system can achieve a speedup of 656.63× over a GPU and reduce energy by 184.05× on average. We implement the node, which combines custom storage and computational units, down to place and route at 28 nm, with electrical inter-chip interconnects.
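The key enabler above is that synaptic weights stay resident in on-chip storage distributed across the chips, so only neuron values ever cross chip boundaries. The sketch below is a minimal back-of-envelope model of that property for a single fully connected layer; the 4096-neuron layer size, the 16-bit data width, and the layer_traffic helper are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope sketch (illustrative, not from the paper): why keeping synapses
# in distributed on-chip storage leaves only a small amount of inter-chip traffic.

def layer_traffic(num_inputs, num_outputs, num_chips, bytes_per_value=2):
    """Storage and traffic for one fully connected layer split across chips.

    Assumption: each chip holds the synapses for its share of the output
    neurons, so weights never leave the chip; only neuron values move.
    """
    synapse_bytes = num_inputs * num_outputs * bytes_per_value      # stays on chip
    per_chip_synapse_bytes = synapse_bytes / num_chips
    # Input neuron values are broadcast to the other chips; each chip then
    # contributes its slice of the output neurons.
    inter_chip_bytes = (num_inputs * (num_chips - 1) + num_outputs) * bytes_per_value
    return per_chip_synapse_bytes, inter_chip_bytes

if __name__ == "__main__":
    # Hypothetical large layer: 4096 -> 4096 neurons on a 64-chip system.
    per_chip, external = layer_traffic(4096, 4096, 64)
    print(f"synapses per chip: {per_chip / 1e6:.2f} MB (read at on-chip bandwidth)")
    print(f"total inter-chip traffic: {external / 1e6:.2f} MB per inference")
```

Under these assumptions, tens of megabytes of weights are read locally at on-chip bandwidth, while well under a megabyte of neuron values crosses the chip boundaries, which is the high-internal-bandwidth, low-external-communication regime the abstract describes.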
