Cambricon: An Instruction Set Architecture for Neural Networks

Neural Networks (NN) are a family of models for a broad range of emerging machine learning and pattern recognition applications. NN techniques are conventionally executed on general-purpose processors (such as CPUs and GPGPUs), which are usually not energy-efficient because they invest excessive hardware resources to flexibly support diverse workloads. Consequently, application-specific hardware accelerators for neural networks have recently been proposed to improve energy efficiency. However, such accelerators were designed for a small set of NN techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) that directly correspond to high-level functional blocks of an NN (such as layers), or even to an NN as a whole. Although straightforward and easy to implement for a limited set of similar NN techniques, this lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different NN techniques with sufficient flexibility and efficiency. In this paper, we propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. Our evaluation over ten representative yet distinct NN techniques demonstrates that Cambricon has strong descriptive capacity across a broad range of NN techniques and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU. Compared to the state-of-the-art NN accelerator design DaDianNao [18] (which can only accommodate 3 types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency, power, and area overheads while covering 10 different NN benchmarks.
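To make the code-density argument concrete, below is a minimal Python/NumPy sketch, not the paper's actual assembly syntax: it models assumed semantics for three Cambricon-style instructions (the MMV and VAV mnemonics follow the paper's matrix/vector instruction categories; the fused vector sigmoid and all operand conventions are this sketch's assumptions) and shows how a fully connected layer collapses into a handful of matrix/vector instructions instead of the long scalar-instruction sequences a general-purpose ISA would emit.

    # Minimal sketch (hypothetical semantics, not the paper's syntax):
    # each function stands in for one Cambricon-style instruction.
    import numpy as np

    def MMV(W, x):       # Matrix-Mult-Vector: y = W x in one instruction
        return W @ x

    def VAV(a, b):       # Vector-Add-Vector: element-wise addition
        return a + b

    def VSIGMOID(v):     # assumed fused vector sigmoid (element-wise)
        return 1.0 / (1.0 + np.exp(-v))

    # A sigmoid-activated fully connected layer, y = sigmoid(W x + b),
    # takes three vector/matrix "instructions" here, versus the nested
    # scalar loops (O(m*n) instructions) a scalar ISA would require.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 128))   # weights
    x = rng.standard_normal(128)         # input neurons
    b = rng.standard_normal(64)          # biases
    y = VSIGMOID(VAV(MMV(W, x), b))      # output neurons

In the actual design, such instructions operate on data held in on-chip scratchpad memory rather than NumPy arrays, but the instruction-count contrast is the same.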

[1] Yu Tsao, et al. Robust anchorperson detection based on audio streams using a hybrid I-vector and DNN system, 2014, Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA).

[2] Tao Wang, et al. Deep learning with COTS HPC systems, 2013, ICML.

[3] Yu Wang, et al. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory, 2016, ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[4] Olivier Temam, et al. A defect-tolerant accelerator for emerging high-performance applications, 2012, 39th Annual International Symposium on Computer Architecture (ISCA).

[5] Tao Zhang, et al. Overcoming the challenges of crossbar resistive memory architectures, 2015, IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[6] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Communications of the ACM.

[7] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proceedings of the IEEE.

[8] Yu Wang, et al. Energy Efficient RRAM Spiking Neural Network for Real Time Classification, 2015, ACM Great Lakes Symposium on VLSI.

[9] Geoffrey E. Hinton, et al. Application of Deep Belief Networks for Natural Language Understanding, 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[10] Luis Ceze, et al. Neural Acceleration for General-Purpose Approximate Programs, 2012, 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[11] Xuehai Zhou, et al. PuDianNao: A Polyvalent Machine Learning Accelerator, 2015, ASPLOS.

[12] Vitit Kantabutra. On Hardware for Computing Exponential and Trigonometric Functions, 1996, IEEE Transactions on Computers.

[13] G. Marsaglia, et al. The Ziggurat Method for Generating Random Variables, 2000, Journal of Statistical Software.

[14] Andrew S. Cassidy, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface, 2014, Science.

[15] Srihari Cadambi, et al. A Massively Parallel Coprocessor for Convolutional Neural Networks, 2009, 20th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP).

[16] Yann LeCun, et al. Traffic sign recognition with multi-scale Convolutional Networks, 2011, International Joint Conference on Neural Networks (IJCNN).

[17] Vincent Vanhoucke, et al. Improving the speed of neural networks on CPUs, 2011.

[18] Jia Wang, et al. DaDianNao: A Machine-Learning Supercomputer, 2014, 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[19] Jian Sun, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015, IEEE International Conference on Computer Vision (ICCV).

[20] J. Schmidhuber, et al. Framewise phoneme classification with bidirectional LSTM networks, 2005, IEEE International Joint Conference on Neural Networks (IJCNN).

[21] Tara N. Sainath, et al. Improving deep neural networks for LVCSR using rectified linear units and dropout, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22] Ninghui Sun, et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning, 2014, ASPLOS.

[23] Narayanan Vijaykrishnan, et al. Accelerating neuromorphic vision algorithms for recognition, 2012, Design Automation Conference (DAC).

[24] Srihari Cadambi, et al. A dynamically configurable coprocessor for convolutional neural networks, 2010, ISCA.

[25] Carlo H. Séquin, et al. RISC I: a reduced instruction set VLSI computer, 1981, ISCA.

[26] Babak Nadjar Araabi, et al. Neural network stream processing core (NnSP) for embedded systems, 2006, IEEE International Symposium on Circuits and Systems (ISCAS).

[27] Geoffrey E. Hinton, et al. An Efficient Learning Procedure for Deep Boltzmann Machines, 2012, Neural Computation.

[28] Berin Martini, et al. NeuFlow: A runtime reconfigurable dataflow processor for vision, 2011, CVPR Workshops.

[29] Victor Eijkhout, et al. Introduction to High Performance Scientific Computing, 2015.

[30] Berin Martini, et al. A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks, 2014, IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[31] Yann LeCun, et al. CNP: An FPGA-based processor for Convolutional Networks, 2009, International Conference on Field Programmable Logic and Applications (FPL).

[32] Jia Wang, et al. A High-Throughput Neural Network Accelerator, 2015, IEEE Micro.

[33] Trevor Hastie, et al. An Introduction to Statistical Learning, 2013, Springer Texts in Statistics.

[34] Yuan Xie, et al. Optimizing GPU energy efficiency with 3D die-stacking graphics memory and reconfigurable memory interface, 2013, ACM Transactions on Architecture and Code Optimization (TACO).

[35] Dumitru Erhan, et al. Going deeper with convolutions, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Yann LeCun, et al. What is the best multi-stage architecture for object recognition?, 2009, IEEE 12th International Conference on Computer Vision (ICCV).

[37] F. J. Pineda. Generalization of back-propagation to recurrent neural networks, 1987, Physical Review Letters.

[38] M. A. Motter. Control of the NASA Langley 16-foot transonic tunnel with the self-organizing map, 1999, Proceedings of the 1999 American Control Conference.

[39] Henk Corporaal, et al. Memory-centric accelerator design for Convolutional Neural Networks, 2013, IEEE 31st International Conference on Computer Design (ICCD).

[40] Marc'Aurelio Ranzato, et al. Building high-level features using large scale unsupervised learning, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[41] Jiao Wang, et al. Auto-Associative Neural Network System for Recognition, 2007, International Conference on Machine Learning and Cybernetics.

[42] Geoffrey E. Hinton, et al. Application of Deep Belief Networks for natural language understanding, 2014.

[43] Yanli Li, et al. Improved SOM based data mining of seasonal flu in mainland China, 2012, 8th International Conference on Natural Computation (ICNC).

[44] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[45] Zhengyou Zhang, et al. Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multi-layer perceptron, 1998, Third IEEE International Conference on Automatic Face and Gesture Recognition.

[46] C. S. Oliveira, et al. Forms of adapting patterns to Hopfield neural networks with larger number of nodes and higher storage capacity, 2004, IEEE International Joint Conference on Neural Networks (IJCNN).

[47] Larry P. Heck, et al. Learning deep structured semantic models for web search using clickthrough data, 2013, CIKM.

[48] Yoshua Bengio, et al. An empirical evaluation of deep architectures on problems with many factors of variation, 2007, ICML.

[49] Mikko H. Lipasti, et al. A case for neuromorphic ISAs, 2011, ASPLOS.