Cambricon: An Instruction Set Architecture for Neural Networks

Neural Networks (NN) are a family of models for a broad range of emerging machine learning and pattern recognition applications. NN techniques are conventionally executed on general-purpose processors (such as CPUs and GPGPUs), which are usually not energy-efficient because they invest excessive hardware resources to flexibly support diverse workloads. Consequently, application-specific hardware accelerators for neural networks have recently been proposed to improve energy efficiency. However, such accelerators were designed for a small set of NN techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) that directly correspond to high-level functional blocks of an NN (such as layers), or even to an NN as a whole. Although straightforward and easy to implement for a limited set of similar NN techniques, this lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different NN techniques with sufficient flexibility and efficiency. In this paper, we propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. Our evaluation over ten representative yet distinct NN techniques demonstrates that Cambricon has strong descriptive capacity across a broad range of NN techniques and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU. Compared to the state-of-the-art NN accelerator design DaDianNao [18] (which can only accommodate 3 types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency, power, and area overheads while covering 10 different NN benchmarks.
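To make the code-density argument concrete, below is a minimal Python/NumPy sketch, not the paper's actual assembly syntax: it models assumed semantics for three Cambricon-style instructions (the MMV and VAV mnemonics follow the paper's matrix/vector instruction categories; the fused vector sigmoid and all operand conventions are this sketch's assumptions) and shows how a fully connected layer collapses into a handful of matrix/vector instructions instead of the long scalar-instruction sequences a general-purpose ISA would emit.

    # Minimal sketch (hypothetical semantics, not the paper's syntax):
    # each function stands in for one Cambricon-style instruction.
    import numpy as np

    def MMV(W, x):       # Matrix-Mult-Vector: y = W x in one instruction
        return W @ x

    def VAV(a, b):       # Vector-Add-Vector: element-wise addition
        return a + b

    def VSIGMOID(v):     # assumed fused vector sigmoid (element-wise)
        return 1.0 / (1.0 + np.exp(-v))

    # A sigmoid-activated fully connected layer, y = sigmoid(W x + b),
    # takes three vector/matrix "instructions" here, versus the nested
    # scalar loops (O(m*n) instructions) a scalar ISA would require.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 128))   # weights
    x = rng.standard_normal(128)         # input neurons
    b = rng.standard_normal(64)          # biases
    y = VSIGMOID(VAV(MMV(W, x), b))      # output neurons

In the actual design, such instructions operate on data held in on-chip scratchpad memory rather than NumPy arrays, but the instruction-count contrast is the same.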

[1] Yu Tsao, et al. Robust anchorperson detection based on audio streams using a hybrid I-vector and DNN system, 2014, Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA).

[2] Tao Wang, et al. Deep learning with COTS HPC systems, 2013, ICML.

[3] Yu Wang, et al. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory, 2016, ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[4] Olivier Temam, et al. A defect-tolerant accelerator for emerging high-performance applications, 2012, 39th Annual International Symposium on Computer Architecture (ISCA).

[5] Tao Zhang, et al. Overcoming the challenges of crossbar resistive memory architectures, 2015, IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[6] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Communications of the ACM.

[7] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proceedings of the IEEE.

[8] Yu Wang, et al. Energy Efficient RRAM Spiking Neural Network for Real Time Classification, 2015, ACM Great Lakes Symposium on VLSI.

[9] Geoffrey E. Hinton, et al. Application of Deep Belief Networks for Natural Language Understanding, 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[10] Luis Ceze, et al. Neural Acceleration for General-Purpose Approximate Programs, 2012, 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[11] Xuehai Zhou, et al. PuDianNao: A Polyvalent Machine Learning Accelerator, 2015, ASPLOS.

[12] Vitit Kantabutra. On Hardware for Computing Exponential and Trigonometric Functions, 1996, IEEE Transactions on Computers.

[13] G. Marsaglia, et al. The Ziggurat Method for Generating Random Variables, 2000, Journal of Statistical Software.

[14] Andrew S. Cassidy, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface, 2014, Science.

[15] Srihari Cadambi, et al. A Massively Parallel Coprocessor for Convolutional Neural Networks, 2009, 20th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP).

[16] Yann LeCun, et al. Traffic sign recognition with multi-scale Convolutional Networks, 2011, International Joint Conference on Neural Networks (IJCNN).

[17] Vincent Vanhoucke, et al. Improving the speed of neural networks on CPUs, 2011.

[18] Jia Wang, et al. DaDianNao: A Machine-Learning Supercomputer, 2014, 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[19] Jian Sun, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015, IEEE International Conference on Computer Vision (ICCV).

[20] J. Schmidhuber, et al. Framewise phoneme classification with bidirectional LSTM networks, 2005, IEEE International Joint Conference on Neural Networks (IJCNN).

[21] Tara N. Sainath, et al. Improving deep neural networks for LVCSR using rectified linear units and dropout, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22] Ninghui Sun, et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning, 2014, ASPLOS.

[23] Narayanan Vijaykrishnan, et al. Accelerating neuromorphic vision algorithms for recognition, 2012, Design Automation Conference (DAC).

[24] Srihari Cadambi, et al. A dynamically configurable coprocessor for convolutional neural networks, 2010, ISCA.

[25] Carlo H. Séquin, et al. RISC I: a reduced instruction set VLSI computer, 1981, ISCA.

[26] Babak Nadjar Araabi, et al. Neural network stream processing core (NnSP) for embedded systems, 2006, IEEE International Symposium on Circuits and Systems (ISCAS).

[27] Geoffrey E. Hinton, et al. An Efficient Learning Procedure for Deep Boltzmann Machines, 2012, Neural Computation.

[28] Berin Martini, et al. NeuFlow: A runtime reconfigurable dataflow processor for vision, 2011, CVPR Workshops.

[29] Victor Eijkhout, et al. Introduction to High Performance Scientific Computing, 2015.

[30] Berin Martini, et al. A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks, 2014, IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[31] Yann LeCun, et al. CNP: An FPGA-based processor for Convolutional Networks, 2009, International Conference on Field Programmable Logic and Applications (FPL).

[32] Jia Wang, et al. A High-Throughput Neural Network Accelerator, 2015, IEEE Micro.

[33] Trevor Hastie, et al. An Introduction to Statistical Learning, 2013, Springer Texts in Statistics.

[34] Yuan Xie, et al. Optimizing GPU energy efficiency with 3D die-stacking graphics memory and reconfigurable memory interface, 2013, ACM Transactions on Architecture and Code Optimization (TACO).

[35] Dumitru Erhan, et al. Going deeper with convolutions, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Yann LeCun, et al. What is the best multi-stage architecture for object recognition?, 2009, IEEE 12th International Conference on Computer Vision (ICCV).

[37] F. J. Pineda. Generalization of back-propagation to recurrent neural networks, 1987, Physical Review Letters.

[38] M. A. Motter. Control of the NASA Langley 16-foot transonic tunnel with the self-organizing map, 1999, Proceedings of the 1999 American Control Conference.

[39] Henk Corporaal, et al. Memory-centric accelerator design for Convolutional Neural Networks, 2013, IEEE 31st International Conference on Computer Design (ICCD).

[40] Marc'Aurelio Ranzato, et al. Building high-level features using large scale unsupervised learning, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[41] Jiao Wang, et al. Auto-Associative Neural Network System for Recognition, 2007, International Conference on Machine Learning and Cybernetics.

[42] Geoffrey E. Hinton, et al. Application of Deep Belief Networks for natural language understanding, 2014.

[43] Yanli Li, et al. Improved SOM based data mining of seasonal flu in mainland China, 2012, 8th International Conference on Natural Computation (ICNC).

[44] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[45] Zhengyou Zhang, et al. Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multi-layer perceptron, 1998, Third IEEE International Conference on Automatic Face and Gesture Recognition.

[46] C. S. Oliveira, et al. Forms of adapting patterns to Hopfield neural networks with larger number of nodes and higher storage capacity, 2004, IEEE International Joint Conference on Neural Networks (IJCNN).

[47] Larry P. Heck, et al. Learning deep structured semantic models for web search using clickthrough data, 2013, CIKM.

[48] Yoshua Bengio, et al. An empirical evaluation of deep architectures on problems with many factors of variation, 2007, ICML.

[49] Mikko H. Lipasti, et al. A case for neuromorphic ISAs, 2011, ASPLOS.