Flexible Multiple-Precision Fused Arithmetic Units for Efficient Deep Learning Computation

Deep learning has achieved great success in recent years. In many application fields, such as computer vision, biomedical analysis, and natural language processing, deep learning can achieve performance that even exceeds human level. Behind this superior performance, however, lies the expensive hardware cost required to implement deep learning operations, which are both computation intensive and memory intensive. Many works in the literature have focused on improving the efficiency of deep learning operations. This thesis focuses on improving deep learning computation and proposes several efficient arithmetic unit architectures optimized for it. The contents of this thesis can be divided into three parts: (1) the optimization of general-purpose arithmetic units for deep learning computation; (2) the design of deep-learning-specific arithmetic units; and (3) the optimization of deep learning computation using 3D memory architecture.

Deep learning models are usually trained on graphics processing units (GPUs), with the computations performed on single-precision floating-point numbers. However, recent works have shown that deep learning computation can be accomplished with low-precision numbers. Half-precision numbers are becoming increasingly popular in deep learning computation because of their lower hardware cost compared to single-precision numbers. Conventional floating-point arithmetic units support single precision and beyond to achieve better accuracy. For deep learning computation, however, the workload is so intensive that low-precision computation is desired to achieve better throughput. As the popularity of half precision rises, half-precision operations also need to be supported. Moreover, deep learning computation contains many dot-product operations, so support for mixed-precision dot-product operations can be explored in a multiple-precision architecture.

In this thesis, a multiple-precision fused multiply-add (FMA) architecture is proposed. It supports half-, single-, double-, and quadruple-precision FMA operations, as well as 2-term mixed-precision dot-product operations. Compared to a conventional multiple-precision FMA architecture, the newly added half-precision support and mixed-precision dot product bring only minor resource overhead. The proposed FMA can be
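To make the supported operation classes concrete, the following sketch emulates the numerical behaviour of a half-precision FMA and of a 2-term mixed-precision dot product using NumPy. The function names, the float32 accumulation width, and the float64-based emulation are illustrative assumptions for this example only, not details of the proposed hardware.

```python
# Minimal numerical sketch (software emulation, not the hardware design) of
# the two operation classes discussed above: a narrow-precision FMA and a
# 2-term mixed-precision dot product with half-precision inputs and
# single-precision accumulation. Names and precisions here are assumptions.
import numpy as np

def fma(a, b, c, dtype):
    """Fused multiply-add a*b + c, rounded once to `dtype`.

    Emulation: compute in float64, then round once to the target format.
    For the narrow formats used below this approximates the fused
    (single-rounding) behaviour; corner cases may still double-round.
    """
    return dtype(np.float64(a) * np.float64(b) + np.float64(c))

def dot2_mixed(a, b, c):
    """2-term mixed-precision dot product: c + a[0]*b[0] + a[1]*b[1],
    with half-precision inputs and single-precision accumulation."""
    a = np.asarray(a, dtype=np.float16)
    b = np.asarray(b, dtype=np.float16)
    acc = np.float32(c)
    # A float16 x float16 product fits exactly in float32, so each product
    # is formed without error before being accumulated in single precision.
    acc += np.float32(a[0]) * np.float32(b[0])
    acc += np.float32(a[1]) * np.float32(b[1])
    return acc

if __name__ == "__main__":
    # Half-precision FMA: all operands and the result in float16.
    print(fma(np.float16(1.5), np.float16(2.25), np.float16(0.5), np.float16))
    # Mixed-precision 2-term dot product, e.g. one partial sum of a
    # convolution or matrix multiplication in a deep learning workload.
    print(dot2_mixed([0.1, 0.2], [0.3, 0.4], 1.0))
```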
