FPRaker: A Processing Element For Accelerating Neural Network Training

We present FPRaker, a processing element for composing training accelerators. FPRaker processes several floating-point multiply-accumulate operations concurrently and accumulates their results into a higher-precision accumulator. It boosts performance and energy efficiency during training by exploiting the values that naturally appear during training: it processes the significand of each multiply-accumulate operand as a series of signed powers of two, converting to this form on-the-fly. This encoding exposes ineffectual work that can be skipped: many values comprise only a few terms once encoded, and some of those terms can be discarded because they would fall outside the range of the accumulator, given the limited precision of floating-point arithmetic. We demonstrate that FPRaker can be used to compose a training accelerator and that it improves performance and energy efficiency compared to conventional floating-point units under iso-compute-area constraints. We also demonstrate that FPRaker delivers additional benefits when training incorporates pruning and quantization. Finally, we show that FPRaker naturally amplifies performance with training methods that use a different precision per layer.
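To make the two ideas in the abstract concrete, the short Python sketch below illustrates (1) recoding a significand into a series of signed powers of two and (2) discarding terms that would land below the accumulator's precision window. This is only an illustrative model under assumptions of our own: the function names `signed_power_terms` and `effectual_terms`, the greedy nearest-power-of-two recoding, and the accumulator parameters are hypothetical and are not the paper's actual on-the-fly encoder or hardware datapath.

```python
# Minimal sketch (not the FPRaker hardware algorithm) of the two ideas in the abstract:
# (1) a significand recoded as a few signed powers of two, and
# (2) terms skipped when they fall outside the accumulator's precision window.

import math

def signed_power_terms(significand_bits: int):
    """Recode an unsigned integer significand into signed powers of two.

    Returns a list of (sign, exponent) pairs such that
    sum(sign * 2**exponent) equals significand_bits.
    A greedy nearest-power-of-two recoding is used here; it typically yields
    few terms for the values that appear during training.
    """
    terms = []
    value = significand_bits
    while value != 0:
        exp = round(math.log2(abs(value)))   # nearest power of two
        sign = 1 if value > 0 else -1
        terms.append((sign, exp))
        value -= sign * (1 << exp)
    return terms

def effectual_terms(terms, product_exp: int, acc_exp: int, acc_mantissa_bits: int = 23):
    """Keep only the terms whose contribution lands inside the accumulator window.

    A term sign*2**e of a product with exponent product_exp contributes at bit
    position product_exp + e. Anything below acc_exp - acc_mantissa_bits would be
    rounded away by the accumulator, so processing it is ineffectual work.
    """
    cutoff = acc_exp - acc_mantissa_bits
    return [(s, e) for (s, e) in terms if product_exp + e >= cutoff]

# Example: the 8-bit significand 0b01100100 (= 100) recodes into 3 terms instead
# of 3 set bits spread over 8 bit positions; with a (hypothetical) accumulator whose
# exponent is far above the product's, the low-order terms become skippable.
terms = signed_power_terms(0b01100100)
print("recoded terms:  ", terms)                                   # [(1, 7), (-1, 5), (1, 2)]
print("effectual terms:", effectual_terms(terms, product_exp=-10, acc_exp=8))
```

In this toy model the skipped terms correspond to the "ineffectual work" the abstract refers to: their contribution is smaller than the least-significant bit the accumulator can hold, so dropping them does not change the accumulated result.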
