Low-Precision Training in Logarithmic Number System using Multiplicative Weight Update

Training large-scale deep neural networks (DNNs) currently requires a significant amount of energy, leading to serious environmental impacts. One promising approach to reduce the energy cost is representing DNNs with low-precision numbers. While it is common to run forward and backward propagation in low precision, training directly over low-precision weights, without keeping a copy of the weights in high precision, remains an unsolved problem because of the complex interactions between learning algorithms and low-precision number systems. To address this, we jointly design a low-precision training framework, termed LNS-Madam, that combines a logarithmic number system (LNS) with a multiplicative weight update method. LNS has a high dynamic range even at low bitwidths, leading to high energy efficiency and making it well suited for on-device training on energy-constrained edge devices. We design LNS with the flexibility to choose different bases for weights and gradients, since they typically require different quantization gaps and dynamic ranges during training. By exploiting the connection between LNS and multiplicative updates, LNS-Madam keeps the quantization error of the weight update low, leading to stable convergence even when the bitwidth is limited. Compared to training with fixed-point or floating-point number systems and popular optimizers such as SGD and Adam, our joint design of LNS and the LNS-Madam optimizer achieves higher accuracy while requiring a smaller bitwidth. Notably, with only 5 bits for gradients, the proposed framework reaches accuracy comparable to full-precision state-of-the-art models such as ResNet-50 and BERT. To verify the efficiency of our framework, we also conduct energy estimations by analyzing the math datapath units used during training. We estimate that our design achieves over 60x energy reduction compared to FP32 on BERT models, and for full training of ResNet-50 on ImageNet it reduces carbon emissions by around 98%.

[Figure: Energy cost per iteration (mJ) versus number of parameters in GPT models, from 1.7B to 1T parameters.]
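To make the two ingredients concrete, the sketch below quantizes values to a logarithmic number system with a configurable base 2^(1/gamma) and applies a Madam-style multiplicative update, which is additive in the log domain and therefore composes naturally with LNS quantization. This is a minimal illustration, not the paper's exact LNS-Madam algorithm: the function names, the base parameterization, the clipping constant, and the instantaneous RMS gradient normalization are assumptions made for the example.

```python
import numpy as np

def lns_quantize(x, gamma=4, num_bits=8):
    """Quantize x to an LNS value sign * 2**(e / gamma), with the integer
    exponent e clipped to the range representable with num_bits bits
    (one bit reserved for the sign). Illustrative sketch only."""
    sign = np.sign(x)
    mag = np.abs(x)
    eps = np.finfo(x.dtype).tiny  # avoid log of zero
    # Integer exponent in base 2**(1/gamma); larger gamma means a finer quantization gap.
    e = np.round(gamma * np.log2(np.maximum(mag, eps)))
    e_max = 2 ** (num_bits - 1) - 1
    e = np.clip(e, -e_max, e_max)
    return sign * 2.0 ** (e / gamma)

def madam_step(w, grad, lr=0.01, clip=3.0):
    """Madam-style multiplicative update: each weight is scaled by an
    exponential factor, so the update becomes an addition to the exponent
    in the log domain. Instantaneous RMS normalization stands in for a
    running second-moment estimate (simplifying assumption)."""
    g_norm = grad / (np.sqrt(np.mean(grad ** 2)) + 1e-12)
    g_norm = np.clip(g_norm, -clip, clip)
    return w * np.exp(-lr * np.sign(w) * g_norm)

# Toy usage: weights are re-quantized after every update, so they always live in LNS.
rng = np.random.default_rng(0)
w = lns_quantize(rng.standard_normal(8).astype(np.float32), gamma=4, num_bits=8)
g = rng.standard_normal(8).astype(np.float32)
w = lns_quantize(madam_step(w, g), gamma=4, num_bits=8)
print(w)
```

Because the multiplicative update only shifts the stored exponent, re-quantizing the updated weight back into LNS introduces little additional error, which is the intuition behind pairing LNS with a multiplicative optimizer.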
