LNS-Madam: Low-Precision Training in Logarithmic Number System Using Multiplicative Weight Update

Representing deep neural networks (DNNs) in low precision is a promising approach to enable efficient acceleration and memory reduction. Previous methods that train DNNs in low precision typically keep a high-precision copy of the weights for the weight updates. Directly training with low-precision weights leads to accuracy degradation due to complex interactions between the low-precision number system and the learning algorithm. To address this issue, we develop a co-designed low-precision training framework, termed LNS-Madam, in which we jointly design a logarithmic number system (LNS) and a multiplicative weight update algorithm (Madam). We prove that LNS-Madam results in low quantization error during weight updates, leading to stable performance even when precision is limited. We further propose a hardware design of LNS-Madam that resolves practical challenges in implementing an efficient datapath for LNS computations. Our implementation effectively reduces the energy overhead incurred by LNS-to-integer conversion and partial sum accumulation. Experimental results show that LNS-Madam achieves accuracy comparable to full-precision counterparts with only 8 bits on popular computer vision and natural language tasks. Compared to FP32 and FP8, LNS-Madam reduces energy consumption by over 90% and 55%, respectively.
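
The key idea is that a multiplicative weight update becomes an additive update on the log-magnitude of a weight, so weights stored in an LNS can be updated directly in the log domain and re-quantized at low precision. The following is a minimal sketch of this interaction, assuming a Madam-style gradient normalization and a base-2 LNS with a fixed number of fractional log bits; the function names (e.g., lns_madam_step, quantize_log) and the choice of per-tensor RMS normalization are illustrative assumptions, not the authors' implementation.

    import torch

    def quantize_log(log_mag, frac_bits=3):
        """Round a log2 magnitude to a fixed number of fractional bits,
        emulating a low-precision logarithmic number system (LNS)."""
        scale = 2 ** frac_bits
        return torch.round(log_mag * scale) / scale

    @torch.no_grad()
    def lns_madam_step(sign, log_mag, grad, lr=0.01, frac_bits=3):
        """One illustrative LNS-Madam-style update.

        Weights are stored as (sign, log2 |w|). A multiplicative update
        w <- w * exp2(-lr * sign(w) * g_hat) becomes an *additive* update
        on log_mag, which is then re-quantized to the LNS precision.
        """
        # Per-tensor RMS normalization of the gradient (Madam-style).
        g_hat = grad / (grad.pow(2).mean().sqrt() + 1e-12)
        # Multiplicative in the linear domain == additive in the log domain.
        log_mag = log_mag - lr * sign * g_hat
        return sign, quantize_log(log_mag, frac_bits)

    # Toy usage: a single weight tensor and a synthetic gradient.
    w = torch.randn(4)
    sign, log_mag = torch.sign(w), torch.log2(w.abs())
    sign, log_mag = lns_madam_step(sign, log_mag, grad=torch.randn(4))
    w_new = sign * torch.exp2(log_mag)  # back to linear domain for inspection

Because the update never leaves the log domain, no high-precision linear-domain weight copy is needed, which is the property the framework exploits to keep quantization error low at limited precision.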
