Rethinking floating point for deep learning

Reducing the hardware overhead of neural networks for faster or lower-power inference and training is an active area of research. Uniform quantization using integer multiply-add has been thoroughly investigated, but it requires learning many quantization parameters, fine-tuning training, or other prerequisites. Little effort has been made to improve floating point relative to this baseline; it remains energy inefficient, and reducing its word size causes a drastic loss of needed dynamic range. We improve floating point to be more energy efficient than equivalent-bit-width integer hardware on a 28 nm ASIC process while retaining accuracy in 8 bits, using a novel hybrid log multiply/linear add, Kulisch accumulation, and tapered encodings from Gustafson's posit format. With no network retraining, and with all math and float32 parameters replaced drop-in via round-to-nearest-even only, this open-sourced 8-bit log float is within 0.9% top-1 and 0.2% top-5 accuracy of the original float32 ResNet-50 CNN model on ImageNet. Unlike int8 quantization, it remains a general-purpose floating-point arithmetic, interpretable out of the box. Our 8/38-bit log float multiply-add is synthesized and power-profiled at 28 nm at 0.96x the power and 1.12x the area of 8/32-bit integer multiply-add. In 16 bits, our log float multiply-add is 0.59x the power and 0.68x the area of IEEE 754 float16 fused multiply-add, maintaining the same significand precision and dynamic range, proving useful for training ASICs as well.
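To make the arithmetic concrete, here is a minimal Python sketch of the hybrid log multiply/linear add idea: multiplication happens in the log domain (an addition of log magnitudes), and each product is then converted to a linear fixed-point value and summed exactly in a wide Kulisch-style accumulator. The encoding, the `ACC_FRAC_BITS` width, and the exact `2**x` conversion below are illustrative assumptions for exposition, not the paper's 8-bit posit-tapered format or hardware datapath.

```python
import math

# Sketch of hybrid log-domain multiply with linear, Kulisch-style accumulation.
# Zero/NaN handling and the tapered posit encoding are omitted for brevity.

ACC_FRAC_BITS = 30  # fractional bits of the wide accumulator (assumed width)

def log_encode(x):
    """Encode a nonzero float as (sign, log2|x|)."""
    return (1 if x < 0 else 0, math.log2(abs(x)))

def log_multiply(a, b):
    """Multiplication in the log domain is just addition of the logs."""
    return (a[0] ^ b[0], a[1] + b[1])

def to_fixed(p):
    """Convert a log-domain product back to a linear fixed-point integer."""
    sign, log_mag = p
    fixed = int(round((2.0 ** log_mag) * (1 << ACC_FRAC_BITS)))
    return -fixed if sign else fixed

def fused_dot(xs, ws):
    """Dot product: log-domain multiplies, exact linear accumulation."""
    acc = 0  # a Python int models an arbitrarily wide Kulisch accumulator
    for x, w in zip(xs, ws):
        acc += to_fixed(log_multiply(log_encode(x), log_encode(w)))
    return acc / (1 << ACC_FRAC_BITS)

print(fused_dot([1.5, -2.0, 0.25], [0.5, 1.25, -4.0]))  # ≈ -2.75
```

In this sketch, rounding occurs only when each product is snapped to the fixed-point grid; the accumulation itself is exact, which is the role the Kulisch accumulator plays in the hardware design described above.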
