Bfloat16 Processing for Neural Networks

Bfloat16 ("BF16") is a new floating-point format tailored specifically for high-performance processing of Neural Networks and will be supported by major CPU and GPU architectures as well as Neural Network accelerators. This paper proposes a possible implementation of a BF16 multiply-accumulation operation that relaxes several IEEE Floating-Point Standard features to afford low-cost hardware implementations. Specifically, subnorms are flushed to zero; only one non-standard rounding mode (Round-Odd) is supported; NaNs are not propagated; and IEEE exception flags are not provided. The paper shows that this approach achieves the same network-level accuracy as using IEEE single-precision arithmetic ("FP32") for less than half the datapath area cost and with greater throughput.
