Bfloat16 Processing for Neural Networks

Bfloat16 ("BF16") is a new floating-point format tailored specifically for high-performance processing of Neural Networks and will be supported by major CPU and GPU architectures as well as Neural Network accelerators. This paper proposes a possible implementation of a BF16 multiply-accumulation operation that relaxes several IEEE Floating-Point Standard features to afford low-cost hardware implementations. Specifically, subnorms are flushed to zero; only one non-standard rounding mode (Round-Odd) is supported; NaNs are not propagated; and IEEE exception flags are not provided. The paper shows that this approach achieves the same network-level accuracy as using IEEE single-precision arithmetic ("FP32") for less than half the datapath area cost and with greater throughput.
