End-to-End DNN Training with Block Floating Point Arithmetic

DNNs are ubiquitous datacenter workloads, requiring orders of magnitude more computing power from servers than traditional workloads. As such, datacenter operators are forced to adopt domain-specific accelerators that employ half-precision floating-point (FP) numeric representations to improve arithmetic density. Unfortunately, even these representations are not dense enough and are therefore sub-optimal for DNNs. We propose a hybrid approach that employs dense block floating-point (BFP) arithmetic for dot product computations and FP arithmetic elsewhere. Using BFP improves the density of dot product operations, which make up the bulk of DNN computations, while allowing values to float freely in FP between dot products leads to a better choice of tensor exponents when converting values back to BFP. We show that models trained with hybrid BFP-FP arithmetic match or outperform their FP32 counterparts, leading to more compact models and denser arithmetic in computing platforms.
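
To make the FP-to-BFP conversion concrete, the sketch below quantizes a tensor into blocks that share a single exponent, with fixed-point mantissas per element, and then dequantizes back to FP32 to expose the rounding error. This is a minimal illustration under assumed parameters: the block size of 256, 8-bit mantissas, and the exponent and rounding choices are illustrative, not the exact configuration used in the paper.

```python
# Minimal block floating-point (BFP) quantization sketch in NumPy.
# Assumptions for illustration only: block size, mantissa width, and
# round-to-nearest with clipping are not the paper's exact scheme.
import numpy as np

def bfp_roundtrip(x, block_size=256, mantissa_bits=8):
    """Quantize a 1-D FP32 array to BFP (one shared exponent per block,
    signed fixed-point mantissas) and return the FP32 dequantized result."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    # Shared exponent per block, chosen from the largest-magnitude element
    # so that every mantissa in the block fits in the signed integer range.
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    exps = np.floor(np.log2(np.maximum(max_abs, 1e-38))) + 1  # avoid log2(0)

    # Scale, round to nearest, and clip mantissas to `mantissa_bits` signed ints.
    scale = 2.0 ** (exps - (mantissa_bits - 1))
    mantissas = np.clip(np.round(blocks / scale),
                        -(2 ** (mantissa_bits - 1)),
                        2 ** (mantissa_bits - 1) - 1)

    # Dequantize back to FP32 so the quantization error can be inspected.
    return (mantissas * scale).reshape(-1)[:len(x)].astype(np.float32)

x = np.random.randn(1000).astype(np.float32)
print("max abs quantization error:", np.max(np.abs(x - bfp_roundtrip(x))))
```

In a hybrid BFP-FP pipeline of the kind described above, a conversion like this would sit at the boundary of each dot product: activations and weights are grouped under shared exponents for the dense BFP multiply-accumulate, while results are kept in FP between dot products so that the next conversion can pick fresh, better-fitting exponents.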
