The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning

Recently there has been significant interest in training machine-learning models at low precision: by reducing precision, one can reduce computation and communication by one order of magnitude. We examine training at reduced precision from both a theoretical and a practical perspective, and ask: is it possible to train models at end-to-end low precision with provable guarantees? Can this lead to consistent order-of-magnitude speedups? We present a framework called ZipML to answer these questions. For linear models, the answer is yes. We develop a simple framework built around one novel strategy called double sampling. Our framework executes training at low precision with no bias, guaranteeing convergence, whereas naive quantization would introduce significant bias. We validate our framework across a range of applications and show that it enables an FPGA prototype that is up to 6.5x faster than an implementation using full 32-bit precision. We further develop a variance-optimal stochastic quantization strategy and show that it can make a significant difference in a variety of settings. When applied to linear models together with double sampling, it saves up to another 1.7x in data movement compared with uniform quantization. When training deep networks with quantized models, we achieve higher accuracy than the state-of-the-art XNOR-Net. Finally, we extend our framework, through approximation, to non-linear models such as SVM. We show that, although using low-precision data induces bias, we can appropriately bound and control that bias, and we find that in practice 8-bit precision is often sufficient to converge to the correct solution. Interestingly, however, we observe in practice that our framework does not always outperform the naive rounding approach. We discuss this negative result in detail.
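To make the double-sampling idea concrete, below is a minimal NumPy sketch of an unbiased low-precision gradient for least-squares SGD. The function names, the uniform level grid, and the single-sample setup are illustrative choices, not the paper's actual implementation, and the sketch assumes data already normalized into the quantization range. The key point it shows: the least-squares gradient a * (a^T x - b) uses the sample a twice, so a single stochastic quantization Q(a) is biased by its own variance, whereas two independent quantizations satisfy E[Q1(a) * (Q2(a)^T x - b)] = a * (a^T x - b).

```python
import numpy as np

def stochastic_round(x, levels):
    """Round each entry of x to a neighboring quantization level, unbiasedly.

    `levels` is a sorted 1-D grid covering the data range; each value is sent
    to the level just below or just above it with probabilities chosen so that
    E[Q(x)] = x (assumes x already lies inside [levels[0], levels[-1]]).
    """
    idx = np.searchsorted(levels, x, side="right") - 1
    idx = np.clip(idx, 0, len(levels) - 2)
    lo, hi = levels[idx], levels[idx + 1]
    p_up = (x - lo) / (hi - lo)                 # P[round up] keeps the mean exact
    return np.where(np.random.rand(*x.shape) < p_up, hi, lo)

def double_sampled_gradient(a, b, x, levels):
    """Unbiased low-precision SGD gradient for the loss (a^T x - b)^2 / 2.

    The full-precision gradient a * (a^T x - b) uses the sample `a` twice, so
    a single quantization Q(a) would add its variance as bias. Two independent
    quantizations keep the gradient estimator unbiased in expectation.
    """
    q1 = stochastic_round(a, levels)            # first independent quantization
    q2 = stochastic_round(a, levels)            # second independent quantization
    return q1 * (q2 @ x - b)

# Example usage: a 3-bit uniform grid on [-1, 1] (an illustrative choice).
levels = np.linspace(-1.0, 1.0, 2 ** 3)
a = np.random.uniform(-1.0, 1.0, size=16)       # one normalized training sample
x = np.zeros(16)                                # current model
grad = double_sampled_gradient(a, b=0.5, x=x, levels=levels)
```

Averaged over many draws, `grad` matches the full-precision gradient `a * (a @ x - 0.5)`, which is what guarantees convergence despite the quantized data.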
