A Greedy Algorithm for Quantizing Neural Networks

We propose a new, computationally efficient method for quantizing the weights of pre-trained neural networks that is general enough to handle both multi-layer perceptrons and convolutional neural networks. Our method deterministically quantizes layers in an iterative fashion, with no complicated re-training required. Specifically, we quantize each neuron, or hidden unit, using a greedy path-following algorithm. This simple algorithm is equivalent to running a dynamical system, which we prove is stable for quantizing a single-layer neural network (or, alternatively, for quantizing the first layer of a multi-layer network) when the training data are Gaussian. We show that under these assumptions, the quantization error decays with the width of the layer, i.e., with its level of over-parametrization. We provide numerical experiments on multi-layer networks to illustrate the performance of our method on MNIST and CIFAR-10 data.
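To make the greedy step concrete, below is a minimal NumPy sketch of one plausible reading of the per-neuron procedure: a residual vector `u` plays the role of the dynamical-system state, and each weight is rounded to the alphabet element that best absorbs the error accumulated so far. The function name `quantize_neuron`, the ternary alphabet, its scaling, and the zero-norm fallback are illustrative assumptions of this sketch, not the paper's exact specification.

```python
import numpy as np

def quantize_neuron(w, X, alphabet):
    """Greedily quantize one neuron's weights w (shape (N,)) against data
    X (shape (m, N)) so that X @ q tracks X @ w. The state update is
    u_t = u_{t-1} + (w_t - q_t) X[:, t], i.e., a path-following dynamical
    system driven by the rounding errors."""
    N = w.shape[0]
    q = np.zeros(N)
    u = np.zeros(X.shape[0])  # running residual (the system's state)
    for t in range(N):
        Xt = X[:, t]
        norm2 = Xt @ Xt
        # Minimizing ||u + (w_t - p) X_t||_2 over the alphabet has a closed
        # form: round w_t + <X_t, u> / ||X_t||^2 to the nearest element.
        target = w[t] + (Xt @ u) / norm2 if norm2 > 0 else w[t]
        q[t] = alphabet[np.argmin(np.abs(alphabet - target))]
        u += (w[t] - q[t]) * Xt  # state update absorbs the rounding error
    return q
```

A short usage example under the paper's Gaussian-data assumption, measuring the relative quantization error of the neuron's pre-activations:

```python
rng = np.random.default_rng(0)
m, N = 500, 128                                   # N is the layer width
X = rng.standard_normal((m, N))                   # Gaussian training data
w = rng.standard_normal(N) / np.sqrt(N)
A = np.array([-1.0, 0.0, 1.0]) * np.abs(w).max()  # illustrative ternary alphabet
q = quantize_neuron(w, X, A)
rel_err = np.linalg.norm(X @ (w - q)) / np.linalg.norm(X @ w)
```

In this sketch, increasing `N` while keeping the alphabet choice fixed should shrink `rel_err`, mirroring the claim that the quantization error decays with the width of the layer.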
