A Survey of Quantization Methods for Efficient Neural Network Inference

As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed-point integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.
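To make the mapping from continuous real values to a discrete set concrete, the sketch below shows uniform affine quantization, one of the simplest schemes in this family: real values are mapped to b-bit integers through a scale and zero-point computed from the tensor's observed range. The function names and the min/max calibration rule here are illustrative assumptions for exposition, not the prescription of any particular method discussed in the survey.

import numpy as np

def quantize(x, num_bits=8):
    # Uniform affine quantization: map real values to integers in
    # [0, 2^b - 1] via a scale and zero-point derived from the
    # observed range of x (simple min/max calibration; assumes x
    # is not constant, so the range is nonzero).
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q.astype(np.int32), scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover approximate real values; the gap between x and x_hat
    # is the quantization error that more sophisticated methods
    # (clipping, non-uniform grids, learned step sizes) try to shrink.
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(8).astype(np.float32)
q, scale, zp = quantize(x, num_bits=4)   # 4-bit codes instead of 32-bit floats
x_hat = dequantize(q, scale, zp)
print(np.abs(x - x_hat).max())           # worst-case rounding error

Storing the 4-bit codes q in place of 32-bit floats yields an 8x reduction in weight memory (and more at lower bit-widths), while the printed error exposes the accuracy cost that motivates the quantization-aware training and post-training calibration methods surveyed here.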
