An Overview of Neural Network Compression

Overparameterized networks trained to convergence have shown impressive performance in domains such as computer vision and natural language processing. However, pushing the state of the art on salient tasks within these domains has meant ever-larger models that are harder for machine learning practitioners to deploy, given their growing memory and storage requirements, not to mention their larger carbon footprint. Consequently, recent years have seen a resurgence of interest in model compression techniques, particularly for deep convolutional neural networks and self-attention-based networks such as the Transformer. This paper therefore provides a timely overview of both older and current compression techniques for deep neural networks (DNNs), including pruning, quantization, tensor decomposition, knowledge distillation and combinations thereof. We assume a basic familiarity with deep learning architectures,\footnote{For an introduction to deep learning, see~\citet{goodfellow2016deep}.} namely Recurrent Neural Networks (RNNs)~\citep{rumelhart1985learning,hochreiter1997long}, Convolutional Neural Networks (CNNs)~\citep{fukushima1980neocognitron}\footnote{For an up-to-date overview of CNN architectures, see~\citet{khan2019survey}.} and self-attention-based networks~\citep{vaswani2017attention}\footnote{For a general overview of self-attention networks, see~\citet{chaudhari2019attentive}; for more detail, including their use in natural language processing, see~\citet{hu2019introductory}.}. Most of the papers discussed are proposed in the context of at least one of these DNN architectures.
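To make the scope concrete before surveying these families in detail, the listing below gives a minimal, PyTorch-style sketch of two of them: global magnitude pruning and the soft-target distillation loss popularized by Hinton et al. (2015). It is an illustrative sketch, not an implementation from any particular paper surveyed; the sparsity level, temperature and loss-mixing weight are assumed values, and the helper names are ours.

\begin{verbatim}
# Minimal illustration of two compression families surveyed here:
# (a) global magnitude pruning, (b) the soft-target distillation loss.
# Hyperparameters (sparsity, T, alpha) are assumed for illustration only.
import torch
import torch.nn.functional as F

def global_magnitude_prune(model, sparsity=0.9):
    """Zero out the smallest-magnitude weights across all weight matrices."""
    all_w = torch.cat([p.detach().abs().flatten()
                       for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(all_w, sparsity)
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:                      # leave biases and norms dense
            mask = (p.detach().abs() > threshold).float()
            p.data.mul_(mask)                # apply the binary mask in place
            masks[name] = mask               # re-apply after each update step
    return masks

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend a softened teacher-matching term with the usual hard-label loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
\end{verbatim}

In a typical pipeline the pruning mask is re-applied after every optimizer step, and the distillation loss replaces the plain cross-entropy objective when training a smaller student against a larger teacher; more principled variants of both ideas, alongside quantization and tensor decomposition, are the subject of the survey.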

References

[1] F. L. Hitchcock. The Expression of a Tensor or a Polyadic as a Sum of Products. 1927.
[2] C. G. Broyden. A Class of Methods for Solving Nonlinear Simultaneous Equations. 1965.
[3] Wonyong Sung et al. Structured Pruning of Deep Convolutional Neural Networks. ACM J. Emerg. Technol. Comput. Syst., 2015.
[4] Rongrong Ji et al. Accelerating Convolutional Networks via Global & Dynamic Filter Pruning. IJCAI, 2018.
[5] Eunhyeok Park et al. Value-aware Quantization for Training and Inference of Neural Networks. ECCV, 2018.
[6] Giovanna Castellano et al. An iterative pruning algorithm for feedforward neural networks. IEEE Trans. Neural Networks, 1997.
[7] Dmitry P. Vetrov et al. Variational Dropout Sparsifies Deep Neural Networks. ICML, 2017.
[8] Asifullah Khan et al. A survey of the recent architectures of deep convolutional neural networks. Artificial Intelligence Review, 2019.
[9] Yixin Chen et al. Compressing Neural Networks with the Hashing Trick. ICML, 2015.
[10] Suyog Gupta et al. To prune, or not to prune: exploring the efficacy of pruning for model compression. ICLR, 2017.
[11] S. P. Lloyd et al. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 1982.
[12] Ehud D. Karnin et al. A simple procedure for pruning back-propagation trained neural networks. IEEE Trans. Neural Networks, 1990.
[13] Xiangyu Zhang et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. CVPR, 2018.
[14] Ebru Arisoy et al. Low-rank matrix factorization for Deep Neural Network training with high-dimensional output targets. ICASSP, 2013.
[15] Yue Wang et al. Drawing early-bird tickets: Towards more efficient training of deep networks. ICLR, 2019.
[16] Adam Gaier et al. Weight Agnostic Neural Networks. NeurIPS, 2019.
[17] Babak Hassibi et al. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. NIPS, 1992.
[18] Lukasz Kaiser et al. Universal Transformers. ICLR, 2018.
[19] Mário A. T. Figueiredo et al. Learning to Share: simultaneous parameter tying and Sparsification in Deep Learning. ICLR, 2018.
[20] Anahita Bhiwandiwalla et al. Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks. ICLR, 2020.
[21] James T. Kwok et al. Loss-aware Binarization of Deep Networks. ICLR, 2016.
[22] Yishay Mansour et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS, 1999.
[23] Timo Aila et al. Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning. arXiv, 2016.
[24] Xin Wang et al. Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization. ICML, 2019.
[25] Xianglong Liu et al. Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks. ICCV, 2019.
[26] Gerhard Rigoll et al. Convolutional Neural Networks with Layer Reuse. ICIP, 2019.
[27] Timo Aila et al. Temporal Ensembling for Semi-Supervised Learning. ICLR, 2016.
[28] Kyoung Mu Lee et al. Deeply-Recursive Convolutional Network for Image Super-Resolution. CVPR, 2016.
[29] Yiming Yang et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding. NeurIPS, 2019.
[30] Phillip Isola et al. Contrastive Representation Distillation. ICLR, 2020.
[31] Luca Benini et al. Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations. NIPS, 2017.
[32] Kalyanmoy Deb et al. A Comparative Analysis of Selection Schemes Used in Genetic Algorithms. FOGA, 1990.
[33] Yan Lu et al. Relational Knowledge Distillation. CVPR, 2019.
[34] Shingo Mabu et al. Enhancing the generalization ability of neural networks through controlling the hidden layers. Appl. Soft Comput., 2009.
[35] David Kappel et al. Deep Rewiring: Training very sparse deep networks. ICLR, 2017.
[36] Guigang Zhang et al. Deep Learning. Int. J. Semantic Comput., 2016.
[37] S. Liberty et al. Linear Systems. Scientific Parallel Computing, 2010.
[38] David P. Wipf et al. Compressing Neural Networks using the Variational Information Bottleneck. ICML, 2018.
[39] Lin Xu et al. Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights. ICLR, 2017.
[40] Hanan Samet et al. Pruning Filters for Efficient ConvNets. ICLR, 2016.
[41] Dichao Hu et al. An Introductory Survey on Attention Mechanisms in NLP Problems. IntelliSys, 2018.
[42] Wee Kheng Leow et al. Pruned Neural Networks for Regression. PRICAI, 2000.
[43] Michael T. Manry et al. An integrated growing-pruning method for feedforward network training. Neurocomputing, 2008.
[44] Brian Kingsbury et al. Knowledge distillation across ensembles of multilingual models for low-resource languages. ICASSP, 2017.
[45] Ivan Oseledets et al. Tensor-Train Decomposition. SIAM J. Sci. Comput., 2011.
[46] Russell Reed et al. Pruning algorithms-a survey. IEEE Trans. Neural Networks, 1993.
[47] Naftali Tishby et al. Opening the Black Box of Deep Neural Networks via Information. arXiv, 2017.
[48] Dan Alistarh et al. Model compression via distillation and quantization. ICLR, 2018.
[49] Quoc V. Le et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML, 2019.
[50] Yaroslav Bulatov et al. Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks. ICLR, 2013.
[51] Michael Carbin et al. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR, 2018.
[52] Elad Hoffer et al. ACIQ: Analytical Clipping for Integer Quantization of neural networks. arXiv, 2018.
[53] Nikos Komodakis et al. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. ICLR, 2016.
[54] Mikhail Belkin et al. Two models of double descent for weak features. SIAM J. Math. Data Sci., 2019.
[55] Yisong Yue et al. Long-term Forecasting using Tensor-Train RNNs. arXiv, 2017.
[56] Yoshua Bengio et al. Equilibrium Propagation: Bridging the Gap between Energy-Based Models and Backpropagation. Front. Comput. Neurosci., 2016.
[57] Rohan Ramanath et al. An Attentive Survey of Attention Models. ACM Trans. Intell. Syst. Technol., 2019.
[58] Jian Sun et al. Identity Mappings in Deep Residual Networks. ECCV, 2016.
[59] Mohammad Shoeybi et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv, 2019.
[60] Rui Zhang et al. KDGAN: Knowledge Distillation with Generative Adversarial Networks. NeurIPS, 2018.
[61] Wei Liu et al. Neural Compatibility Modeling with Attentive Knowledge Distillation. SIGIR, 2018.
[62] David D. Cox et al. On the information bottleneck theory of deep learning. ICLR, 2018.
[63] Dipankar Das et al. Mixed Precision Training With 8-bit Floating Point. arXiv, 2019.
[64] Graham Neubig et al. Understanding Knowledge Distillation in Non-autoregressive Machine Translation. ICLR, 2019.
[65] Swagath Venkataramani et al. PACT: Parameterized Clipping Activation for Quantized Neural Networks. arXiv, 2018.
[66] Michael Maire et al. Learning Implicitly Recurrent CNNs Through Parameter Sharing. ICLR, 2019.
[67] Pradeep Dubey et al. A Study of BFLOAT16 for Deep Learning Training. arXiv, 2019.
[68] Tie-Yan Liu et al. Neural Architecture Optimization. NeurIPS, 2018.
[69] Hang Li et al. Convolutional Neural Network Architectures for Matching Natural Language Sentences. NIPS, 2014.
[70] Vladlen Koltun et al. Deep Equilibrium Models. NeurIPS, 2019.
[71] Kunihiko Fukushima et al. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 1980.
[72] Ali Farhadi et al. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. ECCV, 2016.
[73] Svetlana Lazebnik et al. Piggyback: Adding Multiple Tasks to a Single, Fixed Network by Learning to Mask. arXiv, 2018.
[74] Pradeep Dubey et al. Mixed Precision Training of Convolutional Neural Networks using Integer Operations. ICLR, 2018.
[75] Michael C. Mozer et al. Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment. NIPS, 1988.
[76] Jin Young Choi et al. Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons. AAAI, 2018.
[77] Omer Levy et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv, 2019.
[78] Mark Chen et al. Language Models are Few-Shot Learners. NeurIPS, 2020.
[79] Harri Valpola et al. Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv, 2017.
[80] Marcin Andrychowicz et al. Learning to learn by gradient descent by gradient descent. NIPS, 2016.
[81] Omer Levy et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. BlackboxNLP@EMNLP, 2018.
[82] Yun Fu et al. Residual Dense Network for Image Super-Resolution. CVPR, 2018.
[83] Yaim Cooper et al. The loss landscape of overparameterized neural networks. arXiv, 2018.
[84] Rémi Gribonval et al. And the Bit Goes Down: Revisiting the Quantization of Neural Networks. ICLR, 2019.
[85] Thad Starner et al. Data-Free Knowledge Distillation for Deep Neural Networks. arXiv, 2017.
[86] Sangwook Cho et al. Understanding Knowledge Distillation. 2020.
[87] Qun Liu et al. TinyBERT: Distilling BERT for Natural Language Understanding. EMNLP, 2020.
[88] Andries Petrus Engelbrecht et al. A new pruning heuristic based on variance analysis of sensitivity information. IEEE Trans. Neural Networks, 2001.
[89] Vincent Lepetit et al. Learning Separable Filters. CVPR, 2013.
[90] Zhe Gan et al. Distilling Knowledge Learned in BERT for Text Generation. ACL, 2019.
[91] Vladlen Koltun et al. Trellis Networks for Sequence Modeling. ICLR, 2018.
[92] Niraj K. Jha et al. NeST: A Neural Network Synthesis Tool Based on a Grow-and-Prune Paradigm. IEEE Transactions on Computers, 2017.
[93] Barnabás Póczos et al. Gradient Descent Provably Optimizes Over-parameterized Neural Networks. ICLR, 2018.
[94] Surya Ganguli et al. Pruning neural networks without any data by iteratively conserving synaptic flow. NeurIPS, 2020.
[95] Yurong Chen et al. Explicit Loss-Error-Aware Quantization for Low-Bit Deep Neural Networks. CVPR, 2018.
[96] Yifan Gong et al. Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network. ICASSP, 2014.
[97] Yann LeCun et al. Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks. arXiv, 2018.
[98] Ian H. Witten et al. Data Compression Using Adaptive Coding and Partial String Matching. IEEE Trans. Commun., 1984.
[99] Alfred Jean Philippe Lauret et al. A node pruning algorithm based on a Fourier amplitude sensitivity test method. IEEE Transactions on Neural Networks, 2006.
[100] Zhuowen Tu et al. Aggregated Residual Transformations for Deep Neural Networks. CVPR, 2017.
[101] Xin Dong et al. Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon. NIPS, 2017.
[102] Max Welling et al. Bayesian Compression for Deep Learning. NIPS, 2017.
[103] Hakan Inan et al. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. ICLR, 2016.
[104] Song Han et al. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. ECCV, 2018.
[105] Xingrui Yu et al. Co-teaching: Robust training of deep neural networks with extremely noisy labels. NeurIPS, 2018.
[106] Dharmendra S. Modha et al. Deep neural networks are robust to weight binarization and other non-linear distortions. arXiv, 2016.
[107] Boaz Barak et al. Deep double descent: where bigger models and more data hurt. ICLR, 2019.
[108] Yifan Gong et al. Restructuring of deep neural network acoustic models with singular value decomposition. INTERSPEECH, 2013.
[109] Yoshua Bengio et al. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv, 2013.
[110] Max Welling et al. Soft Weight-Sharing for Neural Network Compression. ICLR, 2017.
[111] Michael W. Mahoney et al. Exact expressions for double descent and implicit regularization via surrogate random design. NeurIPS, 2019.
[112] Yu Cheng et al. Patient Knowledge Distillation for BERT Model Compression. EMNLP, 2019.
[113] Yoshua Bengio et al. FitNets: Hints for Thin Deep Nets. ICLR, 2014.
[114] Gintare Karolina Dziugaite et al. The Lottery Ticket Hypothesis at Scale. arXiv, 2019.
[115] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.
[116] Quoc V. Le et al. Improved Noisy Student Training for Automatic Speech Recognition. INTERSPEECH, 2020.
[117] Daniel Brand et al. Training Deep Neural Networks with 8-bit Floating Point Numbers. NeurIPS, 2018.
[118] Yiming Yang et al. DARTS: Differentiable Architecture Search. ICLR, 2018.
[119] Luke Zettlemoyer et al. Sparse Networks from Scratch: Faster Training without Losing Performance. arXiv, 2019.
[120] Yoon Kim et al. Convolutional Neural Networks for Sentence Classification. EMNLP, 2014.
[121] Riccardo Poli et al. Particle swarm optimization. Swarm Intelligence, 1995.
[122] Torsten Hoefler et al. Shapeshifter Networks: Cross-layer Parameter Sharing for Scalable and Effective Deep Learning. arXiv, 2020.
[123] Shuchang Zhou et al. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv, 2016.
[124] Kilian Q. Weinberger et al. Densely Connected Convolutional Networks. CVPR, 2017.
[125] Andrew Zisserman et al. Speeding up Convolutional Neural Networks with Low Rank Expansions. BMVC, 2014.
[126] Christoph H. Lampert et al. Towards Understanding Knowledge Distillation. ICML, 2019.
[127] Jian Zhang et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text. EMNLP, 2016.
[128] Timo Aila et al. Pruning Convolutional Neural Networks for Resource Efficient Inference. ICLR, 2016.
[129] Mingkui Tan et al. NAT: Neural Architecture Transformer for Accurate and Compact Architectures. NeurIPS, 2019.
[130] Hao Wu et al. Mixed Precision Training. ICLR, 2017.
[131] Yann LeCun et al. Understanding Deep Architectures using a Recursive Convolutional Network. ICLR, 2013.
[132] Kevin Gimpel et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. ICLR, 2019.
[133] Dustin Tran et al. Image Transformer. ICML, 2018.
[134] François Chollet et al. Xception: Deep Learning with Depthwise Separable Convolutions. CVPR, 2017.
[135] Guokun Lai et al. RACE: Large-scale ReAding Comprehension Dataset From Examinations. EMNLP, 2017.
[136] Srinidhi Hegde et al. Variational Student: Learning Compact and Sparser Networks In Knowledge Distillation Framework. ICASSP, 2020.
[137] Lieven De Lathauwer et al. Decompositions of a Higher-Order Tensor in Block Terms - Part II: Definitions and Uniqueness. SIAM J. Matrix Anal. Appl., 2008.
[138] Soheil Ghiasi et al. Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 2018.
[139] Mingjie Sun et al. Rethinking the Value of Network Pruning. ICLR, 2018.
[140] Soumith Chintala et al. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR, 2015.
[141] Ali Farhadi et al. What’s Hidden in a Randomly Weighted Neural Network? CVPR, 2020.
[142] Jürgen Schmidhuber et al. Long Short-Term Memory. Neural Computation, 1997.
[143] Jingbo Zhu et al. Sharing Attention Weights for Fast Transformer. IJCAI, 2019.
[144] Yale Song et al. Learning from Noisy Labels with Distillation. ICCV, 2017.
[145] Geoffrey E. Hinton et al. When Does Label Smoothing Help? NeurIPS, 2019.
[146] Geoffrey E. Hinton et al. Simplifying Neural Networks by Soft Weight-Sharing. Neural Computation, 1992.
[147] Seyed Iman Mirzadeh et al. Improved Knowledge Distillation via Teacher Assistant. AAAI, 2020.
[148] Forrest N. Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv, 2016.
[149] Jiwen Lu et al. Runtime Neural Pruning. NIPS, 2017.
[150] Vineeth N. Balasubramanian et al. Deep Model Compression: Distilling Knowledge from Noisy Teachers. arXiv, 2016.
[151] Yevgen Chebotar et al. Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition. INTERSPEECH, 2016.
[152] Yiming Hu et al. A novel channel pruning method for deep neural network compression. arXiv, 2018.
[153] Nicholas Rhinehart et al. N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning. ICLR, 2017.
[154] Victor S. Lempitsky et al. Fast ConvNets Using Group-Wise Brain Damage. CVPR, 2016.
[155] L. Darrell Whitley et al. Genetic algorithms and neural networks: optimizing connections and connectivity. Parallel Comput., 1990.
[156] Song Han et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. ICLR, 2015.
[157] Lukasz Kaiser et al. Attention is All you Need. NIPS, 2017.
[158] Qiang Liu et al. On the Margin Theory of Feedforward Neural Networks. arXiv, 2018.
[159] Christopher A. Walsh et al. Peter Huttenlocher (1931–2013). Nature, 2013.
[160] Yiran Chen et al. Learning Structured Sparsity in Deep Neural Networks. NIPS, 2016.
[161] Kurt Keutzer et al. HAWQ: Hessian AWare Quantization of Neural Networks With Mixed-Precision. ICCV, 2019.
[162] Nikos Komodakis et al. Wide Residual Networks. BMVC, 2016.
[163] Raghuraman Krishnamoorthi et al. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv, 2018.
[164] James G. Scott et al. The horseshoe estimator for sparse signals. 2010.
[165] Kilian Q. Weinberger et al. Feature hashing for large scale multitask learning. ICML, 2009.
[166] Hadi Esmaeilzadeh et al. ReLeQ: An Automatic Reinforcement Learning Approach for Deep Quantization of Neural Networks. 2018.
[167] Kurt Keutzer et al. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT. AAAI, 2020.
[168] Zenglin Xu et al. Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition. CVPR, 2018.
[169] Geoffrey E. Hinton et al. Distilling the Knowledge in a Neural Network. arXiv, 2015.
[170] Hassan Ghasemzadeh et al. Improved Knowledge Distillation via Teacher Assistant: Bridging the Gap Between Student and Teacher. arXiv, 2019.
[171] Peter Stone et al. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 2017.
[172] L. Tucker et al. Some mathematical notes on three-mode factor analysis. Psychometrika, 1966.
[173] Mikhail Belkin et al. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 2018.
[174] Rich Caruana et al. Do Deep Nets Really Need to be Deep? NIPS, 2013.
[175] Xiaodong Liu et al. Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding. arXiv, 2019.
[176] Hadi Esmaeilzadeh et al. ReLeQ: A Reinforcement Learning Approach for Deep Quantization of Neural Networks. arXiv, 2018.
[177] Zhe Gan et al. Distilling the Knowledge of BERT for Text Generation. arXiv, 2019.
[178] Song Han et al. Trained Ternary Quantization. ICLR, 2016.
[179] Wei Pan et al. Towards Accurate Binary Convolutional Neural Network. NIPS, 2017.
[180] Nitish Srivastava et al. Improving neural networks by preventing co-adaptation of feature detectors. arXiv, 2012.
[181] Philip H. S. Torr et al. SNIP: Single-shot Network Pruning based on Connection Sensitivity. ICLR, 2018.
[182] Jian Sun et al. Deep Residual Learning for Image Recognition. CVPR, 2016.
[183] Nathan Halko et al. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Rev., 2009.
[184] Christian Lebiere et al. The Cascade-Correlation Learning Architecture. NIPS, 1989.
[185] John E. Moody et al. Fast Pruning Using Principal Components. NIPS, 1993.
[186] Xu Lan et al. Knowledge Distillation by On-the-Fly Native Ensemble. NeurIPS, 2018.
[187] Greg Mori et al. Similarity-Preserving Knowledge Distillation. ICCV, 2019.
[188] Masafumi Hagiwara et al. Removal of hidden units and weights for back propagation networks. IJCNN, 1993.
[189] Bin Liu et al. Ternary Weight Networks. 2016; ICASSP, 2023.
[190] Prem Raj Adhikari et al. Multiresolution Mixture Modeling using Merging of Mixture Components. ACML, 2012.
[191] Geoffrey E. Hinton et al. ImageNet classification with deep convolutional neural networks. Commun. ACM, 2012.
[192] Soheil Feizi et al. Compressing GANs using Knowledge Distillation. arXiv, 2019.
[193] Michael Carbin et al. The Lottery Ticket Hypothesis: Training Pruned Neural Networks. arXiv, 2018.
[194] Junjie Yan et al. Dynamic Recursive Neural Network. CVPR, 2019.
[195] Jian Yang et al. Image Super-Resolution via Deep Recursive Residual Network. CVPR, 2017.
[196] Jang Hyun Cho et al. On the Efficacy of Knowledge Distillation. ICCV, 2019.
[197] Guy Lever et al. Deterministic Policy Gradient Algorithms. ICML, 2014.
[198] Il-Chul Moon et al. Adversarial Dropout for Supervised and Semi-supervised Learning. AAAI, 2017.
[199] Tim Dettmers et al. 8-Bit Approximations for Parallelism in Deep Learning. ICLR, 2015.
[200] Yoshua Bengio et al. Training deep neural networks with low precision multiplications. 2014.
[201] Markus Freitag et al. Ensemble Distillation for Neural Machine Translation. arXiv, 2017.
[202] Liujuan Cao et al. Towards Optimal Structured CNN Pruning via Generative Adversarial Learning. CVPR, 2019.
[203] Robert M. Gray et al. Speech coding based upon vector quantization. ICASSP, 1980.
[204] Geoffrey E. Hinton et al. Learning internal representations by error propagation. 1986.
[205] Alexander Novikov et al. Tensorizing Neural Networks. NIPS, 2015.
[206] Bo Chen et al. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CVPR, 2018.
[207] Ariel D. Procaccia et al. Variational Dropout and the Local Reparameterization Trick. NIPS, 2015.
[208] Edouard Grave et al. Training with Quantization Noise for Extreme Model Compression. ICLR, 2020.
[209] Cordelia Schmid et al. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[210] Hao Li et al. Visualizing the Loss Landscape of Neural Nets. NeurIPS, 2017.
[211] Rich Caruana et al. Model compression. KDD, 2006.
[212] John Langford et al. Hash Kernels for Structured Data. J. Mach. Learn. Res., 2009.
[213] Gregory J. Wolff et al. Optimal Brain Surgeon: Extensions and performance comparisons. NIPS, 1993.
[214] Atsushi Fujita et al. Recurrent Stacking of Layers for Compact Neural Machine Translation Models. AAAI, 2018.
[215] Asit K. Mishra et al. Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy. ICLR, 2017.
[216] Pritish Narayanan et al. Deep Learning with Limited Numerical Precision. ICML, 2015.
[217] Alexander M. Rush et al. Sequence-Level Knowledge Distillation. EMNLP, 2016.
[218] Bo Chen et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv, 2017.
[219] Dacheng Tao et al. On Compressing Deep Models by Low Rank and Sparse Decomposition. CVPR, 2017.
[220] Sachin S. Talathi et al. Fixed Point Quantization of Deep Convolutional Networks. ICML, 2015.
[221] Jun-Fei Qiao et al. A structure optimisation algorithm for feedforward neural network construction. Neurocomputing, 2013.
[222] Fei Han et al. A Neural Network Pruning Method Optimized with PSO Algorithm. Second International Conference on Computer Modeling and Simulation, 2010.
[223] Lucas Theis et al. Faster gaze prediction with dense networks and Fisher pruning. arXiv, 2018.
[224] Eriko Nurvitadhi et al. WRPN: Wide Reduced-Precision Networks. ICLR, 2017.
[225] Olatunji Ruwase et al. ZeRO: Memory Optimization Towards Training A Trillion Parameter Models. SC, 2019.
[226] Parul Parashar et al. Neural Networks in Machine Learning. 2014.
[227] Lihi Zelnik-Manor et al. ASAP: Architecture Search, Anneal and Prune. AISTATS, 2019.
[228] Thomas Wolf et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv, 2019.
[229] Erick Cantú-Paz. Pruning Neural Networks with Distribution Estimation Algorithms. GECCO, 2003.
[230] Yann Dauphin et al. Convolutional Sequence to Sequence Learning. ICML, 2017.
[231] Larry S. Davis et al. NISP: Pruning Networks Using Neuron Importance Score Propagation. CVPR, 2018.
[232] Richard F. Lyon et al. Neural Networks for Machine Learning. 2017.
[233] Yann LeCun et al. Optimal Brain Damage. NIPS, 1989.