Top-KAST: Top-K Always Sparse Training

Sparse neural networks are becoming increasingly important as the field seeks to improve the performance of existing models by scaling them up, while simultaneously trying to reduce power consumption and computational footprint. Unfortunately, most existing methods for inducing performant sparse models still require instantiating dense parameters, or computing dense gradients in the backward pass, during training. For very large models this requirement can be prohibitive. In this work we propose Top-KAST, a method that preserves constant sparsity throughout training, in both the forward and backward passes. We demonstrate the efficacy of our approach by showing that it performs comparably to or better than previous work when training models on the established ImageNet benchmark, whilst fully maintaining sparsity. In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling, where the current best-performing architectures tend to have tens of billions of parameters and scaling up does not yet appear to have saturated performance. Sparse versions of these architectures can be run with significantly fewer resources, making them more widely accessible and applicable. Furthermore, in addition to being effective, our approach is straightforward and can easily be implemented in a wide range of existing machine learning frameworks with only a few additional lines of code. We therefore hope that our contribution will help enable the broader community to explore the potential held by massive models, without incurring massive computational cost.
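To make the core idea concrete, the following is a minimal NumPy sketch of the kind of top-K magnitude masking described above, with a sparser mask for the forward pass and a slightly denser one for the backward pass. The helper name `topk_mask`, the density values, and the use of NumPy are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def topk_mask(weights: np.ndarray, density: float) -> np.ndarray:
    """Return a binary mask keeping the largest-magnitude `density` fraction of weights."""
    k = max(1, int(round(density * weights.size)))
    flat = np.abs(weights).ravel()
    # Threshold at the k-th largest magnitude; entries below it are zeroed out.
    # (Ties at the threshold may keep slightly more than k entries; fine for a sketch.)
    threshold = np.partition(flat, -k)[-k]
    return (np.abs(weights) >= threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))

forward_density = 0.1    # e.g. 90% sparse forward pass (assumed value)
backward_density = 0.2   # e.g. 80% sparse backward pass (assumed value)

# Forward pass uses only the top-K weights by magnitude.
forward_mask = topk_mask(W, forward_density)
W_sparse = W * forward_mask

# Backward pass applies gradients to a slightly larger set of weights, so the
# active set can change over training without ever materialising dense gradients.
backward_mask = topk_mask(W, backward_density)
# ...compute the loss with W_sparse, then update only entries where backward_mask == 1.
```

Keeping the backward mask a superset of the forward mask is what lets currently inactive weights re-enter the top-K over time while the per-step compute and memory stay bounded by the chosen densities.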
