Omni-sparsity DNN: Fast Sparsity Optimization for On-Device Streaming E2E ASR via Supernet

From wearables to powerful smart devices, modern automatic speech recognition (ASR) models run on a variety of edge devices with different computational budgets. To navigate the Pareto front of model accuracy vs. model size, researchers face a dilemma: optimizing accuracy by training and fine-tuning a separate model for each edge device, while keeping the total training GPU-hours tractable. In this paper, we propose Omni-sparsity DNN, in which a single neural network can be pruned to generate optimized models for a wide range of model sizes. We develop training strategies for Omni-sparsity DNN that allow it to find models along the Pareto front of word-error-rate (WER) vs. model size while keeping the training GPU-hours to no more than those of training a single model. We demonstrate Omni-sparsity DNN with streaming E2E ASR models. Our results show large savings in training time and resources, with similar or better accuracy on LibriSpeech than individually pruned sparse models: 2%-6.6% better WER on test-other.
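
The core idea, pruning one shared network to many target sparsities, can be illustrated with a short training-loop sketch. This is a minimal, hedged example assuming layer-wise magnitude pruning and a fixed set of sparsity levels sampled per step; the names `SparseLinear`, `train_step`, and the loss interface are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of supernet-style sparsity training (assumptions: layer-wise
# magnitude pruning, one shared set of sparsity levels across layers, and a
# generic loss_fn; the paper's actual masking and sampling may differ).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseLinear(nn.Linear):
    """Linear layer whose weights can be masked to a target sparsity."""

    def set_sparsity(self, sparsity: float) -> None:
        # Keep the largest-magnitude weights, zero out the rest.
        with torch.no_grad():
            flat = self.weight.abs().flatten()
            num_keep = int(flat.numel() * (1.0 - sparsity))
            mask = torch.zeros_like(flat)
            if num_keep > 0:
                mask[flat.topk(num_keep).indices] = 1.0
            self.mask = mask.view_as(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = getattr(self, "mask", torch.ones_like(self.weight))
        return F.linear(x, self.weight * mask, self.bias)


def train_step(model, batch, loss_fn, optimizer, sparsities=(0.0, 0.5, 0.9)):
    """One update of the shared weights across several sampled sparsities."""
    optimizer.zero_grad()
    total = 0.0
    for s in sparsities:  # e.g. the smallest, an intermediate, and the largest target
        for layer in model.modules():
            if isinstance(layer, SparseLinear):
                layer.set_sparsity(s)
        loss = loss_fn(model(batch["features"]), batch["targets"])
        loss.backward()  # gradients from every sampled sub-model accumulate
        total += loss.item()
    optimizer.step()
    return total / len(sparsities)
```

In this sketch, a deployment-time model for a given device is obtained by calling `set_sparsity` once with that device's target level and exporting the masked weights, without retraining a separate model per size.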
