Compressing Deep Neural Networks for Efficient Speech Enhancement

Deep neural networks (DNNs) have dramatically improved the performance of speech enhancement over the past decade. However, strong enhancement performance typically requires a large DNN, which is both computationally intensive and memory-intensive. Such DNNs are therefore difficult to deploy on devices with limited hardware resources or in applications with strict latency requirements. To address this problem, we propose a model compression pipeline that reduces DNN size for speech enhancement. The pipeline is based on three techniques: sparse regularization, iterative pruning, and clustering-based quantization. Evaluation results show that our approach substantially reduces the sizes of different DNNs without significantly affecting their enhancement performance. Moreover, we find that training and then compressing a large DNN yields higher short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) scores than directly training a small DNN of comparable size to the compressed DNN. This further suggests the benefit of the proposed model compression approach.
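As a rough illustration of how the three stages could fit together, the following Python sketch applies them to a single random weight matrix. It is not the implementation evaluated in this work. It makes several assumptions: a plain L1 penalty stands in for the sparse regularizer, magnitude thresholding stands in for the pruning criterion, the retraining between pruning rounds is omitted, and all function names and hyperparameters (`l1_penalty`, `prune_step`, `kmeans_quantize`, `compress`, the 30% pruning fraction, 16 clusters) are hypothetical.

```python
import numpy as np

def l1_penalty(weights, lam=1e-4):
    """Sparsity-inducing term that would be added to the training loss (assumed: plain L1)."""
    return lam * np.abs(weights).sum()

def prune_step(weights, mask, fraction):
    """Drop the smallest-magnitude `fraction` of the weights that are still unpruned."""
    alive = np.abs(weights[mask])
    k = int(fraction * alive.size)
    if k == 0:
        return mask
    # Magnitude of the k-th smallest surviving weight; everything at or below it is pruned.
    threshold = np.partition(alive, k - 1)[k - 1]
    return mask & (np.abs(weights) > threshold)

def kmeans_quantize(values, n_clusters=16, n_iter=20):
    """1-D k-means (Lloyd's algorithm); returns the centroids and each value's cluster index."""
    centroids = np.linspace(values.min(), values.max(), n_clusters)
    for _ in range(n_iter):
        assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(n_clusters):
            members = values[assign == c]
            if members.size:  # leave empty clusters where they are
                centroids[c] = members.mean()
    assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, assign

def compress(weights, rounds=3, fraction=0.3, n_clusters=16):
    """Iteratively prune, then replace each surviving weight by its cluster centroid."""
    mask = np.ones_like(weights, dtype=bool)
    for _ in range(rounds):
        mask = prune_step(weights, mask, fraction)
        # A real pipeline would retrain here (with the sparse penalty) while keeping the mask fixed.
    centroids, assign = kmeans_quantize(weights[mask], n_clusters)
    compressed = np.zeros_like(weights)
    compressed[mask] = centroids[assign]
    return compressed, mask, centroids

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))  # stand-in for one layer's weight matrix
W_c, mask, centroids = compress(W)
print("L1 penalty on original weights:", l1_penalty(W))
print("fraction of weights kept:", mask.mean())
print("distinct nonzero values:", np.unique(W_c[mask]).size)
```

After compression, the layer can be stored as a sparse index structure plus a small codebook of centroids and one short cluster index per surviving weight, which is where the size reduction would come from under these assumptions.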
