Paper Information - Ultra-Lightweight Speech Separation Via Group Communication

Model size and complexity remain the biggest challenges in deploying speech enhancement and separation systems on low-resource devices such as earphones and hearing aids. Although methods such as compression, distillation and quantization can be applied to large models, they often come at a cost in model performance. In this paper, we provide a simple model design paradigm that explicitly designs ultra-lightweight models without sacrificing performance. Motivated by sub-band frequency-LSTM (F-LSTM) architectures, we introduce group communication (GroupComm), in which a feature vector is split into smaller groups and a small processing block is used to perform inter-group communication. Unlike standard F-LSTM models, where the sub-band outputs are concatenated, an ultra-small module is applied to all the groups in parallel, which allows a significant decrease in model size. Experimental results show that, compared with a strong baseline model that is already lightweight, GroupComm can achieve on-par performance with 35.6 times fewer parameters and 2.3 times fewer operations.
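
The parameter saving that comes from sharing one small module across groups can be illustrated with a toy sketch. This is not the paper's model: the dense layers, the mean-based `group_comm` step, and the sizes (`feature_dim = 128`, `num_groups = 16`) are all assumptions chosen for illustration, and the resulting ratio is unrelated to the 35.6 times figure reported above. The sketch only shows the general idea of splitting a feature vector into groups, letting the groups exchange information, and then applying the same small module to every group.

```python
import random


def split_into_groups(vector, num_groups):
    """Split a flat feature vector into equally sized contiguous groups."""
    if len(vector) % num_groups != 0:
        raise ValueError("feature size must be divisible by num_groups")
    size = len(vector) // num_groups
    return [vector[i * size:(i + 1) * size] for i in range(num_groups)]


def make_linear(in_dim, out_dim, rng):
    """Create a dense layer as (weights, bias); weights is a list of rows."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(weights, bias, x):
    """Apply a dense layer: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def count_params(weights, bias):
    """Count the scalar parameters of a dense layer."""
    return sum(len(row) for row in weights) + len(bias)


def group_comm(groups):
    """Hypothetical stand-in for inter-group communication:
    add the mean over all groups to each group (an assumption,
    not the processing block used in the paper)."""
    k = len(groups)
    mean = [sum(g[i] for g in groups) / k for i in range(len(groups[0]))]
    return [[v + m for v, m in zip(g, mean)] for g in groups]


def grouped_forward(vector, num_groups, shared):
    """Split, communicate across groups, then apply ONE shared small
    module to every group and concatenate the results."""
    groups = group_comm(split_into_groups(vector, num_groups))
    weights, bias = shared
    outputs = [linear(weights, bias, g) for g in groups]  # same weights each time
    return [v for g in outputs for v in g]


rng = random.Random(0)
feature_dim, num_groups = 128, 16          # illustrative sizes only
group_dim = feature_dim // num_groups

full = make_linear(feature_dim, feature_dim, rng)   # one large layer
shared = make_linear(group_dim, group_dim, rng)     # one small shared layer

x = [rng.uniform(-1.0, 1.0) for _ in range(feature_dim)]
y = grouped_forward(x, num_groups, shared)

full_params = count_params(*full)      # 128 * 128 + 128 = 16512
shared_params = count_params(*shared)  # 8 * 8 + 8 = 72
print(len(y), full_params, shared_params)
```

Because the small layer's weights are reused for every group, its parameter count depends only on the group size, not on the number of groups, while the output keeps the original feature dimension.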
