Ultra-Lightweight Speech Separation Via Group Communication

Model size and complexity remain the biggest challenges in deploying speech enhancement and separation systems on low-resource devices such as earphones and hearing aids. Although methods such as compression, distillation and quantization can be applied to large models, they often come at a cost to model performance. In this paper, we present a simple design paradigm for explicitly building ultra-lightweight models without sacrificing performance. Motivated by sub-band frequency-LSTM (F-LSTM) architectures, we introduce group communication (GroupComm), where a feature vector is split into smaller groups and a small processing block performs inter-group communication. Unlike standard F-LSTM models, where the sub-band outputs are concatenated, an ultra-small module is applied to all groups in parallel, which allows a significant reduction in model size. Experimental results show that, compared with a strong baseline model that is already lightweight, GroupComm achieves on-par performance with 35.6 times fewer parameters and 2.3 times fewer operations.
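To make the idea concrete, the following is a minimal PyTorch sketch of a GroupComm-style layer: the feature vector is split into groups, a single small shared module mixes information across the group axis, and every group is processed in parallel with the same parameters. The choice of an LSTM as the inter-group module, the residual connection and the layer normalization are illustrative assumptions, not necessarily the exact configuration used in the paper.

```python
import torch
import torch.nn as nn


class GroupComm(nn.Module):
    """Sketch of a group communication (GroupComm) layer.

    A feature vector of size `dim` is split into `num_groups` groups of
    size dim // num_groups. A small shared module (here a single-layer
    LSTM run across the group axis, an assumption for illustration)
    lets the groups exchange information, and its output is added back
    to each group. Because the same tiny module serves all groups in
    parallel, the parameter count scales with the group size rather
    than with `dim`.
    """

    def __init__(self, dim: int, num_groups: int):
        super().__init__()
        assert dim % num_groups == 0, "dim must be divisible by num_groups"
        self.num_groups = num_groups
        self.group_dim = dim // num_groups
        # Small inter-group communication module shared by all groups.
        self.comm = nn.LSTM(self.group_dim, self.group_dim, batch_first=True)
        self.norm = nn.LayerNorm(self.group_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        B, T, _ = x.shape
        g = x.view(B, T, self.num_groups, self.group_dim)      # split into groups
        g = g.reshape(B * T, self.num_groups, self.group_dim)  # treat groups as a sequence
        comm, _ = self.comm(g)                                  # inter-group communication
        g = self.norm(g + comm)                                 # residual connection + norm
        return g.reshape(B, T, -1)


# Usage example: a 128-dim feature split into 8 groups of 16.
x = torch.randn(2, 100, 128)
out = GroupComm(dim=128, num_groups=8)(x)
print(out.shape)  # torch.Size([2, 100, 128])
```

Note that the LSTM above only ever sees inputs of size dim // num_groups, which is why increasing the number of groups shrinks the per-layer parameter count even though the overall feature dimension is unchanged.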
