Implicit Filter-and-Sum Network for End-to-End Multi-Channel Speech Separation

Various neural network architectures have been proposed in recent years for the task of multi-channel speech separation. Among them, the filter-and-sum network (FaSNet) performs end-to-end time-domain filter-and-sum beamforming and has proven effective in both ad-hoc and fixed microphone array geometries. However, it remains unclear whether such an explicit beamforming operation is a necessary and valid formulation. In this paper, we investigate the beamforming operation and show that it is not necessary. To further improve performance, we replace the explicit waveform-level filter-and-sum operation with an implicit feature-level filter-and-sum operation performed on a context of features. We also propose a feature-level normalized cross correlation (fNCC) feature that better matches the implicit operation and further improves performance. Experimental results on a simulated ad-hoc microphone array dataset show that the proposed modification to FaSNet, which we refer to as the implicit filter-and-sum network (iFaSNet), achieves better performance than the explicit FaSNet with a similar model size and faster training and inference speed.
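
To make the contrast concrete, the minimal NumPy sketch below illustrates the two operations: an explicit waveform-level filter-and-sum (FaSNet-style) and an implicit feature-level filter-and-sum of the kind described above, plus an fNCC-style similarity. The shapes, the element-wise form of the feature-domain filtering, and the `fncc` helper are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def explicit_filter_and_sum(contexts, filters):
    """Explicit waveform-level filter-and-sum (FaSNet-style sketch).

    contexts: (n_mics, context_len) waveform context around one frame
    filters:  (n_mics, filter_len)  estimated per-mic time-domain filters
    Returns one beamformed frame of length context_len - filter_len + 1.
    """
    filtered = [np.convolve(contexts[m], filters[m], mode="valid")
                for m in range(contexts.shape[0])]
    return np.sum(filtered, axis=0)

def implicit_filter_and_sum(features, feature_filters):
    """Implicit feature-level filter-and-sum (illustrative assumption).

    features:        (n_mics, context, feat_dim) encoder features around a frame
    feature_filters: (n_mics, context, feat_dim) estimated feature-domain filters
    'Filtering' is modeled here as an element-wise product in the latent
    space, summed over the feature context; the summation across
    microphones is unchanged from the explicit formulation.
    """
    return (features * feature_filters).sum(axis=(0, 1))

def fncc(ref_feat, mic_feat, eps=1e-8):
    """Feature-level normalized cross correlation (fNCC) sketch:
    cosine similarity between reference-mic and other-mic features.

    ref_feat, mic_feat: (frames, feat_dim)
    Returns one similarity value per frame.
    """
    num = np.sum(ref_feat * mic_feat, axis=-1)
    den = (np.linalg.norm(ref_feat, axis=-1)
           * np.linalg.norm(mic_feat, axis=-1) + eps)
    return num / den
```

In the actual model, the feature-domain filters would be estimated by a separation network conditioned on cross-channel features such as fNCC; the sketch above only fixes the algebraic shape of the two filter-and-sum variants.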
