End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation

An important problem in ad-hoc microphone array speech separation is how to guarantee the robustness of a system with respect to the number and placement of microphones. The former requires the system to be invariant to different indexings of microphones at the same locations, while the latter requires the system to process inputs of varying dimension. Conventional optimization-based beamforming techniques satisfy these requirements by construction, whereas deep learning-based end-to-end systems do not fully address these constraints. In this paper, we propose transform-average-concatenate (TAC), a simple design paradigm for channel-permutation- and channel-number-invariant multi-channel speech separation. Building on the filter-and-sum network (FaSNet), a recently proposed end-to-end time-domain beamforming system, we show that TAC significantly improves separation performance across varying numbers of microphones in noisy, reverberant separation tasks with ad-hoc arrays. Moreover, TAC also significantly improves separation performance with a fixed-geometry array configuration, further demonstrating the effectiveness of the proposed paradigm for the general problem of multi-microphone speech separation.
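The TAC paradigm can be sketched in three symmetric steps: each channel's features pass through a shared transform, the transformed features are averaged across channels (a permutation-invariant pooling that also absorbs a variable channel count), and the pooled global feature is concatenated back onto each channel before a final shared transform. The sketch below is a minimal illustration of this idea in plain numpy with randomly initialized weights; the layer sizes, activation, and module structure are assumptions for demonstration, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def prelu(x, a=0.25):
    # Parametric ReLU with a fixed slope, standing in for a learned activation.
    return np.where(x > 0, x, a * x)

# Hypothetical feature dimensions; the actual FaSNet/TAC sizes differ.
D_in, D_h = 16, 32
W1 = rng.standard_normal((D_in, D_h))      # "transform": shared per-channel weights
W2 = rng.standard_normal((D_h, D_h))       # weights applied to the channel average
W3 = rng.standard_normal((2 * D_h, D_in))  # weights applied after concatenation

def tac(x):
    """Transform-average-concatenate over microphone channels.

    x: array of shape (num_channels, D_in), one feature vector per microphone.
    Returns an array of shape (num_channels, D_in).
    """
    h = prelu(x @ W1)                     # transform: same weights for every channel
    avg = prelu(h.mean(axis=0) @ W2)      # average: order- and count-invariant pooling
    avg = np.broadcast_to(avg, h.shape)   # replicate the global feature per channel
    z = np.concatenate([h, avg], axis=1)  # concatenate: local + global features
    return prelu(z @ W3)                  # map back to per-channel features
```

Because the per-channel weights are shared and the mean is symmetric, permuting the input channels simply permutes the output channels, and the same weights process any number of microphones:

```python
x = rng.standard_normal((4, D_in))        # 4 ad-hoc microphones
y = tac(x)
perm = np.array([2, 0, 3, 1])
assert np.allclose(tac(x[perm]), y[perm])                       # permutation equivariance
assert tac(rng.standard_normal((6, D_in))).shape == (6, D_in)   # channel-number invariance
```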
