Neural Speech Separation Using Spatially Distributed Microphones

This paper proposes a neural network-based speech separation method that uses spatially distributed microphones. Unlike in traditional microphone array settings, neither the number of microphones nor their spatial arrangement is known in advance, which precludes conventional multi-channel speech separation neural networks that rely on fixed-size input. To overcome this, a novel network architecture is proposed that interleaves inter-channel processing layers and temporal processing layers. The inter-channel processing layers apply a self-attention mechanism along the channel dimension to exploit the information obtained from a varying number of microphones. The temporal processing layers are based on a bidirectional long short-term memory (BLSTM) model and are applied to each channel independently. By stacking these two kinds of layers alternately, the proposed network leverages information across both time and space. The network estimates time-frequency (TF) masks for each speaker, which are then used to generate enhanced speech signals by either TF masking or beamforming. Speech recognition experiments show that the proposed method significantly outperforms baseline multi-channel speech separation systems.
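The sketch below illustrates the interleaving idea in PyTorch: self-attention blocks operate along the channel axis (which may differ in length from utterance to utterance), while BLSTM blocks operate along the time axis within each channel, and the two are stacked alternately before a sigmoid mask head. All layer sizes, block counts, class names, and the tensor layout are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of interleaved cross-channel attention and per-channel BLSTM
# layers for mask-based separation. Hyperparameters and names are assumptions.
import torch
import torch.nn as nn


class CrossChannelAttention(nn.Module):
    """Self-attention along the channel axis, applied independently per frame."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, channels, time, dim); the channel count may vary.
        b, c, t, d = x.shape
        # Fold time into the batch so attention runs over the channel axis.
        y = x.permute(0, 2, 1, 3).reshape(b * t, c, d)
        y, _ = self.attn(y, y, y)
        return y.reshape(b, t, c, d).permute(0, 2, 1, 3)


class PerChannelBLSTM(nn.Module):
    """BLSTM over time, with weights shared across channels."""

    def __init__(self, dim):
        super().__init__()
        self.blstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)

    def forward(self, x):
        b, c, t, d = x.shape
        y = x.reshape(b * c, t, d)  # treat each channel as its own sequence
        y, _ = self.blstm(y)
        return y.reshape(b, c, t, d)


class SeparatorSketch(nn.Module):
    def __init__(self, num_freq, dim=256, num_blocks=3, num_speakers=2):
        super().__init__()
        self.proj = nn.Linear(num_freq, dim)
        blocks = []
        for _ in range(num_blocks):  # alternate spatial and temporal layers
            blocks += [CrossChannelAttention(dim), PerChannelBLSTM(dim)]
        self.blocks = nn.ModuleList(blocks)
        self.mask = nn.Linear(dim, num_freq * num_speakers)
        self.num_speakers = num_speakers

    def forward(self, mag):
        # mag: (batch, channels, time, freq) magnitude spectrogram features
        x = self.proj(mag)
        for block in self.blocks:
            x = block(x)
        m = torch.sigmoid(self.mask(x))  # per-speaker TF masks in [0, 1]
        b, c, t, _ = m.shape
        return m.reshape(b, c, t, self.num_speakers, -1)
```

Because self-attention is permutation-equivariant and imposes no fixed sequence length, the same weights handle any number and ordering of microphones; the estimated masks can then be applied directly to the mixture spectrogram or used to derive beamformer statistics, as the abstract describes.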
