Continuous Speech Separation: Dataset and Analysis

This paper describes a dataset and evaluation protocols for continuous speech separation algorithms. Most prior speech separation studies use pre-segmented audio signals, typically generated by artificially mixing utterances so that they fully overlap, and evaluate the separation algorithms with signal-based metrics such as the signal-to-distortion ratio (SDR). However, in natural conversations, speech signals are continuous and contain both overlapped and overlap-free regions, and signal-based metrics correlate only weakly with automatic speech recognition (ASR) accuracy. Not only do these mismatches make it hard to assess the practical relevance of the tested algorithms, they also hinder researchers from developing systems that can be readily applied to real scenarios. In this paper, we define continuous speech separation (CSS) as the task of generating a set of non-overlapped speech signals from a continuous audio stream containing multiple utterances that overlap to varying degrees. A new real-recording dataset, called LibriCSS, is derived from LibriSpeech by concatenating corpus utterances to simulate conversations and capturing the replayed audio with far-field microphones. A Kaldi-based ASR evaluation protocol is established using a well-trained multi-condition acoustic model. A recently proposed speaker-independent CSS algorithm is investigated on LibriCSS. The dataset and evaluation scripts are made available to facilitate research in this direction.
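To make the CSS setting concrete, the sketch below shows one way a partially overlapped continuous stream could be simulated from pre-segmented utterances. It is a minimal illustration only, not the released LibriCSS generation pipeline; the function name simulate_session and the overlap_ratio parameter are hypothetical.

    # Minimal sketch (not the released LibriCSS scripts): build a continuous
    # stream in which each utterance partially overlaps the previous one.
    import numpy as np

    def simulate_session(utterances, overlap_ratio=0.2):
        """Mix 1-D float arrays into one stream with partial overlap.

        Each utterance starts overlap_ratio * len(utt) samples before the
        previous one ends; 0.0 reproduces plain concatenation.
        """
        session = np.zeros(0, dtype=np.float32)
        cursor = 0  # sample index where the previous utterance ended
        for utt in utterances:
            start = max(0, cursor - int(overlap_ratio * len(utt)))
            end = start + len(utt)
            if end > len(session):
                # Grow the output buffer with zeros as the session lengthens.
                session = np.pad(session, (0, end - len(session)))
            session[start:end] += utt.astype(np.float32)
            cursor = end
        return session

    # Example: four 3-second utterances at 16 kHz, 20% pairwise overlap.
    utts = [np.random.randn(3 * 16000) for _ in range(4)]
    stream = simulate_session(utts, overlap_ratio=0.2)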

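The signal-to-distortion ratio mentioned above is, in its scale-invariant form (SI-SDR), a simple energy ratio between a gain-matched reference and the residual error. The sketch below is a minimal implementation under that definition, assuming time-aligned reference and estimate signals; si_sdr is an illustrative name rather than a toolkit API. A high SI-SDR does not guarantee a low word error rate, which is why an ASR-based evaluation protocol is adopted instead.

    # Minimal sketch of scale-invariant SDR (SI-SDR), a common signal-based
    # separation metric; reference and estimate are time-aligned 1-D arrays.
    import numpy as np

    def si_sdr(reference, estimate, eps=1e-8):
        # Project the estimate onto the reference to remove gain ambiguity.
        alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
        target = alpha * reference   # gain-matched reference component
        noise = estimate - target    # residual distortion
        return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))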