Demystifying TasNet: A Dissecting Approach

In recent years, time-domain speech separation has outperformed frequency-domain separation in single-channel, noise-free scenarios. In this paper, we dissect the gains of the time-domain audio separation network (TasNet) by gradually replacing components of a frequency-domain separation system based on utterance-level permutation invariant training (u-PIT) until the TasNet system is reached, thus blending components of frequency-domain and time-domain approaches. Some of the intermediate variants achieve signal-to-distortion ratio (SDR) gains comparable to TasNet while retaining the advantages of frequency-domain processing: compatibility with classic signal processing tools such as frequency-domain beamforming, and human-interpretable masks. Furthermore, we show that the scale-invariant signal-to-distortion ratio (si-SDR) criterion used as the loss function in TasNet is related to a logarithmic mean squared error criterion, and that it is this criterion which contributes most reliably to TasNet's performance advantage. Finally, we critically assess which of the gains observed in a noise-free, single-channel environment generalize to more realistic reverberant conditions.
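To make the stated relation between si-SDR and a logarithmic mean squared error concrete, the following minimal NumPy sketch (illustrative only, not taken from the paper; the function names, the eps stabilizer, and the demo signals are our own) computes si-SDR and then rewrites its negative as a log-MSE normalized by the energy of the optimally scaled target.

    import numpy as np

    def si_sdr(estimate, target, eps=1e-8):
        # Scale-invariant SDR in dB. The target is rescaled by the
        # least-squares gain alpha = <estimate, target> / ||target||^2,
        # so a global rescaling of the estimate leaves the value unchanged.
        alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
        projection = alpha * target          # optimally scaled target
        noise = estimate - projection        # residual error
        return 10.0 * np.log10(np.dot(projection, projection)
                               / (np.dot(noise, noise) + eps) + eps)

    def neg_si_sdr_as_log_mse(estimate, target, eps=1e-8):
        # The same quantity, negated and written as a logarithmic MSE:
        # -si_sdr = 10*log10(||estimate - alpha*target||^2 / ||alpha*target||^2),
        # i.e. the log of the squared error normalized by scaled-target energy.
        alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
        projection = alpha * target
        err = estimate - projection
        return 10.0 * np.log10(np.dot(err, err)
                               / (np.dot(projection, projection) + eps) + eps)

    # Quick check: the two forms agree (up to eps) and are scale invariant.
    rng = np.random.default_rng(0)
    s = rng.standard_normal(16000)
    s_hat = 0.5 * s + 0.05 * rng.standard_normal(16000)
    print(si_sdr(s_hat, s))                  # roughly +20 dB
    print(-neg_si_sdr_as_log_mse(s_hat, s))  # same value, via the log-MSE form

Because the loss depends only on the ratio of the error energy to the scaled-target energy, multiplying the estimate by any nonzero constant leaves it unchanged, which is exactly the property that distinguishes si-SDR from a plain time-domain MSE.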
