Stabilizing Label Assignment for Speech Separation by Self-Supervised Pre-Training

Speech separation has advanced considerably with the highly successful permutation invariant training (PIT) approach, but the frequent label-assignment switching that occurs during PIT training remains a problem when faster convergence and better final performance are desired. In this paper, we propose self-supervised pre-training to stabilize the label assignment when training speech separation models. Experiments over several types of self-supervised approaches, several typical speech separation models, and two different datasets show that substantial improvements are achievable when a suitable self-supervised approach is chosen.
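To make the label-assignment issue concrete, below is a minimal sketch of an utterance-level PIT loss in PyTorch, assuming a fixed number of speakers and SI-SNR as the training objective; the function names (si_snr, pit_loss) and shapes are illustrative, not taken from the paper's implementation. The permutation chosen per utterance is recomputed at every training step, which is exactly the assignment that can switch and destabilize training.

```python
# Minimal sketch of utterance-level permutation invariant training (PIT).
# Assumes estimates/references of shape (batch, n_src, time); names are
# hypothetical, not the paper's actual code.
from itertools import permutations
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR between estimated and reference sources, shape (B, T)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # project the estimate onto the reference, the residual is "noise"
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def pit_loss(estimates, references):
    """Evaluate the loss under every speaker permutation and keep the best one.

    The per-utterance best permutation (the "label assignment") is what can
    switch between steps and slow convergence, motivating stabilization."""
    n_src = references.size(1)
    losses = []
    for perm in permutations(range(n_src)):
        # negative SI-SNR averaged over sources, for this speaker assignment
        l = -torch.stack([si_snr(estimates[:, i], references[:, p])
                          for i, p in enumerate(perm)], dim=1).mean(dim=1)
        losses.append(l)
    losses = torch.stack(losses, dim=1)   # (B, n_perms)
    best, idx = losses.min(dim=1)         # best permutation per utterance
    return best.mean(), idx
```

Self-supervised pre-training of the separator's encoder, as proposed here, tends to make the best permutation consistent across steps early in training, so the assignment returned in idx stops oscillating.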
