Mixup-Breakdown: A Consistency Training Method for Improving Generalization of Speech Separation Models

Deep-learning-based speech separation models suffer from poor generalization: even state-of-the-art models can fail abruptly when evaluated under mismatched conditions. To address this problem, we propose an easy-to-implement yet effective consistency-based semi-supervised learning (SSL) approach, namely Mixup-Breakdown training (MBT). MBT learns a teacher model to "break down" unlabeled mixture inputs, then interpolates the estimated separations to produce more useful pseudo "mixup" input-output pairs, to which consistency regularization is applied to learn a student model. In our experiments, we evaluate MBT under conditions with increasing degrees of mismatch, including unseen interfering speech, noise, and music, and compare its generalization capability against state-of-the-art supervised learning and SSL approaches. The results indicate that MBT significantly outperforms several strong baselines, with up to a 13.77% relative improvement in SI-SNRi. Moreover, MBT adds only negligible computational overhead to standard training schemes.
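To make the teacher-student loop concrete, below is a minimal, hypothetical PyTorch sketch of one MBT step. It assumes a two-speaker time-domain separator, a teacher updated as an exponential moving average (EMA) of the student, a uniformly sampled interpolation weight, and a negative-SI-SNR consistency loss; the model, function names, and hyperparameters are illustrative assumptions, not the authors' released code, and permutation-invariant loss handling is omitted for brevity.

```python
# Minimal sketch of one Mixup-Breakdown training (MBT) step on unlabeled data.
import copy
import torch
import torch.nn as nn

class TinySeparator(nn.Module):
    """Placeholder stand-in for a real separator (e.g., a TasNet-style model)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 2, kernel_size=3, padding=1)

    def forward(self, mixture):                 # mixture: (batch, samples)
        return self.net(mixture.unsqueeze(1))   # -> (batch, 2, samples)

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR, averaged over batch and sources."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    si_snr = 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()

student = TinySeparator()
teacher = copy.deepcopy(student)                # teacher is an EMA copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def mbt_step(unlabeled_mix, ema_decay=0.999):
    # 1) "Breakdown": the teacher separates the unlabeled mixture.
    with torch.no_grad():
        est = teacher(unlabeled_mix)            # (batch, 2, samples)
    # 2) "Mixup": re-mix the estimated sources with a random weight
    #    (uniform lambda here; an assumption, mixup commonly uses Beta).
    lam = torch.rand(unlabeled_mix.size(0), 1)
    remix = lam * est[:, 0] + (1 - lam) * est[:, 1]
    target = torch.stack([lam * est[:, 0], (1 - lam) * est[:, 1]], dim=1)
    # 3) Consistency: the student should recover the scaled teacher sources.
    loss = si_snr_loss(student(remix), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # 4) Update the teacher as an exponential moving average of the student.
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(ema_decay).add_(s, alpha=1 - ema_decay)
    return loss.item()

print(mbt_step(torch.randn(4, 16000)))          # one step on four 1-second clips
```

Because the only extra work beyond standard Mean-Teacher-style training is one teacher forward pass and a re-mixing of its outputs, this loop illustrates why MBT adds negligible overhead to a standard training scheme; in practice it would be interleaved with ordinary supervised steps on labeled mixtures.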
