Ultra Fast Speech Separation Model with Teacher Student Learning

The Transformer has recently been applied to speech separation with great success, thanks to the strong long-range dependency modeling capacity of its self-attention mechanism. However, Transformer models tend to incur heavy run-time costs because of their deep encoder stacks, which hinders deployment on edge devices. A small Transformer with fewer encoder layers is preferable for computational efficiency, but it is prone to performance degradation. In this paper, an ultra fast speech separation Transformer model is proposed that achieves both better performance and higher efficiency through teacher-student learning (T-S learning). We introduce layerwise T-S learning and an objective shifting mechanism to guide the small student model in learning intermediate representations from the large teacher model. Compared with a small Transformer model trained from scratch, the proposed T-S learning method reduces the word error rate (WER) by more than 5% for both multi-channel and single-channel speech separation on the LibriCSS dataset. Utilizing more unlabeled speech data, our ultra fast speech separation models achieve a relative WER reduction of more than 10%.
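
To make the two key ingredients concrete, the following is a minimal PyTorch sketch of how layerwise T-S learning could be combined with objective shifting during student training. Everything here is an illustrative assumption rather than the paper's exact formulation: the function names (layerwise_ts_loss, objective_shifting_weight), the uniform student-to-teacher layer mapping, the MSE matching loss, and the linear annealing schedule are all stand-ins; the separation loss is likewise assumed to be supplied externally (e.g., a permutation-invariant training loss).

```python
import torch
import torch.nn.functional as F

def layerwise_ts_loss(student_hiddens, teacher_hiddens):
    """Layerwise T-S loss: each student layer mimics a mapped teacher layer.

    Assumes (1) the teacher depth is an integer multiple of the student depth,
    so student layer i is paired with teacher layer (i + 1) * ratio - 1
    (a common uniform mapping, not necessarily the paper's), and (2) teacher
    and student share the same hidden size, since only depth is reduced.
    """
    ratio = len(teacher_hiddens) // len(student_hiddens)
    loss = 0.0
    for i, h_s in enumerate(student_hiddens):
        # Detach so no gradient flows back into the (frozen) teacher.
        h_t = teacher_hiddens[(i + 1) * ratio - 1].detach()
        loss = loss + F.mse_loss(h_s, h_t)
    return loss / len(student_hiddens)

def objective_shifting_weight(step, total_steps):
    """Anneal from pure distillation (weight 1) to pure task loss (weight 0).

    A linear schedule is assumed purely for illustration; sigmoid or
    exponential decay would be equally plausible choices.
    """
    return max(0.0, 1.0 - step / total_steps)

def training_loss(step, total_steps,
                  student_hiddens, teacher_hiddens, separation_loss):
    """Interpolate between the layerwise T-S loss and the separation loss."""
    lam = objective_shifting_weight(step, total_steps)
    distill = layerwise_ts_loss(student_hiddens, teacher_hiddens)
    return lam * distill + (1.0 - lam) * separation_loss
```

Under this reading, the student first focuses on imitating the teacher's intermediate representations, and the objective gradually shifts toward optimizing the separation criterion directly as training progresses.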
