Phoneme-Based Distribution Regularization for Speech Enhancement

Existing speech enhancement methods mainly separate speech from noise at the signal level or in the time-frequency domain, and seldom exploit the semantic content of the corrupted signal. In this paper, we aim to bridge this gap by extracting phoneme identities to aid speech enhancement. Specifically, we propose phoneme-based distribution regularization (PbDr) for speech enhancement, which incorporates frame-wise phoneme information into the speech enhancement network in a conditional manner. Since different phonemes give rise to different feature distributions across frequency, we propose to learn a parameter pair, i.e., a scale and a bias, from a phoneme classification vector to modulate the speech enhancement network. The modulation parameter pair provides not only frame-wise but also frequency-wise conditions, which effectively map features to phoneme-related distributions. In this way, we explicitly regularize speech enhancement features with recognition vectors. Experiments on public datasets demonstrate that the proposed PbDr module boosts not only the perceptual quality of the enhanced speech but also the recognition accuracy of an ASR system on it. The PbDr module can also be readily incorporated into other speech enhancement networks.
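The frame-wise, frequency-wise scale-and-bias conditioning described above resembles feature-wise linear modulation. Below is a minimal PyTorch sketch of how such a conditioning block could look; the module name PbDrModulation, the two linear projections, and all dimensionalities are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of phoneme-conditioned feature modulation (FiLM-style).
# Assumptions (not from the paper): a frame-wise phoneme posterior vector
# conditions the enhancement features via a learned per-frame,
# per-frequency scale (gamma) and bias (beta).
import torch
import torch.nn as nn

class PbDrModulation(nn.Module):
    def __init__(self, num_phonemes: int, num_freq_bins: int):
        super().__init__()
        # Project the phoneme classification vector to per-frequency
        # modulation parameters.
        self.to_scale = nn.Linear(num_phonemes, num_freq_bins)
        self.to_bias = nn.Linear(num_phonemes, num_freq_bins)

    def forward(self, features: torch.Tensor,
                phoneme_probs: torch.Tensor) -> torch.Tensor:
        # features:      (batch, time, freq) enhancement-network activations
        # phoneme_probs: (batch, time, num_phonemes) frame-wise posteriors
        gamma = self.to_scale(phoneme_probs)  # (batch, time, freq)
        beta = self.to_bias(phoneme_probs)    # (batch, time, freq)
        # Scale and shift each frame's features toward a
        # phoneme-dependent distribution.
        return gamma * features + beta

# Usage example with hypothetical sizes (40 phonemes, 257 STFT bins).
if __name__ == "__main__":
    mod = PbDrModulation(num_phonemes=40, num_freq_bins=257)
    feats = torch.randn(2, 100, 257)  # two utterances, 100 frames each
    probs = torch.softmax(torch.randn(2, 100, 40), dim=-1)
    out = mod(feats, probs)
    print(out.shape)  # torch.Size([2, 100, 257])
```

Because the block only rescales and shifts existing activations, it can in principle be dropped between layers of other enhancement networks, consistent with the portability claim in the abstract.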
