Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information

Previous studies have confirmed that augmenting acoustic features with place/manner-of-articulation features can guide the speech enhancement (SE) process to consider the broad phonetic properties of the input speech, thereby improving performance. In this paper, we explore the contextual information of articulatory attributes as an additional cue to further benefit SE. More specifically, we propose to improve SE performance by leveraging losses from an end-to-end automatic speech recognition (E2E-ASR) model that predicts the sequence of broad phonetic classes (BPCs). We also develop a multi-objective training scheme that combines ASR and perceptual losses to train the SE system based on a BPC-based E2E-ASR. Experimental results from speech denoising, speech dereverberation, and impaired speech enhancement tasks confirmed that contextual BPC information improves SE performance. Moreover, the SE model trained with the BPC-based E2E-ASR outperforms the one trained with a phoneme-based E2E-ASR. The results suggest that phoneme misclassifications by the ASR system may provide imperfect feedback to the SE objective, and that BPCs are a potentially better choice. Finally, we note that merging the most confusable phonetic targets into the same BPC when computing the additional objective effectively improves SE performance.
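
To make the BPC idea concrete, below is a minimal Python sketch of a phoneme-to-BPC lookup, assuming a common TIMIT-style grouping into vowels/glides, stops, fricatives, nasals, and silence; the exact grouping used in the paper, in particular which confusable phonemes are merged, may differ.

# A minimal sketch of a phoneme-to-BPC mapping, assuming a common
# TIMIT-style five-class grouping; the paper's exact grouping may differ.
BPC_MAP = {
    # vowels and semivowels/glides
    "iy": "vowel", "ih": "vowel", "eh": "vowel", "ae": "vowel",
    "aa": "vowel", "ah": "vowel", "uw": "vowel", "uh": "vowel",
    "er": "vowel", "ay": "vowel", "oy": "vowel", "ey": "vowel",
    "aw": "vowel", "ow": "vowel", "l": "vowel", "r": "vowel",
    "y": "vowel", "w": "vowel",
    # stops (plosives) and affricates
    "b": "stop", "d": "stop", "g": "stop", "p": "stop",
    "t": "stop", "k": "stop", "jh": "stop", "ch": "stop",
    # fricatives
    "s": "fricative", "sh": "fricative", "z": "fricative", "zh": "fricative",
    "f": "fricative", "th": "fricative", "v": "fricative", "dh": "fricative",
    "hh": "fricative",
    # nasals
    "m": "nasal", "n": "nasal", "ng": "nasal",
    # silence / closures
    "sil": "silence",
}

def phones_to_bpc(phones):
    """Collapse a phoneme sequence to its broad-phonetic-class sequence."""
    return [BPC_MAP.get(p, "silence") for p in phones]

print(phones_to_bpc(["sh", "iy", "hh", "ae", "d"]))
# ['fricative', 'vowel', 'fricative', 'vowel', 'stop']

Because confusable phonemes (e.g., voiced/unvoiced pairs within the same class) collapse to one label, an ASR error between them no longer penalizes the SE model, which is the intuition behind the imperfect-feedback argument above.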

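The multi-objective training can likewise be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions: the module names (SEModel, BPCRecognizer) and the loss weight are illustrative inventions, a simple frame-level BPC classifier stands in for the full sequence-level E2E-ASR, and the perceptual loss term is omitted for brevity.

# A minimal PyTorch sketch of multi-objective SE training: the SE model is
# updated with a reconstruction loss on the enhanced features plus an ASR
# loss computed by a frozen BPC recognizer on the enhanced output. All
# names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class SEModel(nn.Module):
    """Toy mask-based enhancement model over magnitude features."""
    def __init__(self, dim=257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                 nn.Linear(256, dim), nn.Sigmoid())

    def forward(self, noisy):               # (batch, frames, dim)
        return noisy * self.net(noisy)      # masked spectrogram

class BPCRecognizer(nn.Module):
    """Toy frame-level BPC classifier standing in for the E2E-ASR model."""
    def __init__(self, dim=257, n_bpc=5):
        super().__init__()
        self.rnn = nn.LSTM(dim, 128, batch_first=True)
        self.out = nn.Linear(128, n_bpc)

    def forward(self, feats):
        h, _ = self.rnn(feats)
        return self.out(h)                  # (batch, frames, n_bpc) logits

se, asr = SEModel(), BPCRecognizer()
asr.requires_grad_(False)                   # ASR frozen; only SE is trained
opt = torch.optim.Adam(se.parameters(), lr=1e-4)

noisy = torch.rand(8, 100, 257)             # dummy noisy features
clean = torch.rand(8, 100, 257)             # dummy clean targets
bpc_labels = torch.randint(0, 5, (8, 100))  # dummy frame-level BPC targets

enhanced = se(noisy)
l_se = nn.functional.mse_loss(enhanced, clean)        # SE objective
logits = asr(enhanced)
l_asr = nn.functional.cross_entropy(                  # BPC-ASR objective
    logits.reshape(-1, 5), bpc_labels.reshape(-1))
loss = l_se + 0.1 * l_asr                   # weight 0.1 is an assumed value
opt.zero_grad()
loss.backward()                             # gradients flow through the
opt.step()                                  # frozen ASR into the SE model

Note that freezing the recognizer's parameters does not block gradients from flowing through it, so the ASR loss still shapes the SE model's updates; that is the mechanism by which the BPC objective guides enhancement.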