Investigation of a Single-Channel Frequency-Domain Speech Enhancement Network to Improve End-to-End Bengali Automatic Speech Recognition Under Unseen Noisy Conditions

Most single-channel frequency-domain speech enhancement (SE) approaches remain problematic for downstream automatic speech recognition (ASR) because of the distortion they introduce, even when they achieve satisfactory improvements in speech quality and intelligibility. Recently, transformer-based models have shown strong performance in speech processing tasks. We therefore explore a transformer-based SE model fine-tuned through a two-stage training scheme: the model is first pre-trained with a feature-level optimization criterion (SE loss), and then fine-tuned with an ASR-oriented criterion that combines the SE loss with the ASR loss of a pre-trained end-to-end ASR model. We evaluate the proposed approach on the low-resource Bengali language, which has received far less attention in both SE and ASR research than resource-rich languages such as English or Mandarin. Experimental results show that the approach improves both SE and ASR performance under severe unseen noisy conditions, and that its performance is reasonably good compared with other state-of-the-art SE methods.
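The two-stage scheme described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the loss functions, the interpolation weight `alpha`, and all names are assumptions introduced here for clarity.

```python
# Minimal sketch of the two-stage SE training scheme.
# All names and the weight alpha are illustrative placeholders,
# not taken from the paper.

def se_loss(enhanced, clean):
    """Stage 1 feature-level SE criterion: mean squared error between
    enhanced and clean spectral features (a common choice for SE)."""
    assert len(enhanced) == len(clean)
    return sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)

def joint_loss(loss_se, loss_asr, alpha=0.5):
    """Stage 2 ASR-oriented criterion: a weighted sum of the SE loss and
    the ASR loss produced by a frozen pre-trained end-to-end ASR model.
    alpha is an assumed interpolation weight."""
    return alpha * loss_se + (1.0 - alpha) * loss_asr
```

In stage 1 only `se_loss` is minimized; in stage 2, gradients from `joint_loss` flow back into the SE front-end so that enhancement is optimized for recognition accuracy as well as signal fidelity.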
