Multiple-Hypothesis CTC-Based Semi-Supervised Adaptation of End-to-End Speech Recognition

This paper proposes an adaptation method for end-to-end speech recognition in which multiple automatic speech recognition (ASR) 1-best hypotheses are integrated into the computation of the connectionist temporal classification (CTC) loss function. Integrating multiple ASR hypotheses helps alleviate the impact of hypothesis errors on the CTC loss when these hypotheses are used as training labels. In semi-supervised adaptation scenarios, where part of the adaptation data is unlabeled, the CTC loss of the proposed method is computed from different ASR 1-best hypotheses obtained by decoding the unlabeled adaptation data. Experiments are performed in clean and multi-condition training scenarios, in which the CTC-based end-to-end ASR systems are trained on Wall Street Journal (WSJ) clean training data and CHiME-4 multi-condition training data, respectively, and tested on Aurora-4 test data. The proposed adaptation method yields 6.6% and 5.8% relative word error rate (WER) reductions in the clean and multi-condition training scenarios, respectively, compared to a baseline system adapted by back-propagation fine-tuning on only the portion of the adaptation data that has manual transcriptions.
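To make the integration concrete, the following is a minimal PyTorch sketch of one plausible way to fold several 1-best hypotheses into the CTC objective: compute a CTC loss against each hypothesis and combine the losses as a weighted sum. The helper name multi_hypothesis_ctc_loss and the weighted-sum combination are illustrative assumptions, not the paper's exact integration scheme.

```python
import torch
import torch.nn.functional as F

def multi_hypothesis_ctc_loss(log_probs, hypotheses, input_lengths, weights=None):
    """Combine CTC losses computed against several ASR 1-best hypotheses.

    log_probs:     (T, N, C) log-softmax outputs of the acoustic model.
    hypotheses:    list of (targets, target_lengths) pairs, one pair per
                   decoding pass over the unlabeled adaptation data.
    input_lengths: (N,) lengths of the input feature sequences.
    weights:       optional per-hypothesis weights; uniform if omitted
                   (an assumption; decoder confidence scores could be
                   used instead).
    """
    if weights is None:
        weights = [1.0 / len(hypotheses)] * len(hypotheses)
    total = 0.0
    for w, (targets, target_lengths) in zip(weights, hypotheses):
        # Standard CTC loss against this hypothesis used as the label sequence.
        loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                          blank=0, reduction='mean', zero_infinity=True)
        total = total + w * loss
    return total
```

With uniform weights this reduces to averaging the per-hypothesis losses; weighting by decoder confidence would instead let less reliable hypotheses contribute less to the adaptation gradient.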
